For the class of replacement algorithms known as stack algorithms, existing analysis techniques permit the computation of memory miss ratios for all memory sizes simultaneously in one pass over a memory reference string. We extend the class of computations possible by this methodology in two ways. First, we show how to compute the effects of copy-backs in write-back caches. The key observation here is that a given block is clean for all memory sizes less than or equal to C blocks and is dirty for all larger memory sizes. Our technique permits efficient computations for algorithms or systems using periodic write-back and/or block deletion. The second extension permits stack analysis simulation for sector (or subblock) caches in which a sector (associated with an address tag) consists of subsectors (or subblocks) that can be loaded independently.
INTRODUCTION
Analysis of memory system performance is one of the most important aspects of computer system design. Frequently this analysis is done using the technique of trace-driven simulation, in which a trace of memory references from a similar system is used as input to a simulation of the system under study. This technique has been applied to all levels of the memory hierarchy from microprocessor caches to file system design. Trace-driven simulation became much more practical with the discovery by Mattson, Gecsei, Slutz, and Traiger [22] that, for certain replacement policies, the performance of all memory sizes could be determined with a single pass through the trace file. Their stack analysis technique relies on the inclusion property of these policies, such that the contents of any size memory is a subset of the contents of any larger memory. Thus, the contents of all memories can be represented by a stack, where the top k items in the stack are the contents of a memory of size k. Policies that obey this property are known as stack algorithms. An equivalent characteristic of stack algorithms is that each possesses a total priority ordering of all blocks at any instant in time that is independent of memory size.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1989 ACM 0734-2071/89/0200-0078 $01.50
Until now, stack analysis has not been applied to some important situations, forcing the designer to fall back on the one-size-at-a-time method. One example of this is the write-back policy, also called copy-back, where a write to a block causes the block to be marked "dirty" in the cache, but the write to secondary storage is delayed until some later time. Write-back is particularly desirable where memory bandwidth may be a limiting resource, such as in a shared-bus multiprocessor or network file system. The alternative policy, write-through or store-through, in which all modifications go directly to secondary storage, severely restricts the performance improvement due to caching. The consideration of writes is important even with write-back since, in many cases, a write can cause twice the memory accesses of a read; one to fetch the block prior to modification and another to rewrite the block (copy-back). Discussions of stack analysis, however, have either ignored writes altogether, or considered only write-through.
The problem with stack analysis of write-back is that it appears to violate the inclusion property. For example, suppose that a dirty block at level k in the stack is read. It must come to the top of the stack, but it is now clean for some sizes and dirty for others. We show that by maintaining a "dirty level" for each block, the stack analysis technique can be extended to analyze write-back. This dirty level is the smallest cache in which the block is dirty, or equivalently, the lowest level in the stack to which the block has been pushed since its last write, or infinity if the block has never been written. The dirty level is used to count the number of write-backs for each cache size by assuming that each write results in a write-back, then waiting to see where the write is avoided. If a dirty block is rewritten, then the previous write has been avoided in all dirty sizes since both the previous and current write can be written back with one memory access.
Stack analysis can be similarly extended to analyze sector or subblock caches [14, 21]. In a microprocessor cache, access time, on-chip data path widths, and pin-counts favor small blocks, whereas the size of associative lookup circuitry and tags favor fewer, larger blocks. A possible compromise is to break each block into independent subblocks, any or all of which may be present. Again, we show that stack analysis can be applied by maintaining additional data with each block.
The remainder of this section is a review of previous stack analysis techniques and definition of terms. Section 2 presents the stack algorithm for write-back.

Replacement policies include the Least Recently Used (LRU) policy, First-In First-Out (FIFO), Least Frequently Used (LFU), and Random (RAND). An optimal policy, MIN, exists, but is unrealizable in practice because it requires knowledge of the future [22]. The MIN policy does not consider writes or deletes and is known to be nonoptimal if writes are considered [38].
Write Policy. The write policy determines when a modification is presented to secondary storage. Writes may always go directly to secondary storage using the write-through or store-through policy. Alternatively, the write may go to the cache to be written at some later time, usually when the block is about to be replaced, using the write-back or copy-back policy. Write-back is motivated by the expectation that the block will be modified several times before it has to be written. Clearly, write-back can never cause more accesses than write-through and usually far fewer. On the other hand, since it deals in blocks rather than words, write-back may increase the number of bytes written. In addition, dirty blocks may remain in the cache for a long time, leading to reliability issues in large volatile caches such as file system caches in main memory. The decrease in memory traffic from write-back makes it very valuable in systems with limited memory bandwidth such as shared-bus multiprocessor systems. Write-back is also desirable in file system caches because many files are temporary and may never have to be written.
Write Allocate. When a written block is not present in a write-through cache, the block may be inserted in the cache (write allocate) or the cache may be bypassed altogether. Write allocate is again motivated by locality, that is, the expectation that the written block will soon be referenced again. A write-back cache always allocates a cache block to the written block.
Write-Fetch. If write allocate is used by a cache where partial-block modification is allowed, and the block to be written is not in the cache (a write miss), then it is usually necessary to fetch the block prior to modifying it. This write-fetch is needed, for example, if one word of a multiword block is being written. The alternative is to keep track of the portion(s) of the cache block that are "valid," which becomes costly when several disjoint portions of a large block are written. However, there are situations in which write-fetch can be avoided such as when the entire block is being overwritten, or when the contents of the rest of the block are predictable (e.g., when the block is a "new" block in a file system).
Prefetch. Because of spatial locality, a reference to a block often implies that the next block will soon be referenced. It is possible to take advantage of this anticipated reference and to prefetch the next block in advance. This reduces the delay when the next block is actually referenced. Prefetch is advantageous when it can be overlapped with processing of other references or when two or more blocks can be fetched in much less time than all of them individually, as is the case with disk secondary storage. Although it reduces the delay, prefetch increases memory traffic unless all prefetched blocks are referenced before they are replaced. It may also result in memory pollution in which a soon-to-be-referenced block is displaced to make room to prefetch an unnecessary block [30]. If a prefetch is only permitted in conjunction with a fetch, then the policy is a demand prefetch policy. Demand prefetch is desirable when the overhead of a fault is large; demand prefetch amortizes this over two (or more) blocks. With modern memory systems and file system caches, it is simple and inexpensive to initiate a prefetch even if the referenced block is present.

82 · J. G. Thompson and A. J. Smith
Metrics
The performance of a memory system can be measured in several ways. Perhaps the most widely used is the miss ratio, which is the fraction of references that were not satisfied by the cache. Conversely, the hit ratio is the fraction that were satisfied by the cache. The miss ratio is a latency metric since it determines the apparent access time of the memory system. The effective access time for any multilevel hierarchy is given by $\sum_i t_i h_i$, where $t_i$ is the access time to the $i$th level, and $h_i$ is the fraction of references satisfied by the $i$th level cache. Sometimes overlooked is the fact that the access time to each level should include any queuing delays. These are usually negligible in a single-processor system, but may become important when several processors compete for access to a single secondary store [1].

The actual computation of miss ratios during simulation varies with the parameters of the cache. Let $N$ be the total number of references, and $m(C)$ be the number of misses to a cache of size $C$. If all references are assumed to be reads, then the miss ratio for a cache of size $C$ is given by

$$MR(C) = \frac{m(C)}{N}, \tag{1.1}$$

hence the name.
With write-through, where every write is a "miss" (i.e., causes an access to secondary storage), the miss ratio is

$$MR_{WT}(C) = \frac{m_r(C) + W}{N}, \tag{1.2}$$

where $m_r(C)$ is the number of reads that "miss," and $W$ is the number of write references. When write-back is used, a write could result in two accesses to secondary storage, one to fetch the block and another later to write it. The miss ratio is now given by

$$MR_{WB}(C) = \frac{m_r(C) + m_w(C) + dp(C)}{N}, \tag{1.3}$$

where $m_w(C)$ is the number of write misses (i.e., write fetches), and $dp(C)$ is the number of dirty blocks "pushed" from a cache of size $C$. This becomes

$$MR_{WB}(C) = \frac{m(C) + dp(C)}{N} \tag{1.4}$$

by using the fact that a write-fetch is actually just a read reference and occurs if the block reference "misses." All of the expressions so far assume that the processor must wait for the write to secondary storage to complete before continuing.
It is often reasonable to buffer the writes so that the processor can continue almost immediately. In this case delay occurs only if there are enough accesses to create contention. In [31] it is observed that when memory bandwidth is adequate, four store-through buffers are sufficient to largely eliminate queuing for writes. Under this assumption, the write-back miss ratio with write-fetch is again simply

$$MR_{WF}(C) = \frac{m(C)}{N}. \tag{1.5}$$

A related metric is the traffic ratio, which is the ratio of traffic between cache and secondary storage, measured in bytes, compared to the traffic that would be present without a cache [14]. The traffic ratio is increasingly important for analyzing shared-bus systems such as multiprocessor architectures or a network file system. Although buffering may eliminate write-back from consideration in the miss ratio, the write traffic is not eliminated, so writes must be considered in the traffic ratio. Also, prefetch may result in increased traffic since some prefetched blocks may not be actually referenced.
The traffic ratio is dependent on the same factors as the miss ratio and, in addition, depends on the size of the data blocks transferred. Suppose that the processor accesses $B_p$ bytes per average memory reference. The traffic without a cache is then $B_p$ times the number of references. Frequently, the cache block size, $B_c$, is larger than $B_p$. We assume that each cache miss causes $B_c$ bytes to be transferred. Then a large cache block size may act as a form of prefetch and reduce the miss ratio, but it may also increase the amount of traffic.
The general form of the traffic ratio computation is

$$TR(C, B_c) = \frac{[m_r(C) + m_w(C) + f(C) + dp(C)] \cdot B_c}{N \cdot B_p}, \tag{1.6}$$

where $m_w(C)$ is again the number of write misses; $dp(C)$ is again the number of write-backs; and $f(C)$ is the number of prefetched blocks. This expression assumes that write-fetch is used. Notice that the traffic ratio is identical to the miss ratio when there is no prefetching, no write buffering, and the cache block size is the same as $B_p$.
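As a worked illustration of the miss-ratio and traffic-ratio expressions in this section, the following Python sketch evaluates them from event counts for a single cache size. The `Counts` container, the function names, and all of the numbers are hypothetical and chosen only for illustration; write-fetch is assumed, as in the text.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Event counts for one cache size C (illustrative names, not from the paper)."""
    n_refs: int          # N: total number of references
    read_misses: int     # m_r(C): reads that miss
    write_misses: int    # m_w(C): write misses, i.e. write-fetches
    writes: int          # W: total write references
    dirty_pushes: int    # dp(C): dirty blocks pushed (copy-backs)
    prefetches: int = 0  # f(C): prefetched blocks


def miss_ratio_write_through(c: Counts) -> float:
    # Every write goes to secondary storage, so all W writes count as misses.
    return (c.read_misses + c.writes) / c.n_refs


def miss_ratio_write_back(c: Counts) -> float:
    # Fetches for read and write misses, plus one access per dirty push.
    return (c.read_misses + c.write_misses + c.dirty_pushes) / c.n_refs


def miss_ratio_buffered(c: Counts) -> float:
    # With write buffering the copy-backs are hidden; only fetches remain.
    return (c.read_misses + c.write_misses) / c.n_refs


def traffic_ratio(c: Counts, block_bytes: int, ref_bytes: int) -> float:
    # Bytes moved with the cache divided by bytes moved without it.
    blocks = c.read_misses + c.write_misses + c.prefetches + c.dirty_pushes
    return blocks * block_bytes / (c.n_refs * ref_bytes)


counts = Counts(n_refs=1000, read_misses=40, write_misses=10,
                writes=300, dirty_pushes=25)
print(miss_ratio_write_through(counts))   # (40 + 300) / 1000
print(miss_ratio_write_back(counts))      # (40 + 10 + 25) / 1000
print(miss_ratio_buffered(counts))        # (40 + 10) / 1000
print(traffic_ratio(counts, block_bytes=16, ref_bytes=4))
```

With `block_bytes == ref_bytes` and no prefetches, `traffic_ratio` reduces to `miss_ratio_write_back`, which matches the remark that the two metrics coincide in that case.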
A third metric is the transfer ratio, which is the ratio of secondary storage accesses with and without cache [30]. This metric has also been called the transaction ratio [G. Gibson, personal communication 1986], the I/O ratio [18], and the swapping ratio [19]. The transfer ratio is similar to the traffic ratio but is more appropriate when performance is dominated by the cost of a memory access, relatively independent of the number of bytes transferred. Thus it is appropriate for disk caches and often for networks using small (1K or less) messages. For example, the transfer ratio decreases if two blocks are read from disk in a single I/O, whereas the traffic ratio is the same regardless of the number of I/Os used to transfer the data. The transfer ratio also has an indirect effect on the access time if there are enough transfers to create contention, particularly in multiple processor systems with shared memory. Assuming that prefetches occur only when the referenced block is not in cache (demand prefetch), then they do not affect the transfer ratio. A general expression for the transfer ratio is
$$XR(C) = \frac{m_r(C) + m_w(C) + dp(C)}{N}, \tag{1.7}$$

which is almost proportional to the traffic ratio using constant block sizes.

One common method for calculating these metrics is to use trace-driven simulation. Memory references are gathered from a system believed to be similar to the system being modeled. These references are then used to drive a simulation of the system under study with varying design parameters. To the extent that the traces apply to the modeled system, simulation is a relatively simple way to observe the effect of changes to the memory hierarchy. Unfortunately, it could take a large number of simulations if only a single combination of memory sizes could be simulated at a time.
In a classic paper, Mattson et al. showed that for certain replacement policies the miss ratios for all cache sizes could be calculated in a single pass over the reference trace [22]. These policies are collectively known as stack algorithms. The technique depends on the inclusion property of these policies; the contents of any size cache includes (i.e., is a superset of) the contents of any smaller cache. Thus the cache at any time can be represented as a stack, with the upper k elements of the stack representing the blocks present in a cache of size k. The current stack level of any block is therefore the minimum cache size for which the block is resident. If a block is referenced while at level k, it is a "hit," and therefore resident, for all sizes k and larger. The level at which the block is found is referred to as its stack distance; see Figure 1. Using stack analysis, it is possible to compute the miss ratio of Equation (1.1) for all sizes by recording the hits to each level. The miss ratio for a cache size $C$ is

$$MR(C) = 1 - \frac{1}{N} \sum_{i=1}^{C} hits(i), \tag{1.8}$$

where $N$ is again the total number of references. Notice that, since $hits(i)$ is never negative, this is a nonincreasing function of cache size. All stack algorithms possess this characteristic, whereas nonstack algorithms may show points at which performance declines with increased cache size [4].
The simplest example of a stack algorithm is the Least Recently Used (LRU) policy. The stack always contains the blocks in order of last reference, with the most recently referenced block on the top. For any cache size C, the LRU block for that cache size is the block at level C in the stack. When a block at level k is referenced, it is not in any cache smaller than k, and therefore it must be fetched. The block that must be removed from any cache of size j, j smaller than k, is the block at level j. The stack is updated by simply "pulling" the referenced block out of the stack and placing it on top. All blocks down to level k are effectively "pushed" down one level. Since the referenced block was in all caches k or larger, all blocks below level k remain unchanged. Figure 1 illustrates these operations for the case where the referenced block is at stack level 4 and the case in which the block is not currently in the stack.
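The LRU update and the per-level hit counts described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function names and the sample trace are hypothetical, and a plain list is used for the stack for clarity rather than speed.

```python
def lru_hits_by_level(trace):
    """Return {level: hits}, where level is the 1-based LRU stack distance."""
    stack = []  # stack[0] is the top (most recently referenced block)
    hits = {}
    for block in trace:
        if block in stack:
            level = stack.index(block) + 1       # stack distance of this reference
            hits[level] = hits.get(level, 0) + 1
            stack.remove(block)                  # "pull" the block out of the stack
        stack.insert(0, block)                   # place it on top; others shift down
    return hits


def miss_ratios(trace, max_size):
    """Miss ratio for every cache size 1..max_size from one pass over the trace."""
    hits = lru_hits_by_level(trace)
    n = len(trace)
    ratios, cumulative = {}, 0
    for size in range(1, max_size + 1):
        cumulative += hits.get(size, 0)          # hits at levels <= size
        ratios[size] = (n - cumulative) / n
    return ratios


trace = list("ABABCABC")
print(lru_hits_by_level(trace))  # {2: 2, 3: 3}
print(miss_ratios(trace, 4))     # {1: 1.0, 2: 0.75, 3: 0.375, 4: 0.375}
```

A single pass yields the miss ratio for every size at once, and the result is nonincreasing in cache size, as expected for a stack algorithm.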
More generally, Mattson et al. [22] showed that any stack algorithm possesses a "priority function" which imposes a total ordering on all blocks at any given time, independent of cache size. Notice that LRU imposes such an ordering based on the time of last reference. However, in the more general case the relative priority of two blocks may change without either of them being referenced. (See, for example, Figure 4 where the relative positions of blocks A and B reverse between times 5 and 6.) It is no longer the case that the block at level j is necessarily the one to be pushed from that size cache. This complicates the stack update procedure, but only slightly. The stack can still be updated in a single pass that is similar to one pass of a bubble sort. A single comparison at each level determines the new block at the level and the block pushed from the level. First, the referenced block is still pulled to the top of the stack since it must become resident in all cache sizes. Using the terminology from [22], let $y_t(C)$ be the block pushed ("yanked") from a cache of size $C$. To make room for the referenced block, the top block in the stack, $s_{t-1}(1)$, must be pushed from a one-block cache, becoming $y_t(1)$. Some block must also be pushed from a two-block cache: the one with the lowest priority. A single comparison between $y_t(1) = s_{t-1}(1)$ and $s_{t-1}(2)$ determines which becomes $y_t(2)$; the other is kept as $s_t(2)$. The block pushed from a three-block cache must be the one with the lowest priority of the three blocks previously present. However, the lowest priority block can be determined in a single comparison of $y_t(2)$ and $s_{t-1}(3)$ since the third block, now $s_t(2)$, has already "won" a comparison against $y_t(2)$, and thus cannot have the lowest priority. Similar logic applies for all levels down to level $k$, the original level of the referenced block; only the block currently at the level and the one pushed from above need be compared to find the block to be pushed. The contents of all sizes larger than $k$ are again unchanged.

Notes on the algorithm of Figure 3:
(1) In step 5, all counts that are not incremented are assumed to remain unchanged at the next time interval; thus the subscript $t$ is not used.
(2) In step 7, pmin returns the block with the lowest priority, as defined by the replacement algorithm. Pmin is the comparison function in the circles of Figure 2.
(3) In step 8, plus and minus have the intuitive meaning of adding a member to a set or removing a member. In this context, adding a member that is already present, or subtracting a member that is not present, have no effect. The same is true of adding or subtracting $\emptyset$. Thus, the block kept at level $i$, $s_t(i)$, is either $s_{t-1}(i)$ or the block pushed from above, $y_t(i-1)$, whichever is not pushed from level $i$.
The stack analysis algorithm is formally presented in Figure 3 . This algorithm is used as the basis for the extensions in Section 2. Let:
$x = x_1, x_2, \ldots, x_N$ be a trace, where $x_t$ is the reference at time $t$.
$S_t$ = the cache stack just after reference to $x_t$, with $s_t(C)$ = the block at stack level $C$. $s_0(C) = \emptyset$ for all $C$.
$\Delta$ = the stack distance of $x_t$, that is, $s_{t-1}(\Delta) = x_t$.
$y_t(C)$ = the block pushed ("yanked") from cache of size $C$ by reference $x_t$.
$rh(C)$ = a count of the number of hits to level $C$ by time $t$.
Note that, in practice, it is possible to search the stack for the referenced block and update the stack simultaneously, since the priority function cannot depend on where (or even if) the referenced block is in the stack. The update stops when the referenced block is found. The block being pushed takes the place of the referenced block, which is inserted on top of the stack.

Fig. 4. Cache contents using the Least Frequently Used policy. The number beside the block is the priority, that is, the number of references.
As an example, consider the application of a Least Frequently Used policy to the reference string {AAABBCCDB}.
Using this policy, the block pushed from any cache is the one that has been used the fewest total times. Figure 4 shows the contents of the stack after each reference, where the number beside each block is the priority (i.e., the number of uses of the block). Notice that a block may be pushed several levels because of a reference, as seen at time 8. Note too that blocks below the level where the referenced block is found are unchanged, even though they may have higher priorities, as seen after the last reference.
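The single-pass, bubble-sort-like update can be sketched as follows and applied to the LFU example string. This is an illustrative sketch, not the algorithm of Figure 3 verbatim: the function name `reference` is hypothetical, and the tie-breaking rule (on equal priority the block already at a level stays and the block pushed from above keeps moving down) is an assumption, since the text does not state how ties are resolved.

```python
from collections import Counter


def reference(stack, block, priority):
    """Apply one reference to `stack` in place; stack[0] is the top.

    `priority(b)` must not depend on cache size. Returns the stack
    distance of `block`, or None if it was not in the stack.
    """
    if not stack:
        stack.append(block)
        return None
    if stack[0] == block:
        return 1                         # already on top; nothing moves
    pushed = stack[0]                    # block pushed from a one-block cache
    new_stack = [block]                  # the referenced block always goes on top
    distance = None
    for i in range(1, len(stack)):
        resident = stack[i]
        if resident == block:
            distance = i + 1
            new_stack.append(pushed)         # pushed block fills the vacated slot
            new_stack.extend(stack[i + 1:])  # lower levels are unchanged
            break
        if priority(resident) < priority(pushed):
            new_stack.append(pushed)     # pushed block wins and stays at this level
            pushed = resident            # resident is pushed further down
        else:
            new_stack.append(resident)   # assumed tie-break: resident stays
    if distance is None:
        new_stack.append(pushed)         # stack grows by one level
    stack[:] = new_stack
    return distance


counts = Counter()                       # LFU priority: number of references so far
stack, distances, snapshots = [], [], []
for b in "AAABBCCDB":
    counts[b] += 1
    distances.append(reference(stack, b, lambda x: counts[x]))
    snapshots.append(list(stack))

print(snapshots[4], snapshots[5])  # ['B', 'A'] ['C', 'A', 'B']
print(stack)                       # ['B', 'A', 'D', 'C']
```

Under the assumed tie-break, A and B swap relative positions between times 5 and 6, C is pushed three levels at time 8, and after the last reference C (two references) sits below D (one reference), in line with the observations in the text.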
Nonstack Algorithms
The prohibition against a priority function that depends on cache size prevents some otherwise simple policies, such as the First-In First-Out (FIFO) rule, from being stack algorithms [22]. Another common technique that is not a stack algorithm is the use of demand prefetch or prefetch-on-miss [30]. Suppose that the prefetch policy is to fetch the following block along with any fetched block, but not to prefetch if the referenced block is already present. Assume an arbitrary stack algorithm for replacement. It is easy to construct counterexamples that violate inclusion because the priority of a prefetched block depends on when it is fetched, which varies with cache size. For example, consider the examples of Figure 5, where the contents of a larger cache are clearly not a superset of a smaller cache after the final reference.
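A small counterexample of this kind can be checked directly. The sketch below is an illustration under stated assumptions, not a reproduction of Figure 5: blocks are integers, the "next" block of `b` is `b + 1`, replacement is LRU, and on a miss the referenced block takes the highest priority followed by the prefetched block. The function name and the three-reference trace are hypothetical.

```python
def lru_demand_prefetch(trace, size):
    """Simulate one cache size; cache[0] has the highest priority."""
    cache = []
    for block in trace:
        if block in cache:
            cache.remove(block)
            cache.insert(0, block)   # hit: move to top, no prefetch
            continue
        cache.insert(0, block)       # miss: fetch the referenced block
        nxt = block + 1              # assumed "next block" relation
        if nxt in cache:
            cache.remove(nxt)
        cache.insert(1, nxt)         # prefetched block gets second priority
        del cache[size:]             # evict the lowest-priority blocks
    return cache


trace = [1, 3, 1]
small = lru_demand_prefetch(trace, 2)
large = lru_demand_prefetch(trace, 3)
print(small)                          # [1, 2]
print(large)                          # [1, 3, 4]
print(set(small) <= set(large))       # False: inclusion is violated
```

The last reference misses in the two-block cache (so block 2 is prefetched) but hits in the three-block cache (so nothing is prefetched), which leaves block 2 in the smaller cache only.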
It is possible to construct prefetch policies that are stack algorithms. For example, the nondemand policy that always prefetches the next block, regardless of whether the referenced block is resident, is a stack algorithm. This policy is a form of One Block Lookahead or OBL [2]. From the point of view of the stack, this is equivalent to the insertion of a reference to the next block after each reference. Nondemand prefetch is not practical if the cost of a fault is high, as it is in virtual memory systems, for example, because the penalty for faulting to prefetch a block that may not be needed is greater than the potential gain. Nondemand prefetch is practical when it is possible to look for the next block in the cache and prefetch it if necessary without significantly slowing down processing the current reference. This is the case for many large processor caches and file system caches. See also [15] for an interesting demand prefetch algorithm that does satisfy inclusion.

Fig. 5. Since this is not a stack algorithm, the contents of each cache size are listed separately. In both examples the next block is prefetched only if the referenced block is not present. In all cases the referenced block becomes the highest priority, followed by the prefetched block, if any. The inclusion property is violated after the last reference in both cases.
Extensions to Stack Analysis
There have been several important extensions to the basic stack analysis technique. Mattson et al. [22] showed how the hit ratio can be computed for an arbitrary number of levels, assuming a common block size and replacement policy. Gecsei [9] showed how it could be generalized to multiple levels with different block sizes for LRU and certain related policies. Traiger and Slutz [37] showed that it is possible to compute miss ratios for variable block sizes and variable associativity in a single pass. See also [13], [26], and [28]. Coffman and Randell [6] investigated the "extension problem," that is, predicting the performance of cache sizes greater than C, given only the misses from cache size C instead of a full trace. For LRU, a trace of "pushes" and "pulls" was sufficient; for other stack algorithms, the priority ranking for the block pushed and all blocks not in the cache of size C was also required. A trace of misses only was shown to be sufficient for providing good approximations to the performance of larger caches in [29].
A more recent extension by Silberman [27] showed that stack analysis can be applied to a delayed-staging hierarchy in which the processor directly accesses several levels of the memory hierarchy. When a referenced block is not in a higher level cache, it is supplied to the processor (at the speed of the highest level cache to contain the block) and begins "migrating" into the higher caches. The time elapsed until it becomes "staged" (resident) in a higher cache is equal to the sum of the access times of the caches below it. Further, the displacement of a block in the higher level cache is also delayed, creating a situation where the stack level of a block may be a function of the size of several lower level caches and the time since the last reference to one or more other blocks. Silberman showed that stack analysis can be applied to this class of hierarchy by maintaining the time and cache depth of the last "migration" for each block. This information is used at the time of each reference to compute the stack distance of the block for different sizes of each level, by considering the delayed staging times. This idea of maintaining additional information about each block is seen again in our write-back algorithm presented in the next section.
WRITE-BACK STACK ALGORITHM
We turn now to the development of a stack analysis algorithm for write-back. We begin by discussing the problems with write-back stack analysis, then present a general nonstack algorithm for computing the write-back ratios. We then prove that the algorithm obeys a form of inclusion and derive a corresponding stack algorithm.
The Write-Back Problem
In write-back, a write access to secondary storage occurs whenever a dirty block is "pushed." The main problem with write-back is maintaining the "state" (clean or dirty) of each block in the stack. A single dirty bit is sufficient in the real cache, but not for the simulation stack. Consider a read to a dirty block at level k. For sizes k and larger the block is still dirty, since it has not been written; for sizes 1 to k - 1 it is clean. The inclusion property is violated since the contents of the larger cache are "different" in the sense that the block has different attributes in some larger sizes. A second problem is accounting for the "dirty pushes." Each miss from a memory of size C causes a push from each smaller memory; that pushed block may be dirty. On first inspection, this suggests that counts need to be maintained and updated for every memory size from which a dirty block is pushed. We show that a surprisingly simple technique solves both of these problems.
A Nonstack Algorithm
We begin by assuming that write-back is not a stack algorithm and by imagining a general algorithm for computing write-back miss or transfer ratios. The algorithm is based on the stack analysis algorithm from Section 1.3, but maintains a separate set of dirty blocks for each cache size in order to solve the problem of the noninclusion of dirty bits. In addition to the symbols defined in Section 1, let:

$D_t(C)$ = the set of dirty blocks in a memory of size $C$ after the reference to $x_t$.
$p_t(C)$ = the dirty block "pushed" from a memory of size $C$ by reference $x_t$, or $\emptyset$ otherwise.
$dp(C)$ = the number of blocks written back from a memory of size $C$ by time $t$.
Fig. 6. Algorithm 2. Step annotations: for all events; if not referenced before; find the stack distance; update the read hits; if stack needs updating; calculate push set; calculate dirty push set (if block is dirty, include in dirty push set); count dirty pushes; establish new stack; establish new dirty set.
We define Algorithm 2 in Figure 6 by adding steps 7A and 8A to Algorithm 1. When a block is written, it must be added to each dirty set (line 8A). A block is removed from a set if and only if a dirty block is pushed from memory (line 7A). Note that if write fetch is not used, then line 5 of Algorithm 2 must be conditioned on a read, that is, IF $w_t = \emptyset$ THEN $rh(\Delta) = rh(\Delta) + 1$.
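For the LRU case, the idea of keeping one dirty set per memory size can be sketched as below. This is a hedged illustration rather than a transcription of Figure 6: the function name, the `(block, is_write)` trace encoding, and the sample trace are hypothetical, and only the dirty-push counts are computed (read hits are omitted).

```python
def dirty_pushes_per_size_sets(trace, max_size):
    """LRU write-back with a separate dirty set D[C] for each size C.

    trace: list of (block, is_write) pairs. Returns {C: dp(C)}.
    """
    stack = []                                         # stack[0] is the top
    dirty = {c: set() for c in range(1, max_size + 1)}
    dp = {c: 0 for c in range(1, max_size + 1)}
    for block, is_write in trace:
        if block in stack:
            delta = stack.index(block) + 1             # stack distance
        else:
            delta = len(stack) + 1                     # misses in every full size
        # Under LRU the block pushed from a full cache of size c < delta
        # is the one currently at level c.
        for c in range(1, min(delta, max_size + 1)):
            victim = stack[c - 1]
            if victim in dirty[c]:
                dirty[c].discard(victim)               # removed only on a dirty push
                dp[c] += 1
        if block in stack:
            stack.remove(block)
        stack.insert(0, block)
        if is_write:
            for c in dirty:                            # a write dirties every size
                dirty[c].add(block)
    return dp


W, R = True, False
trace = [("A", W), ("B", R), ("A", R), ("C", W), ("A", W), ("B", R), ("D", R)]
print(dirty_pushes_per_size_sets(trace, 4))   # {1: 3, 2: 2, 3: 1, 4: 0}
```

The work per reference grows with the number of sizes tracked, which is the cost that the dirty-level technique of the following sections avoids.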
Dirty Set Inclusion Property
The inclusion property of stack algorithms states that if a block is present in memory of size C, then it is present in size C + 1, and therefore in all larger sizes.
This can be formally stated as $M_t(C) \subseteq M_t(C + 1)$ for all $t$ and $C$, where $M_t(C)$ is the set of blocks present in a memory of size $C$ after reference $x_t$. We now show that a similar condition applies to dirty sets; that is, if a block is dirty in a memory of size $C$, then it is dirty in all larger sizes.
An intuitive argument of this fact is the following. In order to become dirty, a block must be written, which makes the block dirty in all sizes. A block becomes clean only when it is replaced (ignoring deletions for now). Because the replacement algorithm is a stack algorithm, the block is always pushed from a smaller cache before it is pushed from a larger one. The dirty level is therefore the maximum level to which the block has been pushed since it was written. A read may pull the block to the top of the stack, but will leave it dirty in an inclusive set of sizes. There is no way to make it dirty in some sizes without making it dirty in all sizes; therefore, inclusion holds. A more formal proof follows.

With these facts we can simplify the algorithm considerably. First, $D_t(C) \subseteq D_t(C + 1)$ implies that there is a minimum size at which a block is dirty (if it is dirty at all). Intuitively, this is the smallest memory from which the block has not been pushed since its last write reference, and therefore, the smallest memory size in which it is still dirty. This is also the largest stack distance the block has attained since it was last written. Therefore, the separate $D_t(C)$ can be replaced by a single array. Let $dl(x)$ be the dirty level of block $x$ or infinity, if the block has never been written. A block at level $k$ (i.e., $s(k) = x$) is dirty if and only if $dl(x) \le k$. We can set the dirty level to 1 when a block is written and update it as the block is pushed.
Writes Avoided
Before defining the new algorithm, let us also reconsider the way dirty pushes are counted. In Algorithm 2, dp is updated as each block is pushed. Also, recall that the purpose of the write-back policy is to avoid the write to secondary storage that is required for each write reference when using write-through. We can count the number of write-backs required in two ways. One is to count them directly, as done by Algorithm 2. The other is to count the total number of writes and then to subtract the number of times that no additional write-back is required, since the block is already dirty or is being deleted. When a write does not require a write-back, we increment a count of writes avoided. This is analogous to the way reads are computed in the basic stack analysis algorithm, where a read is avoided for all sizes larger than the current stack distance.
Ignoring deletes for now, a write is avoided only when a dirty block is overwritten, since both the previous and current modification can be written by the next write-back. Therefore, we can say that the previous write has been avoided for all sizes equal to or greater than the current dirty level. Notice that we now only care about the dirty level for the block being referenced, and therefore, we only need to adjust $dl$ for the referenced block. If it is found at level $\Delta$ which is below its dirty level (i.e., $\Delta > dl(x_t)$), we can reason that the block has been pushed (while dirty) from all levels between $dl(x_t)$ and $\Delta$; therefore, the proper value for $dl(x_t)$ is $\Delta$; see line 6 of Algorithm 3 (Figure 7).
We now define wa(C) to be the writes avoided at level C, that is, the number of writes for which the referenced block was still dirty in memory sizes C and larger. The write-back stack algorithm, Algorithm 3, is shown below. The differences between this algorithm and Algorithm 1 are line 6, which adjusts the dirty level as described above, and lines 10-13, which count the writes avoided and write references and reset the dirty level to one on a write.
For the special case of LRU, this algorithm is particularly simple. As in the standard stack analysis algorithm for LRU, updating the stack is a matter of removing the referenced block and inserting it at the top of the stack. The fact that only the referenced block affects the statistics is particularly useful for this case since no work needs to be done while searching for the referenced block.
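As a minimal sketch of the LRU special case, the following Python illustrates the bookkeeping described above: the lazy correction of the dirty level, the count of writes avoided, and the reset of the dirty level on a write. The function name `write_back_lru`, the trace format of `(block, is_write)` pairs, and the use of `Counter` objects for rh and wa are assumptions made for illustration; this is not a transcription of Figure 7.

```python
import math
from collections import Counter

def write_back_lru(trace):
    """Single-pass LRU stack analysis with dirty levels (illustrative sketch).

    trace: iterable of (block, is_write) pairs -- an assumed input format.
    Returns (rh, wa, writes, stack, dl).
    """
    stack = []       # stack[0] is the top of the stack (level 1)
    dl = {}          # dirty level per block; math.inf means "not dirty"
    rh = Counter()   # rh[k]: references found at stack distance k
    wa = Counter()   # wa[k]: writes avoided for all sizes >= k
    writes = 0       # running count of write references (W_t)
    for block, is_write in trace:
        if block in dl:
            delta = stack.index(block) + 1        # stack distance
            rh[delta] += 1
            # Lazy fix: a dirty block found below its recorded dirty level
            # was pushed, while dirty, from every level in between.
            if dl[block] != math.inf and delta > dl[block]:
                dl[block] = delta
            stack.pop(delta - 1)
        else:
            dl[block] = math.inf                  # first reference: clean
        stack.insert(0, block)                    # LRU: referenced block to top
        if is_write:
            if dl[block] != math.inf:
                wa[dl[block]] += 1                # the earlier write is avoided
            dl[block] = 1                         # now dirty in every size
            writes += 1
    return rh, wa, writes, stack, dl
```

For the trace "write a, read b, write a", the second write finds block a at level 2, so one write is avoided for sizes 2 and larger. The linear `stack.index` search is the naive implementation; it is chosen here only for brevity.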
Dirty Push Computation
Using Algorithm 3, the number of dirty pushes which have occurred by time t for a memory of size C is given by

$$dp(C) = W_t - \sum_{i=1}^{C} wa(i) - |D_t(C)| \qquad (2.1)$$

[Figure 7. Algorithm 3, the write-back stack algorithm. Lines recoverable from the figure:
11. IF $dl(x_t) \neq \infty$ THEN $wa(dl(x_t)) = wa(dl(x_t)) + 1$
12. $dl(x_t) = 1$
13. $W_t = W_{t-1} + 1$
Step annotations, in order: For all events. If not referenced before. Update the read hits. Set the "real" dirty level. If stack needs updating. Calculate push set. If this is a write. Count writes avoided. Block is dirty. Count of write references.]
The first two terms of (2.1) are obvious, but we should elaborate on the need for the third term. It should be clear that each block that is still dirty has avoided the most recent write for all sizes in which it is still dirty and should therefore be subtracted from the count of writes. This argument applies at any point during the trace and at the end of the simulation. Since the relevant metrics are those gathered during the trace period, regardless of any activity which occurs after the trace ends, we should consider each dirty block remaining at the end of simulation as having avoided a write. To simplify the computations, we can make a final scan of the memory stack and update wa(dl(x)) for each dirty block x. We can then eliminate the third term of (2.1). Of course, the effect of this should be small if the total number of trace events is large.
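The final scan and the resulting dirty-push counts can be sketched as follows. The function name `dirty_pushes` and its argument layout are hypothetical. The sketch also assumes that dirty levels are corrected lazily (only when a block is referenced), so the scan takes the larger of the recorded dirty level and the current stack level as the block's real dirty level.

```python
import math
from itertools import accumulate

def dirty_pushes(writes, wa, stack, dl, max_size):
    """Return [dp(1), ..., dp(max_size)] after a final scan of the stack.

    writes: total write references; wa: dict mapping level -> writes avoided;
    stack: blocks from the top (level 1) down; dl: dict block -> dirty level.
    """
    avoided = [0] * (max_size + 1)                # index 0 is unused
    for level, count in wa.items():
        if level <= max_size:
            avoided[level] += count
    # Final scan: each block still dirty has avoided its most recent write
    # in every size at or above its real dirty level.
    for level, block in enumerate(stack, start=1):
        d = dl.get(block, math.inf)
        if d == math.inf:
            continue
        real = max(d, level)                      # assumed lazy correction
        if real <= max_size:
            avoided[real] += 1
    prefix = list(accumulate(avoided))            # prefix[c] = sum of avoided[0..c]
    return [writes - prefix[c] for c in range(1, max_size + 1)]
```

With two writes to block a separated by a read of block b, a one-block memory pushes a dirty once, while a two-block memory never does.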
Using this expression for the number of dirty pushes leads to a simple recurrence for computing the transfer ratio. Recall that equation (1.7) for computing the transfer ratio from Section 1.2 is
Assuming write-fetch, the first two terms can be replaced by the stack analysis computation of the miss ratio of (1.8), giving Substituting (2.1) for dp(C) and assuming that the final scan has updated wa, this simplifies to Notice that since rh(i) and wa(i) are both nonnegative, this function also decreases as memory size increases, just as the miss ratio does.
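As a hedged sketch of this substitution, suppose that the transfer ratio counts fetches plus dirty pushes per reference, that $N$ denotes the total number of references, and that under write-fetch every miss causes exactly one fetch. These are assumptions made here for illustration, since (1.7) and (1.8) are not restated in this section. With wa already updated by the final scan, so that the third term of (2.1) vanishes, the steps would be:

```latex
\begin{align*}
TR(C) &= \frac{\text{fetches}(C) + dp(C)}{N} \\
      &= \frac{\bigl(N - \sum_{i=1}^{C} rh(i)\bigr)
             + \bigl(W_t - \sum_{i=1}^{C} wa(i)\bigr)}{N} \\
      &= 1 + \frac{W_t}{N} - \frac{1}{N}\sum_{i=1}^{C}\bigl(rh(i) + wa(i)\bigr)
\end{align*}
```

Under these assumptions, each increase in C adds only nonnegative terms to the subtracted sum, which is consistent with the monotonicity noted above.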
Warm Start
If the simulation results are gathered starting from an empty stack, the results can be biased by the fact that many of the early references will be misses in all cache sizes. In fact, until the memory contains k blocks there is no chance of a hit at level k, which produces a higher than expected miss ratio. In some situations this cold-start miss ratio is appropriate, for example, when a single-program address trace is used to derive multiprogramming metrics [8]. In other situations, the desired metrics are those for a system in steady state. In these cases it is common to warm start the simulation to reduce startup effects. A warm start consists of allowing the simulation to run until it is assumed to be in steady state, often either for a fixed number of events or until the memory contains a fixed number of blocks, then stopping. Without changing the state of the simulation, all statistics are cleared. The simulation then resumes from its current state. The final metrics are those gathered after the warm start.
Warm start using the write-back algorithm can produce an anomaly in the transfer ratio. This is caused by the final scan of memory, which considers all dirty blocks as having avoided a write, even though that write may have occurred before the warm start. Suppose, for example, that the write-back simulation is warm started, and suppose that $W_t$ and wa are zeroed. Then immediately after warm start, the value of dp(C) calculated using (2.1) may be negative for some values of C, as shown in Figure 8, where the number in parentheses is the dirty level of the block. Of course, a "negative push" is meaningless. We can keep the numbers positive by setting $W_t$ to the number of dirty blocks in the cache at warm start, but then dp is immediately nonzero for some cache sizes. Another alternative would be to zero both wa and dl, but then it will be a long time before any dirty block could be pushed from large sizes, in conflict with the reason to warm start in the first place.
Since the third term of (2.1) increases with C, the second term of (2.1), the sum of wa, must decrease for larger C if we want the computed value of dp to be zero immediately after warm start. This can only happen if some wa are negative. The solution we use is to zero wa at warm start, then decrement wa[dl(x)] for all dirty blocks x. With this solution dp(C) is zero immediately after warm start for all C, as it intuitively should be; see Figure 9(a). Now suppose that a reference to a previously unreferenced block causes all blocks to be pushed (Figure 9(b)). The result is that dp(C) is zero for all sizes except those from which a dirty block is pushed, which is exactly the result obtained from a simulation of a single cache size or a real cache.
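The warm-start adjustment can be sketched as below. The names `warm_start_counters` and `dp_with_final_scan` are hypothetical, and the stack contents in the usage note are invented for illustration rather than taken from Figure 9. As an assumption of this sketch, dirty levels are corrected lazily, so the real dirty level of a block is the larger of its recorded dirty level and its current stack level.

```python
import math
from collections import Counter

def real_dirty_level(level, block, dl):
    """Real dirty level of the block at stack `level` (math.inf if clean)."""
    d = dl.get(block, math.inf)
    return math.inf if d == math.inf else max(d, level)

def warm_start_counters(stack, dl):
    """Zero the statistics, then decrement wa at each dirty block's level."""
    wa = Counter()
    for level, block in enumerate(stack, start=1):
        r = real_dirty_level(level, block, dl)
        if r != math.inf:
            wa[r] -= 1
    return 0, wa                                   # (writes, wa)

def dp_with_final_scan(writes, wa, stack, dl, size):
    """dp(size): writes, minus writes avoided, minus blocks still dirty."""
    total = writes - sum(c for k, c in wa.items() if k <= size)
    for level, block in enumerate(stack, start=1):
        if real_dirty_level(level, block, dl) <= size:
            total -= 1
    return total
```

For a stack a, b, c, d with a, c, and d dirty, dp is zero for every size right after the warm start. If a new block e then pushes every block down one level, dp becomes one exactly for the sizes that lost a dirty block (sizes 1, 3, and 4 in this invented example).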
Note, however, the unexpected result that the transfer ratio due to dirty pushes is no longer a monotone decreasing function of size. In fact, if the warm start of Figure 9(a) were followed by the unlikely event of five total misses, the resulting transfer ratio would be increasing with cache size. It seems that the rate of dirty pushes may be exaggerated for larger cache sizes by the fact that there are more dirty blocks in the larger cache. (There may also be a higher probability that blocks pushed from larger caches are dirty; see Section 5.) This "error" for large sizes is bounded by the number of dirty blocks in the stack divided by the number of references after warm start. It can therefore be made arbitrarily small by increasing the number of references after warm start (which also reduces the need for warm start). In most cases, locality causes the write-back traffic ratio to assume its normal decreasing form.

Fig. 9. Revised count of dirty pushes after warm start: (a) is immediately after warm start, and (b) is after all blocks are pushed one level. The count of dirty pushes from each size, dp(C), agrees with the results from a real cache. (Columns in each panel: Level, Stack, wa, dp.)
EXTENSIONS
In addition to write-back, several intermediate and related policies can be analyzed using our technique.
Write-Through
This policy is trivially included in the algorithm by setting $dl(x_t)$ to infinity instead of one after a write. In fact, since the total number of write requests is known, both the write-back and write-through transfer and traffic ratios are available simultaneously. It is also possible to simulate a combination of policies provided the choice of policy is not a function of memory size. For example, some blocks could be write-through and others write-back, a scheme used in some real caches, for example, the Intergraph CLIPPER processor [5, 20] and the NEC disk cache [36].
An example of an algorithm for such a cache is given as Algorithm 4 (Figure 10). The only difference between this and Algorithm 3 is that Line 12 ensures that only write-back blocks become dirty. Writes to both write-back and write-through blocks are counted in $W_t$ (Line 13), but only writes to write-back blocks are avoidable (Line 11) since write-through blocks are never dirty. As a simplification, both write-back and write-through are counted the same. In reality, a write-through may involve less data and, therefore, is less costly. This algorithm assumes that write allocate and write fetch are performed for write-through blocks; if this is not the case, then Line 5 and the stack update should be bypassed for a write-through miss.

[Figure 10. Algorithm 4, the stack algorithm for a cache with both write-back and write-through blocks. Lines recoverable from the figure:
11. IF $dl(x_t) \neq \infty$ THEN $wa(dl(x_t)) = wa(dl(x_t)) + 1$
12. IF BLOCK IS WRITE-BACK THEN $dl(x_t) = 1$
13. $W_t = W_{t-1} + 1$
Step annotations, in order: For all events. If not referenced before. Update the read hits. Set the "real" dirty level. If stack needs updating. If a write. Update dirty pushes. Count writes avoided. Write to write-back block. Block is dirty.]
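The write-handling steps of the mixed policy (Lines 11 to 13) can be sketched as a small Python function. The name `record_write` and the representation of the policy choice as a set `write_back_blocks` are assumptions for illustration. The sketch also assumes that the dirty level of the referenced block has already been corrected to its real value earlier in the same event.

```python
import math
from collections import Counter

def record_write(block, dl, wa, write_back_blocks):
    """Apply the write-handling steps for one write reference.

    dl: dict block -> dirty level (absent or math.inf means clean).
    wa: Counter of writes avoided, indexed by dirty level.
    Returns the increment to the write count W_t.
    """
    level = dl.get(block, math.inf)
    if level != math.inf:
        wa[level] += 1            # Line 11: the previous write is avoided
    if block in write_back_blocks:
        dl[block] = 1             # Line 12: only write-back blocks become dirty
    return 1                      # Line 13: every write is counted
```

Because a write-through block never acquires a finite dirty level, repeated writes to it never increment wa, while every write still contributes to the write count.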
Periodic Write-Back
With large caches, there may be a very long delay before a block is removed by replacement. In practice, reliability considerations may dictate that a dirty block be written before this time. Suppose that all dirty blocks are written every n seconds instead. An example of this is the UNIX file system policy of writing all dirty file system buffers to disk every 30 seconds. Alternatively, suppose only certain blocks are written, for example, by a policy to write a block after it has been unreferenced for n seconds. These policies are all stack algorithms, provided that the write happens for all memory sizes where the block is dirty, in order to maintain inclusion in the dirty set.
A forced write-back is implemented in the algorithm by setting dl(x) to infinity for each written block. It has no effect on writes avoided, except that the write which made the block dirty cannot subsequently be avoided. The effect of this is to increase the calculated number of dirty pushes. Consider the third term in (2.1) for any C where the block is dirty: The block was dirty and included in $D_t(C)$; it is now clean and not in the term; the net increase to dp(C) is 1.
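A minimal sketch of a forced write-back follows, together with a helper that evaluates the number of blocks dirty in a memory of a given size. The names `forced_write_back` and `dirty_count` are hypothetical, and the helper assumes, as an illustration, that recorded dirty levels are corrected lazily, so that the real dirty level is the larger of the recorded level and the stack level.

```python
import math

def dirty_count(stack, dl, size):
    """Number of blocks that are dirty in a memory of `size` blocks."""
    count = 0
    for level, block in enumerate(stack, start=1):
        d = dl.get(block, math.inf)
        if d != math.inf and max(d, level) <= size:
            count += 1
    return count

def forced_write_back(dl, blocks):
    """Mark each written block clean in every memory size."""
    for block in blocks:
        dl[block] = math.inf
```

Each block cleaned this way leaves the dirty set for every size in which it was dirty, so the value computed for dp(C) rises by one per such block while wa is untouched.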
Deletions
An important consideration in file system studies is the existence of deletions in the reference string. If a file is deleted, the blocks of that file should be removed from the cache without a write. With a write-back cache and short file lifetimes, it is likely that file blocks will be created and deleted without ever being written to the next level [24]. Deletions also occur in processor caches when blocks are invalidated, but generally not without writing the block first if it is dirty. This case is discussed in Section 3.4. Deletion of blocks from the cache was discussed by Mattson et al. [22] in the context of a "call back" hierarchy, where cache blocks may be invalidated by a write directed to a lower level. The example used by Mattson is a virtual memory system in which all I/O occurs to blocks residing in an I/O subsystem, not the CPU memory. If an I/O is addressed to a block which is in CPU memory, that block must be invalidated. Greenburg [11] also discusses deletions and implements an algorithm to approximate the effect of deletion. Olken [23] proposes an exact algorithm and discusses implementation using various data structures. None of these consider the effect of write-back.
If a deleted block were simply deleted from the stack, the stack level for all lower blocks would be reduced. This would have the undesirable effect of calling these blocks back into a memory from which they had been pushed. Instead, what Mattson called a "marker" block is inserted in the stack replacing the deleted block.
We refer to the marker blocks as gaps in the stack, corresponding to a vacant block in all larger caches. The next push from above the gap replaces the gap with the pushed block since no block needs to be replaced in a cache containing a vacant block. Thus a gap stops the sequence of stack updates, just as finding the referenced block stops the pushes in the normal case. However, since the referenced block must still be pulled to the top and blocks below the referenced block do not change stack level, the referenced block must be replaced in the stack by another gap. Thus, a reference to a block below the first gap seems to make the gap "jump" down the stack. As an example, consider the sequence of Figure 11 . After block D is deleted, a gap is left at level 4. A reference to block B above level 4 does not affect the gap. However, the reference to block F below level 4 "jumps" the gap to the stack level of F. From the point of view of the "real" cache, the gap represents the same vacant block, which was in all memory sizes 4 or larger. Since block F is already resident in memories of size 6 or larger, the reference to F has not fetched any block to fill the gap. Therefore, the gap still exists in these sizes.
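The gap mechanism for an LRU stack can be sketched as follows. The marker value `GAP`, the function names, and the initial stack contents in the usage note are assumptions for illustration; they are not taken from Figure 11, and write-back bookkeeping is omitted.

```python
GAP = "<gap>"   # hypothetical marker for a vacant slot in the stack

def delete_block(stack, block):
    """Replace a deleted block by a gap so lower blocks keep their levels."""
    stack[stack.index(block)] = GAP

def reference_block(stack, block):
    """Reference `block` under LRU; return its stack distance (None on a miss)."""
    delta = stack.index(block) + 1 if block in stack else None
    gamma = stack.index(GAP) + 1 if GAP in stack else None
    if gamma is not None and (delta is None or gamma < delta):
        # Pushes stop at the first gap, which absorbs the block pushed into it.
        stack.pop(gamma - 1)
        stack.insert(0, block)
        if delta is not None:
            stack[delta - 1] = GAP    # the gap "jumps" to the block's old level
    else:
        # No gap above the referenced block: ordinary LRU update.
        if delta is not None:
            stack.pop(delta - 1)
        stack.insert(0, block)
    return delta
```

Starting from an assumed stack A, B, C, D, E, F, deleting D leaves a gap at level 4; a reference to B leaves the gap in place; a reference to F moves the gap to level 6; and a subsequent miss fills the gap without lengthening the stack.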
The effect of deletions on the transfer ratio is to introduce another way in which a write can be avoided, particularly evident in large cache sizes. If a block is written, then deleted before it is pushed, the write-back is avoided for all sizes greater than the current dirty level. It is therefore a simple matter to increment the appropriate $wa[dl(x_t)]$ on deletion. In addition, the count of read hits must exclude deletes, since a deleted block is never fetched. This is seen in lines 6 and 7 of Algorithm 5.
The complete, though somewhat complicated, algorithm for write-back with deletions is given as Algorithm 5 (Figure 12). Let:
$\gamma$ = a gap marker in the stack.
$\Gamma$ = the level of the first gap in the stack.
$\Delta' = \min(\Delta, \Gamma)$, the level at which pushes stop.
There are actually only a few changes between Algorithm 3 and Algorithm 5. a gap if it was below the first gap by implementing the "jump" of a gap described above.
Flush Back
There is sometimes a need to write back a dirty block and remove it from the cache before it is a candidate for replacement. An example is when a block is flushed from a private cache on a multiprocessor bus so that another processor can acquire the block [17] . Flushing the entire cache periodically is also used in some processor simulations to approximate multiprogramming effects [32] . It should be clear that this can be simply implemented as a periodic write-back followed by a delete. The contents of wa are unchanged.
SECTOR CACHE SIMULATION
We now consider the study of subblock or sector caches and show that they too can be simulated using stack analysis by a technique similar to that used for write-back.
Background
A typical cache consists of blocks or sectors of data, each with associated tags identifying the virtual addresses contained in the block; see Figure 13 (a). If smaller blocks are used, the total space for tags increases (Figure 13(b) ) since each of the blocks requires its own tags. Regardless of size, each block also requires a valid bit indicating whether the block contains valid data.
An alternative arrangement is the subblock or sector cache. In a sector cache, each cache block or sector is divided into a fixed number of subblocks or subsectors. Goodman uses the terms address block and transfer block for these sectors and subsectors, respectively [10], whereas Liptay uses the terms sector and block [21] and Hill proposes block and subblock [14]. Throughout this section we will use IEEE-proposed terminology for such caches, referring to sectors and subsectors [16]. Tags are associated with the sector as a whole; see Figure 13(c). Transfers between the cache and secondary storage are done in units of subsectors. In addition, there must be a valid bit for each subsector to indicate whether or not the subsector data is present.
Sector caches are motivated by two factors. The first is a need to reduce the number of tags to be searched. This was the motivation when it was first used in the IBM 360/85 cache [21]. The reduced number of tags also reduces the chip area needed for tags in a VLSI cache. A second reason for subsectors is to reduce the size of each data transfer. On a cache chip with limited pins for parallel data transfer, a large sector size would require multiple cycles, where a smaller subsector can be transferred in one parallel access. Similarly, on-chip data path widths favor a small sector.
The smaller subsectors may also be used to reduce memory bus traffic when the bus is a potential constraint [10]. For example, suppose that one bus cycle is required to access each subsector, and there are four subsectors per sector. It would take four cycles to access a sector-size block, whereas a subblock cache requires a cycle only for those subsectors actually referenced. Although this generally decreases the traffic ratio, the transfer ratio usually increases. Thus there is an advantage only when the bus loading is largely a function of the number of bytes transferred; see Section 1.2.
A sector cache tends to have a higher miss ratio than the same size cache with subsector-sized blocks because of the rigidity in the assignment of subsectors. It will usually also have a higher miss ratio than the same size cache using sector-sized blocks because each subsector can cause a fault. However, misses that fetch smaller sectors may "cost" less than larger sectors in some cases. At the same time, the sector cache reduces the traffic ratio compared to the nonsector cache with large blocks by not loading subsectors which are not needed, as would be the case if the entire sector were loaded. The performance of subsector processor caches is studied by Hill and Smith [14].
The disk cache in the IBM 3880 Control Unit is also a form of sector cache [12]. The sector size is a full track with a variable number of subsectors, one for each disk record. This organization was chosen so that the cache could be a physical and logical copy of the disk contents, while offering the advantages of caching. To avoid holding up the processor while waiting for a full track to be transferred, the disk is positioned to the requested record which is then transferred to both the processor and controller cache. After signaling completion of the requested I/O, the controller continues to read to the end of the track into cache, anticipating further sequential requests. This form of prefetch is called load forward and is discussed in Section 4.3.2.
The Stack Simulation Problem
The problem with stack simulation of a sector cache is that the valid bits do not obey inclusion. For example, suppose subsector 1 of a sector is referenced and becomes valid. Now suppose that the sector is pushed to level k in the stack; then subsector 2 is referenced. The entire sector must be pulled to the top of the stack in order for subsector 2 to become valid in all cache sizes, but subsector 1 is valid for some sizes (k and larger) and invalid for others.
Our solution is to replace the valid bit with a valid level for each subsector. The valid level is the minimum memory size for which the subsector is still valid, or infinity if the subsector has never been referenced. A reference to any subsector pulls the entire sector to the top of the stack. As in the case of write-back, there is no need to adjust the valid levels as a sector is pushed; if a subsector has a valid level less than the current stack level of the sector, the valid level is set to the current level since no subsector can be valid in smaller cache sizes. Finally, the referenced subsector is assigned a valid level of one since it must be present. The formal algorithm is similar to the one for write-back and is presented as Algorithm 6 (Figure 14).

[Figure 14. Algorithm 6, the stack algorithm for a sector cache. Step annotations, in order: If sector not in stack. Find the stack distance. For each subsector. Fix valid levels. Stack distance for (x, a). Update the read hits. If the stack needs updating. Calculate the push set. (x, a) valid in all sizes.]

The terms are somewhat different from those used previously.
Let:
$X = (x_1, a_1), (x_2, a_2), \ldots, (x_N, a_N)$ be a series of references, where $(x_t, a_t)$ is a reference to sector x, subsector a at time t.
$(x, *)$ = any subsector of sector x.
B = number of subsectors per sector.
$vl_t(x, a)$ = the valid level of subsector (x, a).

Line 5 adjusts the valid level of all subsectors to ensure that none is smaller than the stack level of the sector. Then notice that the count of hits in line 7 is based on $\Delta_a$, the valid level of the referenced subsector, not the stack level of the sector as a whole. For example, in Figure 15, a reference to subsector A1 is a hit at level 4 since the subsector is not present in sizes smaller than 4. On the other hand, a reference to subsector B2 is a hit only at level 2 since the sector as a whole is absent from size 1. However, the stack is updated to the level of the sector as a whole, $\Delta$, since the entire sector is pulled to the top of the stack.
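The valid-level bookkeeping can be sketched for LRU replacement as follows. The function name `sector_lru`, the trace format of `(sector, subsector)` pairs, and the nested dictionary for valid levels are assumptions for illustration; the sketch ignores writes and is not a transcription of Figure 14.

```python
import math
from collections import Counter

def sector_lru(trace):
    """LRU stack analysis of a sector cache using valid levels (sketch).

    trace: iterable of (sector, subsector) pairs -- an assumed input format.
    Returns (hits, vl), where hits[k] counts references that hit in every
    memory of k or more sectors.
    """
    stack = []        # sectors, top of the stack first
    vl = {}           # vl[sector][subsector] = valid level
    hits = Counter()
    for x, a in trace:
        levels = vl.setdefault(x, {})
        if x in stack:
            delta = stack.index(x) + 1             # stack level of the sector
            # Fix valid levels: no subsector can be valid in a memory
            # smaller than the sector's current stack level.
            for sub in list(levels):
                if levels[sub] < delta:
                    levels[sub] = delta
            delta_a = levels.get(a, math.inf)      # level of the subsector
            if delta_a != math.inf:
                hits[delta_a] += 1
            stack.pop(delta - 1)
        stack.insert(0, x)                         # whole sector to the top
        levels[a] = 1                              # referenced subsector valid
    return hits, vl
```

In the trace (s,1), (t,1), (u,1), (s,2), (s,1), the reference to (s,2) misses in every size but pulls sector s to the top; the later reference to (s,1) is then a hit only in memories of three or more sectors.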
Extensions
4.3.1 Write-Back. The first obvious extension is to consider write-back with a sector cache. Since these are independent, they can be combined by maintaining a dirty level in addition to the valid level. The dirty level could be associated with the entire sector if the cache writes back entire sectors. However, since part of the motivation for the subsector cache is to reduce bus traffic, the dirty level is more logically associated with each subsector. The algorithm is similar to those already presented.
4.3.2 Load-Forward. Load-forward is a form of prefetch associated with sector caches [14]. After loading a requested subsector, successive subsectors are loaded until the end of the sector. As with any prefetch, this reduces the miss ratio because of the strong probability of sequential references. However, unlike normal prefetch, there is no chance that load forward can cause memory pollution [30] by displacing a soon-to-be-referenced sector. Load forward may be implemented as either a demand or nondemand prefetch policy. However, we show that even with demand prefetch, load forward is a stack algorithm. The key that distinguishes load forward from other forms of prefetch is that load forward always prefetches to the end of a sector and no farther, rather than prefetching a fixed number of blocks or subsectors. To show that it is a stack algorithm, we again imagine a general algorithm for load forward (Algorithm 7, Figure 16) which does not assume that inclusion holds. The algorithm uses a stack to determine which sector is replaced at each reference but keeps a separate memory set $M_t(C)$ of all subblocks present in size C. For simplicity we ignore writes. In addition to symbols previously defined let:
$(x, a)^+$ = (x, a) plus the set of all sectors/subsectors prefetched with subsector (x, a).
$M_t(C)$ = the set of valid subsectors in memory of size C = $\{(x, a) : (x, a) \text{ in memory of size } C \text{ at time } t\}$.
$s_t(C)$ = the set of valid subsectors of the sector that is at stack level C = $\{(x, a) : (x, a) \in M_t(C),\ (x, i) \notin M_t(j),\ 1 \le i \le B, \text{ for all } j < C\}$. Note that there can be subsectors which are valid at larger levels but not at C.
$y_t(C)$ = the sector pushed from size C.
To prove that load forward is a stack algorithm, we want to show that inclusion still applies, that is, $M_t(C) \subseteq M_t(C + 1)$ for all t and C. We can immediately think of a situation where this will be violated. For example, suppose (x, a) prefetches (x, b), and these subsectors are valid at the levels shown in Figure 17(a). Now let (x, a) be referenced. For all sizes less than k, (x, a) is fetched, prefetching (x, b). For sizes greater than k, (x, a) is not fetched; therefore (x, b) is not either because of prefetch only on demand. The valid levels, which are shown in Figure 17

[Figure 16. Algorithm 7, the general load-forward algorithm. Step annotations, in order: For all events. If sector not in stack. Find the sector distance. Find the subsector distance. Update the read hits. If the stack needs updating. Establish new memory.]

However, suppose the initial configuration is reversed, as shown in Figure 18(a). The result of a reference to (x, a) is that (x, b) is prefetched for all sizes less than l (although it only needs to be accessed from secondary storage for sizes less than k), resulting in both subsectors becoming valid in all sizes; see Figure 18(b). Therefore, a necessary condition for inclusion is that the first configuration can never occur using load forward.
An intuitive argument that this is the case is the following. The first time that subsector (x, a) is referenced, both (x, a) and (x, b) are fetched and become valid in all sizes. As sector x is pushed, both simultaneously become invalid in the

PROOF. Choose an arbitrary size C. The condition is certainly true at the start when the cache is empty. Assume the induction hypothesis that the condition holds at time t - 1. We show that it holds after the reference at time t. Consider the possible configurations of (x, a) and (x, b) that could lead to (x, a) present at time t. The condition is therefore true if the prefetch sets obey the transitive condition that, if $(x, b) \in (x, a)^+$ and $(x, a) \in (x, c)^+$, then $(x, b) \in (x, c)^+$. This condition is satisfied by load forward since it loads the rest of the sector (but would not be if it loaded just the next subsector, for example). □
We can now show formally that Algorithm 7 satisfies inclusion. PROOF. It is certainly true at the start. Assume it is true at time t -1 for an arbitrary size C. Again, we add and subtract blocks from both sides, which preserves the subset relation, and arrive at an expression which shows that it is true at time t:
By an argument similar to the one used to prove Theorem 2.1, $y_t(C + 1)$ can have three possible values, none of which affect the subset of the left-hand side of the relation.

[Figure 19. Algorithm 8, the load-forward stack algorithm using valid levels. Step annotations, in order: For all events. If sector not in stack. Find the sector distance. Fix valid levels. Find subsector distance. Establish new stack. $(x, a)^+$ valid in all sizes.]
Because of inclusion, we can convert Algorithm 6 to a load-forward algorithm using valid levels as indicated in Algorithm 8 ( Figure 19 ).
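A load-forward variant of the valid-level computation might look like the sketch below for LRU replacement. The function name `load_forward_lru`, the numbering of subsectors from 1 to B, and the trace format are assumptions for illustration. The only change from a plain sector-cache analysis is the last step, which makes the referenced subsector and every later subsector of the same sector valid in all sizes.

```python
import math
from collections import Counter

def load_forward_lru(trace, subsectors_per_sector):
    """LRU stack analysis of a sector cache with load forward (sketch).

    trace: iterable of (sector, subsector) pairs, with subsectors numbered
    1..subsectors_per_sector (an assumed convention).
    Returns (hits, vl).
    """
    stack = []        # sectors, top of the stack first
    vl = {}           # vl[sector][subsector] = valid level
    hits = Counter()  # hits[k]: references that hit in all sizes >= k
    for x, a in trace:
        levels = vl.setdefault(x, {})
        if x in stack:
            delta = stack.index(x) + 1
            for sub in list(levels):               # fix valid levels
                if levels[sub] < delta:
                    levels[sub] = delta
            delta_a = levels.get(a, math.inf)
            if delta_a != math.inf:
                hits[delta_a] += 1
            stack.pop(delta - 1)
        stack.insert(0, x)
        # Load forward: (x, a) and all following subsectors of x become
        # valid in every memory size.
        for sub in range(a, subsectors_per_sector + 1):
            levels[sub] = 1
    return hits, vl
```

With four subsectors per sector and the trace (s,2), (t,1), (s,3), the reference to (s,3) hits in memories of two or more sectors because it was loaded forward with (s,2); afterwards subsector 2 is valid only from size 2 upward, while subsectors 3 and 4 are valid in every size.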
EXPERIMENTAL RESULTS
In this section we demonstrate two advantages of write-back analysis by reporting the results of simulations using the technique. The first result shows that stack analysis can be much faster than approximating the miss or transfer ratio using several single-size simulations. The second shows that other useful statistics can be produced as a by-product of write-back stack analysis.
In this case, we analyze the probability that the pushed block is dirty as a function of cache size.
The Trace Data
The traces used in these comparisons consist of instruction and data addresses from the execution of programs on one of several machines. They represent a variety of different applications in three different languages. The traces are FGO1 (IBM 370, Fortran execution, factor analysis), FGO2 (IBM 370, Fortran execution, analysis of satellite data), MVS (standard MVS operating system workload at Amdahl Corp.), LISPCOMP (VAX, LISP compiler, written in LISP), SPICE (VAX, Spice circuit simulator, written in Fortran), VAXIMA (VAX symbolic algebraic manipulation program derived from Macsyma, written in LISP), and RISC (simulated execution of a C compiler for a RISC-architecture processor). All traces except MVS represent the execution of a single program. Most have been used in previous studies [25, 33]. We used a 16-byte block size for all traces, as used in [33]. The simulations using these traces considered only data caching; instruction fetches were ignored because they are never writes, and we also wished to compare the write-back results to those of Smith [33].
Two sets of simulations were done with each program address trace. First, each was simulated as if it were a stand-alone program. This provides a characterization of each program but generally gives an optimistic prediction of the actual performance of the program in a multiprogramming environment [8]. In the second experiment, we used the technique proposed by Smith [32] to approximate the effects of multiprogramming by writing all dirty blocks and flushing the cache after a fixed time quantum, in our case every 20,000 memory references. The simulations without flushing were warm-started; the others were not, producing some variation in the number of read/write references shown in Table I for the same trace file.

Note to Table I: This table shows the number of references considered from each trace file and the percentage of these references which were reads or writes. It ignores instruction fetches in the program address traces. The fourth column shows the number of unique blocks in the trace, and the next column gives the mean stack size. Flushing reduces the mean stack size, as do deletions from the file system traces. The final two columns show the number of blocks that are written at some time during the trace (dirty blocks) and the mean number of dirty blocks in the stack. These are also shown as percentages of the unique blocks and mean stack size, respectively.
By way of comparison, we also ran simulations using UNIX 4.2 BSD file system traces generated on three university research computers. The traces are identified by machine: ARPA was a VAX 11/780 used for operating system research and development and text processing; ERNIE was a VAX 11/780 used by staff and graduate students for program development and text processing; CAD was a VAX 11/750 used for computer-aided design research. All three machines were also used extensively for electronic mail. These traces were analyzed in detail by Ousterhout et al. [24]. The trace events show logical file creation, deletion, opens, closes, and seeks. Actual reads and writes were not recorded; however, each close or seek event includes the range of bytes read or written since the last positioning event for the file. The simulator recreates reads and writes in block-size units based on this information.
These traces tend to overestimate the miss ratio since some of the simulated reads/writes were actually several small requests. For these simulations, we used a block size of 4096 bytes, consistent with common UNIX 4.2 BSD usage. There is no information on program paging or file system overhead activity such as directories. All files are identified by a logical identifier; there is no data on physical location [24]. Table I shows the general characteristics of the traces. We processed approximately 500,000 references from each file after warm start, but the count of references in the first column only includes the data references, ignoring instruction fetches for the program address traces and opens/closes for the file system traces. The second and third columns show the percentage of these references that are reads or writes. We initially speculated that writes would be a more significant percentage of the file system traces. However, when instruction fetches are excluded from the traces to simulate a data-only cache, the fraction of writes is comparable. We conclude that writes are a factor that should not be overlooked in any cache design.
General Characteristics
The fourth column shows the number of unique blocks seen in each trace file. This number is exaggerated by cache flushing because a block reloaded after a flush is considered a new block. It is clear, however, that the file system traces come from a much larger population of blocks. The fifth column shows the mean stack size, that is, the average number of valid blocks in an infinite-size cache. These two columns together indicate the range of interesting cache sizes for study. For example, program address traces with flushing seldom use more than a few hundred blocks, whereas 10,000 blocks may be too few for a file system simulation. Notice that the mean stack size for the program address traces without flushing is generally about half the total number of blocks for the same trace, whereas the mean stack size for file system traces is less than 10 percent of the number of blocks. This is because nearly 90 percent of the file system blocks are deleted. Deletions do not decrease the stack size, but they do leave gaps which allow other blocks to be added without increasing the stack size.
The final two columns indicate the impact of write activity. The column labeled dirty blocks shows the number of blocks which are ever written. This figure is also shown as a percentage of the number of unique blocks. The fraction of the blocks in the cache which are dirty will obviously affect the chances that a block must be written when it is pushed. The file system traces have far more dirty blocks (84 compared to 50 percent). The final column shows the mean number of dirty blocks in the cache, shown also as a percentage of the mean stack size. Although there is a wide variation between the individual program address traces, we find that, on the average, the fraction of the cache which is dirty is about the same, near 44 percent, for both program data and files. This value seems surprisingly low for the file system traces, considering that 88 percent of all blocks are written. The reason for this result is that most blocks are deleted before they are pushed very far down the stack, and most of the deleted blocks have been written. The blocks that "survive" account for most of the blocks in the cache.

Note to Table II: Comparison of run-time simulations using nonstack and stack techniques. The nonstack simulation computes the miss ratio for a single cache size, whereas the stack simulations compute all sizes in one run. The simple stack simulation uses a linked-list implementation of the stack. The best stack simulation uses one of the alternative implementations discussed in [35].
Run-Time Comparison
As we stated earlier, the chief advantage of stack analysis is that it allows the desired metrics to be calculated for all cache sizes in a single pass of the trace data. Although the overhead of maintaining the memory stack usually makes stack analysis take longer than the simulation of a single cache size, it should take only a fraction of the time required to produce a reasonable curve using several single-size simulations. We have used our write-back stack algorithm in the analysis of a variety of trace data and find that this time savings does not always occur. For example, file system traces typically exhibit much poorer locality than single-program address traces. This results in excessive run times using the straightforward implementation of the stack simulator. Several techniques exist to reduce the execution time of stack analysis by using a tree-based representation of the stack [23, 35]. Table II shows the execution times for simulations using several traces of various types described earlier. All simulations use LRU replacement. The first column shows the time required to compute a point on the miss or transfer ratio curve using a simple simulation of a single cache size. The simple simulation maintains an LRU list to determine the block to be removed from cache if necessary but uses a hash table to find the referenced block in the list. Therefore, the running time is independent of cache size. The second column shows the time for stack analysis to compute the entire curve using a naive implementation that searches the simulation stack from the top to find the stack distance of the referenced block. Also shown is the number of single-size simulations that could be done in the same time. We can see that stack analysis of the file system traces is not efficient using this implementation. The third column shows the time required using the best stack implementation presented in [35].
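The naive stack implementation described above can be illustrated with a minimal Python sketch. It is not the simulator used to produce Table II: the function names and the trace format (a plain sequence of block identifiers) are assumptions made for illustration, and flushing, deletions, and write-backs are ignored. The linear search from the top of the stack is what makes the run time depend on locality.

```python
from collections import Counter

def lru_stack_distances(trace):
    """Return a Counter of stack distances; first references are counted under None."""
    stack = []                          # stack[0] is the most recently used block
    hist = Counter()
    for block in trace:
        try:
            depth = stack.index(block)  # linear search from the top of the stack
        except ValueError:
            hist[None] += 1             # never seen before: a miss in every cache size
        else:
            hist[depth + 1] += 1        # 1-based stack distance
            del stack[depth]
        stack.insert(0, block)          # the referenced block moves to the top
    return hist

def miss_ratios(trace, max_size):
    """Miss ratio of every LRU cache size 1..max_size from one pass over the trace."""
    hist = lru_stack_distances(trace)
    total = sum(hist.values())
    ratios = {}
    hits = 0
    for size in range(1, max_size + 1):
        hits += hist.get(size, 0)       # a reference at distance d hits in all sizes >= d
        ratios[size] = (total - hits) / total
    return ratios

if __name__ == "__main__":
    for size, ratio in miss_ratios([1, 2, 1, 3, 2, 1], 4).items():
        print(size, round(ratio, 3))
```

A trace with poor locality finds most blocks deep in the stack, so each `stack.index` call scans many entries; this is the cost that the tree-based representations are meant to reduce.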
All times are in seconds for simulations of approximately 500,000 references running on a VAX 11/750. We see that the stack algorithm takes on the average 22 percent more time for memory address traces and twice as long for the file system traces, as compared to the single-size simulation. However, it reduces the execution time by as much as 90 percent for the program address traces, and at least 80 percent for the file system traces, when compared to the time required to approximate the miss/transfer ratio curves using 10 nonstack simulations.
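The savings quoted above follow from simple arithmetic, sketched here in Python. The function name is hypothetical, and the inputs are the average ratios stated in the text (1.22 and 2.0 times the cost of one single-size run, compared against 10 nonstack runs), not the per-trace entries of Table II.

```python
def time_saved(stack_cost, n_single_runs):
    """Fraction of run time saved by one stack run versus n single-size runs.

    stack_cost is the stack run time in units of one single-size run.
    """
    return 1.0 - stack_cost / n_single_runs

program = time_saved(1.22, 10)   # stack run takes 22 percent longer than one single-size run
filesys = time_saved(2.0, 10)    # stack run takes twice as long as one single-size run

print(f"program address traces: {program:.1%}")
print(f"file system traces:     {filesys:.1%}")
```

With the average figures this gives about 88 percent for the program address traces and 80 percent for the file system traces; individual traces with a smaller overhead than the average would come closer to the 90 percent figure.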
Write-Back Probability
Before the discovery of our write-back stack algorithm, an attempt was made to estimate write-back traffic in the following way. Each time a miss occurs (ignoring gaps), a block is pushed from all cache sizes. The write-back traffic should be approximately equal to the miss ratio times the probability that the block pushed from cache is dirty [33]. Smith found for data blocks that about half of all pushes from a 16 Kbyte data cache are dirty, but with wide variations between programs. He also reasoned that the probability that a push was dirty increases with cache size, since the pushed block will have been resident longer and hence have a higher probability of having been written.
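The estimate just described can be compared with a direct count using a small single-size simulation. The following Python sketch is an illustration only and is not the write-back stack algorithm: it simulates one LRU write-back cache size per call, and the function name, the result keys, and the trace format (a list of `(op, block)` pairs with `op` equal to `'r'` or `'w'`) are assumptions. Flushing and deletions are ignored.

```python
from collections import OrderedDict

def simulate_write_back(trace, size):
    """Simulate one LRU write-back cache of the given size (in blocks)."""
    cache = OrderedDict()            # block -> dirty flag, least recently used first
    misses = pushes = dirty_pushes = 0
    for op, block in trace:
        if block in cache:
            cache.move_to_end(block)                      # hit: block becomes most recent
        else:
            misses += 1
            if len(cache) >= size:
                _, was_dirty = cache.popitem(last=False)  # push the LRU block
                pushes += 1
                dirty_pushes += was_dirty
            cache[block] = False                          # a loaded block starts clean
        if op == 'w':
            cache[block] = True                           # a write marks the block dirty
    n = len(trace)
    return {
        "miss_ratio": misses / n,
        "p_dirty_push": dirty_pushes / pushes if pushes else 0.0,
        "write_back_ratio": dirty_pushes / n,
    }

if __name__ == "__main__":
    trace = [('w', 1), ('r', 2), ('r', 3), ('w', 2), ('r', 1), ('r', 4)]
    result = simulate_write_back(trace, 2)
    estimate = result["miss_ratio"] * result["p_dirty_push"]
    print(result, "estimate:", round(estimate, 3))
```

On a short trace the product of miss ratio and dirty-push probability overstates the counted write-back ratio, because misses that occur before the cache fills push nothing; the two should agree more closely on long traces.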
Write-back stack analysis allows us to compute the probability of a dirty push directly and to validate the previous finding for all cache sizes. Although the probability of a dirty push is not needed directly in the computation of transfer ratios using equation (2.3), it is useful in its own right, for example, as a parameter of a queuing model of a memory system (see [1]).
In Figure 20 we show the probability that a pushed block is dirty as a function of size. We see that on the average the projected increasing trend holds, although the probability for our traces was closer to 40 percent. However, the trend for individual traces is not consistent, and some show a distinct downward trend. Tables III-V show the same data, as well as the percent of references which are writes and the percent of blocks which are ever written (percent dirty). Notice that the two traces which show the strongest downward trend are the two LISP traces, LISPCOMP and VAXIMA. These are also the only two which have a higher percent write than percent dirty. We believe there is a relation between these observations, as explained below.
First, notice that the probability of a dirty push from a single-block cache is close to the probability of a write for all traces. This is certainly reasonable since most blocks are pushed from the single-block cache shortly after they are referenced. For other small cache sizes (in the range 2-15), the chance that a block has been written does increase as predicted, and most of the traces exhibit an upward trend. However, some blocks are never written, and the chance that a clean block will ever be written decreases as it is pushed down the stack. At the other extreme, notice that without flushing (Table IV) the number of pushes eventually reaches zero for cache sizes that hold all the blocks of the program. Therefore, when flushing is used (Table III), all of the pushes from large cache sizes are due to flushing. The probability that a block flushed from a large cache is dirty should be very close to the fraction of blocks which are dirty, as it is.
Since we can predict the probability of a dirty push from both small and large caches, we naturally expect that the trend should be from the percentage of writes to the percentage of dirty blocks. This explains the observed results, but suggests that they may be an artifact of the flushing methodology. We believe this is not the case and offer another explanation.
In Table VI we classify all blocks into one of four classes: read-only, read/write (in no particular order), write-only, and write-once/read (e.g., a variable which is initialized and subsequently only read). We expect the latter class to be small in program address traces and larger in the file system traces. The table also classifies all references as to the class of block they reference. Both of the LISP traces show a surprisingly large fraction of read-only blocks. At the same time, these blocks receive a relatively low fraction of references. Therefore, a few "dirty" blocks receive most of the references and are more likely to stay near the top of the stack. Thus, the blocks being pushed from larger caches are more likely to be from the large class of clean read-only blocks. This generally explains the decline in the probability of a dirty push, independent of whether we use flushing. We cannot say whether this phenomenon is characteristic of LISP programs in general.
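A classification of this kind can be sketched in a few lines of Python. The exact rules used for Table VI are not given here, so the definitions below are assumptions: a block is taken to be write-once/read if it is written exactly once, as its first reference, and only read afterward, and any other mix of reads and writes is taken to be read/write. The function names and the `(op, block)` trace format are also hypothetical.

```python
from collections import Counter

def classify_blocks(trace):
    """Assign each block to one of four classes from its sequence of 'r'/'w' references."""
    ops = {}
    for op, block in trace:
        ops.setdefault(block, []).append(op)
    classes = {}
    for block, seq in ops.items():
        writes = seq.count('w')
        if writes == 0:
            classes[block] = "read-only"
        elif writes == len(seq):
            classes[block] = "write-only"
        elif writes == 1 and seq[0] == 'w':
            classes[block] = "write-once/read"   # assumed rule: initialized, then only read
        else:
            classes[block] = "read/write"
    return classes

def reference_share(trace, classes):
    """Fraction of all references that go to blocks of each class."""
    counts = Counter(classes[block] for _, block in trace)
    return {cls: n / len(trace) for cls, n in counts.items()}

if __name__ == "__main__":
    trace = [('w', 'a'), ('r', 'a'), ('r', 'a'), ('r', 'b'),
             ('w', 'c'), ('w', 'c'), ('r', 'd'), ('w', 'd')]
    classes = classify_blocks(trace)
    print(classes)
    print(reference_share(trace, classes))
```

Comparing the fraction of blocks in a class with the fraction of references it receives is the comparison made above for the LISP traces: a large read-only class that receives few references tends to sink toward the bottom of the stack.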
In passing, we note that the UNIX file system traces also show an increasing trend in the probability that a pushed block is dirty, although not nearly as much as might be predicted based on the fact that 85 percent of all blocks are eventually written. The difference again is caused by deletions, leaving about half of the remaining blocks dirty. We also note that the class of write-once/read is very common for file system blocks, as predicted.
Another interesting observation from Table VI is the high percentage of write-only blocks in the file system and some program address traces. Because of the low percentage of references to these blocks it is likely that most were only written once. For the program traces, these can be explained by the initialization of a large data area (or perhaps a program area in the MVS trace) that is not used again during the observed portion of the trace. There are several possible explanations for the file system traces. The first is files that are created or written during the trace and simply not read before the trace ended (just over two days). The second possibility is files that are written and then deleted without being read. A file that is entirely overwritten is also considered to have been deleted. Common examples of this second type are editor recovery files, object files containing errors, and rwho status files which are written every three minutes (and rarely read during the evening hours). A third possibility is files which are truly write-only such as accounting or log files. However, our analysis shows these to account for less than five percent of the blocks.
CONCLUSIONS
In this paper we have shown how stack analysis can be extended to two important new areas. The ability to collect transfer ratios, considering both reads and writes, for all memory sizes in a single pass reduces simulation time by as much as 90 percent compared to running 8-10 individual simulations, making this metric much more reasonable to collect. The transfer ratio is increasingly important in the study of shared-memory systems, including multiprocessor caches and network file systems. Equally important, the ability to easily simulate sector caches, including writes and a form of prefetch, opens up a variety of new cache designs to efficient analysis.
