Abstract: Memory hierarchies have long been studied by many means: system building, trace-driven simulation, and mathematical analysis. Yet little help is available for the system designer wishing to quickly size the different levels in a memory hierarchy to a first-order approximation. In this paper, we present a simple analysis for providing this practical help and some unexpected results and intuition that come out of the analysis. By applying a specific, parameterized model of workload locality, we are able to derive a closed-form solution for the optimal size of each hierarchy level. We verify the accuracy of this solution against exhaustive simulation with two case studies: a three-level I/O storage hierarchy and a three-level processor-cache hierarchy. In all but one case, the configuration recommended by the model performs within 5% of optimal. One result of our analysis is that the first place to spend money is the cheapest (rather than the fastest) cache level, particularly with small system budgets. Another is that money spent on an n-level hierarchy is spent in a fixed proportion until another level is added.
B. Jacob, P. Chen, and T. Mudge are with the Advanced Computer Architecture Laboratory, Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109-2122. E-mail: blj@eecs.umich.edu; pmchen@eecs.umich.edu; tnm@eecs.umich.edu.
S. Silverman is with Chelmsford Systems Software Lab, Hewlett-Packard, Chelmsford, MA 01824. E-mail: seth@apollo.hp.com
Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or distribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
I. Introduction
Fast memory and storage systems are vital to achieving good system performance, as CPU speeds increase faster than memory and disk speeds. Almost all systems use caching throughout the disk, memory, and processor subsystems to improve the average time to access data, but the widening gap between storage technologies makes it easy to lose significant performance through poor cache sizing. Unfortunately, little practical help exists for system designers and administrators seeking to optimize their cache hierarchies. Exhaustive simulation takes far too long, particularly as hierarchies become more complex [16]; trial and error on running systems is usually impossible; and prior mathematical analyses have stopped short of providing much-needed, intuitive insight into cache sizing [10] or have assumed the availability of memory technologies with arbitrary speeds and costs [20].
In this paper, we analyze the performance of a general, n-level memory hierarchy using a parameterized workload characterization. Our intent is to derive a simple and intuitive method for quickly generating first-order approximations of optimal hierarchy organizations. To this end, we make several simplifications and compare the results to those obtained by more accurate simulation techniques.
The result is a simple, closed-form solution for the size of each level of the hierarchy as a function of the workload locality, the speed and cost of available technologies, and the amount of money available to spend on the system. We validate the model with trace-driven simulations of a three-level processor-cache hierarchy and a three-level storage hierarchy. The cache sizes recommended by our model perform close to the optimal performance as determined by exhaustive simulation.
With little money to spend on the hierarchy, the model recommends spending it all on the cheapest, slowest storage technology rather than the fastest. This is contrary to conventional wisdom, which focuses on satisfying as many references as possible in the fastest cache level, such as the L1 cache for processors or the file cache for storage systems. Interestingly, it does reflect what has happened in the PC market, where processor caches have been among the last levels of the memory hierarchy to be added. We discuss why initial money is best spent on slow technologies and hope that this paper helps to improve the intuition of those configuring caches.
The model also suggests that every dollar spent on an n-level hierarchy be spent in a fixed proportion; every dollar should increase the size of every level in the hierarchy, not just one. This is described further in the discussion of the analysis (Section IV).
II. Previous Work
Countless articles have been written about memory hierarchies; [17, 18] provide excellent overviews of CPU and disk caches, generally focusing on a two-level hierarchy [9]. Most papers in recent years have used trace-driven simulation to investigate such aspects of cache performance as multiprocessor cache coherence and replacement strategies. Trace-driven studies are valuable for understanding cache behavior on specific workloads, but they are not easily applied to other workloads [17].
Unlike traces, mathematical analysis lends itself well to understanding cache behavior on general workloads, though such generality usually leads to less accurate results. Many researchers have analyzed memory hierarchies in the past. Chow showed that the optimum number of cache levels scales with the logarithm of the capacity of the cache hierarchy [3, 4]. Garcia-Molina and Rege demonstrated that it is often better to have more of a slower device than less of a faster device [7, 15]. Welch showed that the optimal speed of each level should be proportional to the amount of time spent servicing requests at that level [20].
These studies have had two shortcomings: (1) they assume the availability of memory technologies with arbitrary speeds and costs, and (2) they do not apply their analyses to a specific model of workload locality. Being able to create and use technologies on a continuum of characteristics is convenient for analysis but makes the analysis difficult for system builders to use. Failing to apply a specific model of workload locality makes it impossible to provide an easily used, closed-form solution for the optimal cache configuration [10], and so results from these papers have contained dependencies on the cache configuration: the number of levels, or the sizes and hit rates of the levels.
We provide three main contributions beyond those of previous analyses:
- We extend previous analyses by applying our general solution to a specific model of workload locality. We are thus able to provide a closed-form solution for the optimal size of each hierarchy level as a function of two locality parameters and the device speeds and costs.
- We discuss what the resulting equations mean intuitively to system designers in terms of their general locality and available device characteristics.
- We verify our model's accuracy against a detailed simulation of two memory hierarchies: (1) a storage hierarchy consisting of RAM, disk, and tape, and (2) a three-level processor-cache hierarchy consisting of an on-chip cache (L1), an off-chip cache (L2), and main memory. The performance of the cache configuration recommended by our model is almost always within 5% of the best performance obtained from exhaustive simulation.
III. Analysis
In this section, we derive an analytic solution for the size of each level in a cache hierarchy. The analysis starts with a pre-specified set of technologies, though the resulting equations may be used easily to choose an optimal subset of technologies from the universe of technologies.
A. The System Model

A.1 Notation
In the following analysis, a hierarchy will consist of n cache levels, numbered 1 through n, with the backing store considered to be level n+1. Fig 1 shows a typical hierarchy; the values for all $c_i$ and $t_i$ (the unit cost and access time of level i) are known constants. We assume that the choice of hierarchy technologies is made first. We also assume that any hierarchy will be a realistic one in that level i is always faster and more expensive than level i+1.
The total cost of the cache hierarchy system is the sum of the costs of cache levels 1 through n. For the sake of simplicity, we will assume a linear cost model.$^{1}$ The system budget, B, is given by:

$$B = \sum_{i=1}^{n} c_i s_i \qquad (1)$$

where $s_i$ is the size of level i.
A.2 Stack Distance Curves
Our analysis for cache performance depends on a mathematical description of workload locality and models a fully associative cache and a hierarchy that maintains inclusion. The effects of these assumptions are discussed at the end of this section. To compute the probability of a reference hitting at a cache level, we use stack distance curves, measurements taken directly from address streams [5].
The stack distance curves describe how many unique bytes of data separate two references to the same item. For instance, consider a reference stream $A_1\,C_1\,B_1\,B_2\,C_2\,A_2\,B_3$, where $A_n$ refers to the nth time that datum A is referenced. Then there are two unique data touched between $A_1$ and $A_2$, which are C and B, and the stack distance between $A_1$ and $A_2$ is 3. There is one unique datum touched between $C_1$ and $C_2$, which is B, and the stack distance between $C_1$ and $C_2$ is 2; there are no data touched between $B_1$ and $B_2$, and the stack distance is 1; there are two data touched between $B_2$ and $B_3$ (C and A), so the stack distance is 3.
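The worked example above can be reproduced with a few lines of code. The following is a minimal sketch, assuming an LRU stack held in a plain list; the function name `stack_distances` is introduced here for illustration, and first references (which have no earlier reference to measure against) are reported as `None`.

```python
def stack_distances(refs):
    """Return the stack distance of each reference (None for a first touch).

    The distance follows the convention used in the text: one plus the number
    of unique items touched since the previous reference to the same item.
    """
    stack = []          # LRU stack; the most recently used item is at the end
    distances = []
    for item in refs:
        if item in stack:
            depth = len(stack) - stack.index(item)   # 1 means top of the stack
            stack.remove(item)
            distances.append(depth)
        else:
            distances.append(None)                   # first reference to this item
        stack.append(item)                           # item becomes most recent
    return distances


# The reference stream from the text: A C B B C A B
print(stack_distances(list("ACBBCAB")))   # [None, None, None, 1, 2, 3, 3]
```

The list-based stack costs time proportional to the number of distinct items per reference, which is adequate for illustration but not for long traces.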
After normalizing, we can plot this distribution of stack distances as a cumulative probability function and a probability density function. The cumulative probability function shows, at each x value, how many references were made at a stack distance of x or less, that is, how many references were separated from the last reference to the same data by fewer than x unique pieces of data (Fig 2a). This relates well to an LRU-managed, fully associative cache; a cache of size 1 will catch all references of stack distance 1 (those references with 0 intervening data), a cache of size 2 will catch all references at stack distance 1 and 2, and so on. The cumulative probability function P(x) indicates what proportion of accesses are to data at a stack distance of x or smaller. An LRU-managed, fully associative cache of size x would therefore have a hit ratio of P(x).
The probability density function is the derivative of the cumulative probability function; p(x) describes the frequency of references at exactly stack distance x (Fig 2b). Fig 2 plots the expected shape of these curves. We primarily use the density function. As we expect most workloads to have some locality, Fig 2b graphs more pairs of accesses being separated by few data than by more data. The area under the density function between $X_1$ and $X_2$ describes the probability of an access being separated from its last reference by more than $X_1$ and less than $X_2$ pieces of unique data. With fully associative caches, this is exactly the probability of a reference missing in a cache of size $X_1$ and hitting in the next cache level of size $X_2$.
With caches that are not fully associative, conflict misses can occur, where the size of the cache is large enough to capture a reference stream but the placement policy causes additional misses. For the same reason, caches that are not fully associative may yield hits on data that is old but, due to the placement policy, has not yet been displaced.
Fig 3 illustrates the difference between fully and non-fully associative caches. With a fully associative cache, we can draw an exact line on the probability density curve separating cache hits from cache misses based on capacity and inter-reference distance. With a non-fully associative cache, we may only specify a probability distribution of hits on the probability density curve. References closer to the y-axis are more likely to be hits than references farther from the y-axis.
This paper conducts a first-order analysis using fully associative caches. Section V verifies that this analysis is also accurate for non-fully associative caches.
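As a small illustration of how P(x) maps onto hit ratios, the sketch below builds the cumulative curve from a list of measured stack distances and reads off the probability of missing in one cache size and hitting in the next. The function names are hypothetical, distances are treated as integers, and first references are excluded from the normalization, which is an assumption made here for simplicity.

```python
from collections import Counter


def hit_ratio_curve(distances, max_size):
    """Return [P(1), ..., P(max_size)]: the fraction of re-references whose
    stack distance is at most x, i.e. the hit ratio of a fully associative
    LRU cache holding x items. First touches (None) are ignored."""
    reuse = [d for d in distances if d is not None]
    if not reuse:
        return [0.0] * max_size
    counts = Counter(reuse)
    curve, hits = [], 0
    for size in range(1, max_size + 1):
        hits += counts[size]                 # references caught by one more slot
        curve.append(hits / len(reuse))
    return curve


def level_hit_probability(curve, x1, x2):
    """Probability of missing in a cache of size x1 and hitting in one of size x2."""
    return curve[x2 - 1] - curve[x1 - 1]


# Stack distances of the example stream A C B B C A B (None marks a first touch).
distances = [None, None, None, 1, 2, 3, 3]
curve = hit_ratio_curve(distances, 3)
print(curve)                               # [0.25, 0.5, 1.0]
print(level_hit_probability(curve, 1, 3))  # 0.75
```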
A.3 Average Time per Reference
Chow and Welch use as a performance measure the average time per memory reference, and model it with the following equation:

$$T = \sum_{i} t_i P_i$$

Each $t_i$ is the time to access level i in the hierarchy, and each $P_i$ is the probability that level i will be accessed. The hierarchy maintains inclusion and the probabilities do not necessarily sum to one; the topmost level is accessed on every single reference (hit or miss), so $P_1$ is 1:

$$1 = P_1 \ge P_2 \ge P_3 \ge \cdots \ge P_n$$

In our analysis, the stack distance curves are used to compute $P_i$. The probability of accessing level i is equal to the probability that the reference will miss in all the levels above it:

$$P_i = \int_{s_{i-1}}^{\infty} p(x)\,dx$$

where p(x) is the probability density function, $s_{i-1}$ is the size of level i-1, and $s_0 = 0$. The average system time spent per reference accessing level i is thus the time to reference level i scaled by this probability:

$$t_i \int_{s_{i-1}}^{\infty} p(x)\,dx$$

and the total system time spent per reference is the sum of the times across all levels in the hierarchy:

$$T = t_1 + \sum_{i=1}^{n} t_{i+1} \int_{s_i}^{\infty} p(x)\,dx \qquad (2)$$
The size of the bottom storage level n+1 does not appear in the equation, since this level is assumed to contain all data, so $s_{n+1}$ is for all intents infinite. The time to reference this level does appear, scaled by the miss rate of the lowest cache level. As we expect, backing store is only referenced on misses to the lowest cache level.
Our goal is to specify the memory hierarchy with the fastest average access time T. Specifically, we solve for the size of each hierarchy level $s_i$, given the access time $t_i$ and unit cost $c_i$ of each technology; parameters describing workload locality; and the total system budget. Our solution proceeds through the following steps: (1) use Lagrange multipliers [2] to get a general solution without constraining the sizes to be non-negative; (2) apply a specific, parameterized model of workload locality to derive a closed-form solution, again allowing negative sizes; (3) refine the solution to account for the additional constraint that all sizes be non-negative.
Fig 2. The stack distance curves. These functions describe the degree of locality in a workload. They show the distribution of how many unique bytes the workload touches between references to the same item. The cumulative probability function is a plot of hit rate versus cache size, for an LRU-managed cache. We use these graphs to compute the probability that a workload's reference hits in a given cache size.
Fig 3. (a) Fully associative; (b) direct-mapped. The percentage of hits is therefore the area under the curve, a simple integral over the probability density function. The direct-mapped case introduces an element of chance, as a particular line in the cache could be thrown out on the next cache reference, or it could last in the cache for an unusually long period of time, all depending on the particular address reference stream. It is much more difficult to draw a solid line between the hits and misses; the line is blurred, as illustrated in (b). However, it can be modeled probabilistically, where we know with high probability that the references at the extremes (those near in time and those distant in time) will be hits and misses, respectively. However, in the middle ranges, the chances of guessing correctly grow worse, depicted by darkening shades of grey.
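For any candidate set of sizes, the average time per reference of Eq 2 can be evaluated directly. The sketch below is a minimal one under stated assumptions: the function name `average_access_time` and the `tail` callback (the miss ratio of a cache of size x, i.e. the integral of p(x) from x to infinity) are introduced here for illustration, and the example locality function and numbers are hypothetical rather than measured.

```python
def average_access_time(sizes, times, tail):
    """Average time per reference for cache sizes s_1..s_n.

    sizes : [s_1, ..., s_n], the size of each cache level
    times : [t_1, ..., t_{n+1}], access time of each level, backing store last
    tail  : function x -> probability that a reference misses in a cache of
            size x (the integral of p(x) from x to infinity)
    """
    if len(times) != len(sizes) + 1:
        raise ValueError("need one access time per cache level plus the backing store")
    total = times[0]                       # level 1 is accessed on every reference
    for size, next_time in zip(sizes, times[1:]):
        total += next_time * tail(size)    # level i+1 is reached on a miss in level i
    return total


# Hypothetical locality function and device times, for illustration only.
example_tail = lambda x: 1.0 / (1.0 + x)
print(average_access_time([1.0, 3.0], [1.0, 10.0, 100.0], example_tail))  # 31.0
```

The backing-store term is the last product in the loop: its access time is weighted by the miss ratio of the lowest cache level, as the text describes.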
B.1 Calculus of Variations
First, we put the cost function B into an appropriate form and obtain the constraint function g:

$$g = c_1 s_1 + c_2 s_2 + \cdots + c_n s_n - B = 0 \qquad (3)$$

T is a function of the n variables $s_1, \ldots, s_n$, as is the cost function B and its associated cost constraint g. At the point where T is minimized, we know that:

$$\nabla T = \lambda \nabla g$$

where $\lambda$ is the Lagrange multiplier. Combining the gradients of Eqs 2 and 3, we get

$$-t_{i+1}\, p(s_i) = \lambda c_i, \quad \text{for } 1 \le i \le n$$

This form gives us an interesting ratio which we will call the cost-performance ratio $\Phi_{ij}$:

$$\Phi_{ij} \equiv \frac{p(s_i)}{p(s_j)} = \frac{c_i\, t_{j+1}}{c_j\, t_{i+1}} \qquad (4)$$

The behavior of $\Phi_{ij}$ is described later; for now, it is only necessary to note that $\Phi_{ii} = 1$. Eq 1 (total system cost) and Eq 4 yield n independent equations, so solving for the n variables $s_1, \ldots, s_n$ is straightforward as long as p(x) is invertible.
B.2 Modeling Program Behavior
We wish to replace p(x) in Eq 4 with a specific function. Smith, Stone and others have noted that a cache's miss rate can be modeled as a one-term polynomial function of its size, of the form $\alpha x^{\beta}$ where $\alpha$ and $\beta$ are constants with $\beta$ less than zero [17, 19]. This follows from the 30% Rule.$^{2}$ It is also consistent with our workload traces in Section V. Thus we assume polynomial forms for P(x) and p(x) in this paper; we have also used an exponential form with similar results [8].
It is easiest to start at the cumulative probability graph P(x). As mentioned before, P(x) is related to a cache's hit ratio: an LRU-managed cache of size x would have a hit rate of P(x), given the input stream that generated P(x). It is necessary that P(x) be 0 at 0 and 1 at infinity, and ideally it would have a form similar to $1 - \alpha x^{\beta}$. For simplicity and convenience we make the following changes:
- we would like the exponent to be positive, so we move x to the denominator,
- the function blows up at 0, so we replace x with x + 1,
- we want the derivative p(x) to have a simple exponent, so we write the exponent as $\alpha - 1$, and
- the function defined would not have the value 0 at 0, and also would not be unitless, so we scale the value of x directly by $\beta$.
This gives us the following forms for P(x) and its differential p(x):

$$P(x) = 1 - \frac{1}{(x/\beta + 1)^{\alpha - 1}}, \qquad p(x) = \frac{\alpha - 1}{\beta}\,\frac{1}{(x/\beta + 1)^{\alpha}}$$

Together with Eq 1 (total system cost), we have n equations and n unknowns and can solve for each $s_i$ in the following manner using $s_1$.
Note that every solution is linear and that some solutions can have negative values, or non-zero values at budget zero. This is an artifact of using Lagrange multipliers and assuming the variables $s_1, \ldots, s_n$ can take on any values, even negative ones. This is fixed in the section Undoing the Effects of Negative Solutions.
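The algebra behind the closed form can be sketched as follows. This is a derivation under stated assumptions, not a quotation of the paper's displayed equations: it assumes the density has the form $p(x) = \frac{\alpha-1}{\beta}(x/\beta+1)^{-\alpha}$, writes the cost-performance ratio of Eq 4 as $\Phi_{ij} = c_i t_{j+1} / (c_j t_{i+1})$, and substitutes into the budget constraint of Eq 1. The resulting expression is linear in B, with a level-dependent denominator and intercept, which is the behavior attributed to Eq 7.

```latex
% Sketch: substitute p(x) \propto (x/\beta + 1)^{-\alpha} into Eq 4,
% then eliminate every s_j (j \ne i) using the budget constraint of Eq 1.
\begin{align*}
\Phi_{ij} = \frac{p(s_i)}{p(s_j)}
  = \left(\frac{s_j + \beta}{s_i + \beta}\right)^{\alpha}
  \quad&\Longrightarrow\quad
  s_j = (s_i + \beta)\,\Phi_{ij}^{1/\alpha} - \beta, \\[4pt]
B = \sum_{j=1}^{n} c_j s_j
  = (s_i + \beta)\sum_{j=1}^{n} c_j \Phi_{ij}^{1/\alpha} - \beta\sum_{j=1}^{n} c_j
  \quad&\Longrightarrow\quad
  s_i = \frac{B + \beta\sum_{j=1}^{n} c_j\bigl(1 - \Phi_{ij}^{1/\alpha}\bigr)}
             {\sum_{j=1}^{n} c_j\,\Phi_{ij}^{1/\alpha}} .
\end{align*}
```

In this form the slope with respect to B is $1/\sum_j c_j \Phi_{ij}^{1/\alpha}$ and the intercept is proportional to $\beta$, so both differ from level to level.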
Note that all values are constants except for B, the system budget; Eq 7 says the size of each hierarchy level increases linearly with the amount of money to spend on the system. The costs and access times $c_i$ and $t_i$ are constants derived from the chosen technologies. Note that the denominator is different for each level in the hierarchy; therefore the rate of increase is different for each level. The y-intercept is also different for every level in the hierarchy. The shape of the curves is shown in Fig 5.
B.3 Undoing the Effects of Negative Solutions
Eq 7 can yield negative values for $s_i$, particularly for small system budgets. As it is obviously impossible to have negative amounts of memory, a level with negative size should actually have zero size and not appear in an optimal hierarchy. This leads to the concept of a crossover budget for a hierarchy level i, called $\Omega_i$. Only when the budget is greater than $\Omega_i$ does the optimal system include level i; for cheaper systems, money is best spent on other hierarchy levels. When a hierarchy level is not part of the system, it is not part of the system cost or the average access time.
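The linear behavior described above can be checked numerically. The sketch below is an illustration under stated assumptions rather than the paper's own equation: it assumes the polynomial locality model with parameters `alpha` and `beta`, the cost-performance ratio $\Phi_{ij} = c_i t_{j+1} / (c_j t_{i+1})$ of Eq 4, and a solution of the form $s_i = (B + \beta \sum_j c_j (1 - \Phi_{ij}^{1/\alpha})) / \sum_j c_j \Phi_{ij}^{1/\alpha}$ obtained by substituting that model into Eqs 1 and 4. All function names and numeric values are hypothetical.

```python
def cost_performance_ratio(costs, times, i, j):
    """Phi_ij = (c_i * t_{j+1}) / (c_j * t_{i+1}), with 0-based level indices."""
    return (costs[i] * times[j + 1]) / (costs[j] * times[i + 1])


def unconstrained_sizes(budget, costs, times, alpha, beta):
    """Lagrange-multiplier solution for every level; sizes may be negative.

    costs : [c_1, ..., c_n] unit cost of each cache level, fastest first
    times : [t_1, ..., t_{n+1}] access times, backing store last
    """
    n = len(costs)
    sizes = []
    for i in range(n):
        weights = [cost_performance_ratio(costs, times, i, j) ** (1.0 / alpha)
                   for j in range(n)]
        denom = sum(c * w for c, w in zip(costs, weights))        # sets the slope
        intercept = beta * sum(c * (1.0 - w) for c, w in zip(costs, weights))
        sizes.append((budget + intercept) / denom)
    return sizes


# Hypothetical two-level hierarchy: an expensive fast level over a cheap slow one.
costs = [8.0, 1.0]
times = [1.0, 10.0, 1000.0]
print(unconstrained_sizes(0.0, costs, times, alpha=2.0, beta=1.0))
print(unconstrained_sizes(100.0, costs, times, alpha=2.0, beta=1.0))
```

With these numbers the top level comes out negative at a budget of zero while the bottom level is positive, which is the artifact the following subsection removes.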
To find the crossover budgets, we note that the highest level in the hierarchy has an optimal size less than zero at small budgets. We calculate the budgets where the size is negative and remove this level from consideration at these budgets. The resulting equations are identical to the original but with a few subscripts changed. This process is repeated down the hierarchy to obtain the final equations. The process is described by the following, where $\Phi_{ij}$ is the cost-performance ratio of Eq 4:
1. Each $s_i$ is a linear function of B with positive slope, since every cost and access time is positive and so every $\Phi_{ij} > 0$.
2. Simple inspection of the relative costs and access times of the technologies shows that $\Phi_{ij} > 1$ when $i < j$, and, since $\Phi_{ji}$ is the inverse of $\Phi_{ij}$, $\Phi_{ij} < 1$ when $i > j$. Therefore, at least when i = 1 (when looking at the topmost cache level), $1 - \Phi_{ij}^{1/\alpha} \le 0$ for all j, and so the constant term (y-intercept) of the linear equation (Eq 7) is negative; this means the optimal size of level i = 1 at budget B = 0 is negative. When i = n, the y-intercept is positive for a similar reason; the optimal size of level i = n at budget B = 0 is positive. This means that the analytic optimal solution has traded cache level 1 for more of level n.
3. A priori, we cannot tell whether any other $s_i$ has a positive or negative y-intercept, so we look further at $s_1$. Eq 7 leads to positive solutions for level 1 only when the budget exceeds a threshold value.
4. For system budgets less than this value, we remove level 1 from consideration. Without level 1, we apportion the budget across the n-1 remaining levels, hierarchy levels 2 through n. This leads to a new equation similar to Eq 2 but with an integral from $s_2$ to infinity, as well as a new equation similar to Eq 1 but starting the sum at level 2. We can solve these new n-1 equations for n-1 unknowns in the same manner as before but without reference to level 1. This means the equations for T and B change to become functions of one fewer variable, so that level 1 of the hierarchy affects neither the average access time nor the budget. Solving them gives a general solution for the size of the ith level in this reduced hierarchy. It has a form identical to the original equation, except for the indices of the summations, which now sum from level 2 to level n. This marks the end of one iteration.
5. Applying the same procedure with the new set of equations, we see that the size of $s_2$ is negative for budget values below a corresponding threshold. We can now remove level 2 from the analysis and obtain a new equation for $s_i$.
When this process is repeated down through every level in the hierarchy, we find that the crossover budget $\Omega_i$ for each level is given by Eq 9, where we define $\Omega_0$ to be infinity for notational brevity.
In general, we find that ... "realistic" values: costs should monotonically decrease and access times should monotonically increase as one moves down the hierarchy to a larger i. The figures are applicable across all choices of technologies for the memory hierarchy using realistic values for costs and access times.
Given a budget and a workload characterization, and told to find the appropriate cache organization, one would first find the crossover values for the levels in the hierarchy, using Eq 9. This would indicate which cache levels should be present in the hierarchy at that budget, and what value of k to use in the next step. The next step is to use Eq 10 to find the sizes of each level at the budget value, on the interval indicated by the value of k. Alternatively, one could find the cache sizes for every realistic budget value from zero up. Here, one would only need to use Eq 10, and use all values of k, from 0 to n.
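The two-step procedure above can be sketched in code. This is a minimal illustration under stated assumptions, not a transcription of Eqs 9 and 10: it assumes the polynomial locality model with parameters `alpha` and `beta`, the ratio $\Phi_{ij} = c_i t_{j+1} / (c_j t_{i+1})$ of Eq 4, and a crossover budget of the form $\Omega_k = \beta \sum_{j \ge k} c_j (\Phi_{kj}^{1/\alpha} - 1)$, which is what the top-down removal procedure yields under that model. Indices are 0-based, and all names and numbers are hypothetical.

```python
def _ratio(costs, times, i, j, alpha):
    """Phi_ij ** (1/alpha), where Phi_ij = (c_i * t_{j+1}) / (c_j * t_{i+1})."""
    phi = (costs[i] * times[j + 1]) / (costs[j] * times[i + 1])
    return phi ** (1.0 / alpha)


def crossover_budget(k, costs, times, alpha, beta):
    """Budget above which level k gets a positive size when only
    levels k..n-1 are candidates (0-based indices)."""
    n = len(costs)
    return beta * sum(costs[j] * (_ratio(costs, times, k, j, alpha) - 1.0)
                      for j in range(k, n))


def optimal_sizes(budget, costs, times, alpha, beta):
    """Sizes of all n levels; levels that are not yet cost-effective get 0."""
    n = len(costs)
    # Step 1: drop levels from the top until the topmost one is affordable.
    top = 0
    while top < n - 1 and budget < crossover_budget(top, costs, times, alpha, beta):
        top += 1
    # Step 2: size the remaining levels with the reduced set of equations.
    sizes = [0.0] * n
    active = costs[top:]
    for i in range(top, n):
        weights = [_ratio(costs, times, i, j, alpha) for j in range(top, n)]
        denom = sum(c * w for c, w in zip(active, weights))
        intercept = beta * sum(c * (1.0 - w) for c, w in zip(active, weights))
        sizes[i] = (budget + intercept) / denom
    return sizes


# Hypothetical two-level hierarchy: an expensive fast level over a cheap slow one.
costs, times = [8.0, 1.0], [1.0, 10.0, 1000.0]
print(optimal_sizes(10.0, costs, times, alpha=2.0, beta=1.0))    # [0.0, 10.0]
print(optimal_sizes(100.0, costs, times, alpha=2.0, beta=1.0))   # both levels present
```

At the small budget the whole amount goes to the cheapest level; above the crossover budget of the upper level both levels grow, and the sizes always spend the budget exactly.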
IV. Discussion
We have found a closed-form solution for the size of each level in a general memory hierarchy, given device parameters (cost and speed), available system budget, and a measure of the workload's temporal locality.$^{3}$ The solution is given by Eqs 9 and 10.
A. The Bottom Line
The solution indicates how one should spend one's money. The first dollar should go to the lowest level in the hierarchy. As money is added to the system, the size of this level should increase, until it becomes cost-effective to purchase some of the next level up. From that point on, every dollar spent on the system should be divided between the two levels in a fixed proportion, with more bytes being added to the lower level than the higher level. This does not necessarily mean that more money is spent on the lower level. Every dollar is split this way until it becomes cost-effective to add another hierarchy level on top, and from that point on every dollar is split three ways, with more bytes being added to the lower levels than the higher levels, until it becomes cost-effective to add another level on top. Since real technologies do not come in arbitrary sizes, hierarchy levels will increase as step functions approximating the slopes of straight lines.
The closed-form solution has several implications. First, note that the crossover budget for level i is always larger than the crossover budget for level i+1 and that the crossover budget for level n is 0. This means that the optimum hierarchy with a small budget (between 0 and the crossover budget $\Omega_{n-1}$) consists solely of the slowest, cheapest cache level; all other levels do not exist. This is counter-intuitive: we (including the authors) normally think of adding the fastest cache level first in an attempt to speed up the average access without concern for the worst-case access. Our solution shows that this intuition is incorrect: the slower, cheaper cache level can capture more of the misses to backing store, and it is far more valuable to prevent references from having to be satisfied by a tape drive with a 15-second access time than to optimize the access time of hits higher up in the hierarchy.
Once the slowest cache level is large enough to divert a large fraction of the misses to backing store (at system budget $\Omega_{n-1}$), we then start increasing the next higher cache level (n-1) along with the lowest level.
The crossover budget $\Omega_i$ for a given level i depends on workload and device parameters (Eq 9):
- $\Omega_i$ decreases with better temporal locality, that is, smaller values of $\beta$ or larger values of $\alpha$. As expected, better temporal locality favors adding higher cache levels sooner.
- $\Omega_i$ decreases as the devices for the higher cache levels improve in cost and speed. The cost-performance ratio $\Phi_{ij}$ decreases with lower cost and faster times (described in detail later); both these technology improvements decrease the crossover budget.
To summarize our first conclusion, money spent on a given level is money wasted if the level below it is not large enough. If the lower level is not large enough, it allows too many performance-crippling accesses to the backing store.
Second, we see that, within each region $[\Omega_i, \Omega_{i-1}]$ between consecutive crossover budgets, the size of each level increases linearly with system budget. That is, within each region, additional dollars are spent according to a fixed proportion.
Third, note that the slope of $s_i$ between any two crossover budgets is higher for larger values of i, because the cost-performance ratio $\Phi_{ij}$ decreases with i and is in the denominator of the slope in Eq 10. Thus, even when one allocates money to increase the size of a fast cache level, one still should increase the size of the lower cache level even faster.$^{4}$ The rate at which level i increases depends on workload and device parameters (Eq 10):
- The difference in slopes between higher and lower cache levels decreases with better temporal locality, that is, with larger values of $\alpha$. With high locality, cache levels increase in size at nearly the same rate.
- The slope for a cache level increases as the devices for the cache level improve. $\Phi_{ij}$ decreases with lower cost and faster times$^{5}$; both these technology improvements decrease the denominator of Eq 10 and hence increase the rate that the size of this level increases.
C. Meanings and Interactions of $\Phi_{ij}$, $\alpha$, and $\beta$
Remember that the cost-performance ratio $\Phi_{ij}$ has the following form:

$$\Phi_{ij} = \frac{c_i\, t_{j+1}}{c_j\, t_{i+1}} = \frac{c_i / t_{i+1}}{c_j / t_{j+1}}$$

$\Phi_{ij}$ is the ratio of cost-performances between two levels. This suggests that each level in the hierarchy can be characterized by two numbers: the cost of the technology at that level and the speed of the technology in the level beneath it. This is the effectiveness of a given cache level; it explains how good a job the level does in cost per second cut, or how many dollars per byte it costs to save a second of references to the next lower level. These ratios combine to characterize the entire hierarchy; every level gets compared against each other in the equations.
The temporal locality of a workload is characterized by the two variables $\alpha$ and $\beta$. The variables $\alpha$ and $\Phi_{ij}$ always appear together ($1/\alpha$ is always in the exponent of $\Phi_{ij}$). The cost-performance ratio $\Phi_{ij}$ indicates how good a job one level does at reducing access time compared to another level, and scales how large each of the levels should be in relation to one another. The term $\alpha$ tempers the effect of the cost-performance ratio. For example, when locality is good, $\alpha$ is large (the shape of the locality curve is steep), so $1/\alpha$ is small, and the effect of the cost-performance ratio in differentiating the levels is small. The result is that crossover budgets will be closer to the y-axis; it will make more sense to include the upper cache levels at smaller budgets. The size of different levels will increase at similar rates. In the traditional characterization of a memory hierarchy as a pyramid, the difference in sizes between one level and the next will be much less than if the locality were poor; when locality is good, it will be a narrow and tall pyramid.
$^{4}$ Its size increases faster; its cost may or may not increase faster.
$^{5}$ Since $t_i$ does not appear in $\Phi_{ij}$, it is a bit more accurate to say that $\Phi_{ij}$ decreases as the cost of level i decreases and as the access times of all other technologies increase relative to that of level i.
When locality is poor, the shape of the differential curve will be less steep and $1/\alpha$ will be closer to 1. The effect of $\Phi_{ij}$ will be more pronounced: the size of different levels will increase at different rates, and the crossover budgets will be much further out. When crossover budgets do occur, the sizes of the existing levels will be much larger than in the case where locality is high. The result will be a much broader hierarchy than in the previous case; it will take more money to add on the higher levels, and the base levels will be much larger when the higher levels do get added. Just as a workload with good locality results in a tall and thin hierarchy, a workload with poor locality results in a short and wide hierarchy.
The term $\beta$ is a scaling factor; its units are the same as $s_i$ (be it bytes, kilobytes, megabytes, etc.) and its effect is to scale where the crossover budgets occur on the x-axis by scaling the y-intercepts. When the workload spans an enormous amount of data and a convenient unit for graphing is chosen to be MBytes, $\beta$ will tend to be large, pushing the crossover budgets further out. When the workload spans a smaller range and a smaller unit for $\beta$ is chosen, $\beta$ will be smaller, drawing the crossover budgets in.
D. Using the Model to Choose a Subset of Technologies
The model specifies the optimal size of each level with a given set of technologies. By finding the crossover budgets, the model also determines when higher levels in the hierarchy should not exist. However, the model does not automatically determine if technologies in the middle of the hierarchy should be removed. For instance, consider a hierarchy of RAM, disk, and tape, where the disk is almost as expensive as RAM and almost as slow as tape. The model will suggest the best way to arrange the three levels, given an operating budget. The model does not attempt to find any weak links in the hierarchy, except for levels from the top down. If the model decides that the RAM should be part of the hierarchy, the disk will also be kept. In this example, the model may suggest a configuration where the disk level is only slightly larger than the RAM level, indirectly showing that the disk technology is useless in the hierarchy.
Since the model takes only a moment to recommend a configuration, we can easily use it to choose a subset of devices from a larger pool of technologies. This is similar to Przybylski's dynamic programming approach to hierarchy optimization [12], but it is much simpler because we can quickly search through all possible subsets. This process will find the best organization of the best subset of technologies at a given budget point.
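The subset search can be sketched as an exhaustive loop over candidate technology sets, sizing each with the model and keeping the fastest. The code below is an illustration under stated assumptions: the polynomial locality model with parameters `alpha` and `beta`, the ratio $\Phi_{ij}$ of Eq 4, a top-down removal of levels that are not yet cost-effective, and hypothetical device numbers chosen to mimic the RAM/disk/tape example (a disk almost as expensive as RAM and almost as slow as the backing store). None of the names or values come from the paper.

```python
from itertools import combinations


def miss_ratio(x, alpha, beta):
    """Tail of the polynomial locality model: 1 - P(x) = (x/beta + 1)^(1 - alpha)."""
    return (x / beta + 1.0) ** (1.0 - alpha)


def model_sizes(budget, costs, times, alpha, beta):
    """Sizes recommended by the model; levels that are not yet affordable get 0."""
    n = len(costs)

    def w(i, j):  # Phi_ij ** (1/alpha)
        return ((costs[i] * times[j + 1]) / (costs[j] * times[i + 1])) ** (1.0 / alpha)

    top = 0
    while top < n - 1 and budget < beta * sum(
            costs[j] * (w(top, j) - 1.0) for j in range(top, n)):
        top += 1                                  # drop the topmost level
    sizes = [0.0] * n
    for i in range(top, n):
        denom = sum(costs[j] * w(i, j) for j in range(top, n))
        intercept = beta * sum(costs[j] * (1.0 - w(i, j)) for j in range(top, n))
        sizes[i] = (budget + intercept) / denom
    return sizes


def access_time(sizes, times, alpha, beta):
    """Average time per reference, counting only levels that exist (size > 0)."""
    present = [(s, t) for s, t in zip(sizes, times) if s > 0.0]
    if not present:
        return times[-1]                          # every reference goes to backing store
    below = [t for _, t in present[1:]] + [times[-1]]
    total = present[0][1]
    for (size, _), t_next in zip(present, below):
        total += t_next * miss_ratio(size, alpha, beta)
    return total


def best_subset(budget, techs, backing_time, alpha, beta):
    """Try every non-empty subset of techs, given as (name, unit cost, access
    time) tuples listed fastest first; return (time, names, sizes) of the best."""
    best = None
    for r in range(1, len(techs) + 1):
        for subset in combinations(techs, r):
            costs = [c for _, c, _ in subset]
            times = [t for _, _, t in subset] + [backing_time]
            sizes = model_sizes(budget, costs, times, alpha, beta)
            t_avg = access_time(sizes, times, alpha, beta)
            if best is None or t_avg < best[0]:
                best = (t_avg, [name for name, _, _ in subset], sizes)
    return best


# Hypothetical devices: the "disk" is nearly as costly as RAM and nearly as slow as tape.
techs = [("ram", 8.0, 1.0), ("disk", 7.0, 900.0)]
print(best_subset(100.0, techs, backing_time=1000.0, alpha=2.0, beta=1.0))
```

With these numbers the search keeps only the RAM level, which is the weak-link situation described above; with n candidate technologies the loop visits $2^n - 1$ subsets, which stays cheap because each evaluation is closed-form.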
V. Verification
The analysis in Section III makes the following simplifications:
- The polynomial stack distance curves do not perfectly model a real workload. In particular, a cache would need to be infinitely large to achieve a 100% hit rate with a polynomial stack distance curve, while only a finite size is needed to achieve this with a real workload. At large budget values, our solutions recommend endlessly increasing the size of all levels; the optimal design would cease increasing the size of a level once it contained all data in the trace.
- The model assumes a fully associative cache model at all levels of the hierarchy.
- The model ignores the effects of a block size, including a certain amount of prefetching and cache pollution.
- The model does not distinguish between read and write behavior. All accesses are treated as reads: they incur delay when issued rather than when forced out of the cache for consistency or cache overflow.
- The model ignores compulsory misses. This affects the performance predicted by Eq 2. However, it has no effect on the optimal hierarchy design, since cache levels of all sizes will miss these references (assuming no prefetching).
- The performance of each technology is characterized by a single access time that does not change as the size of the cache grows.
- The cost function for each technology is strictly proportional to size; there is no extra cost for the first byte.
These simplifications make Eq 2 less accurate at predicting hierarchy performance, possibly affecting the optimal hierarchy recommended by Eqs 9 and 10. The goal of this section is to verify that the general hierarchy configurations recommended by Eqs 9 and 10 do indeed achieve close to optimal performance. To this end, we compare the model's ability to recommend specific sizes for general hierarchies against simulations of real technologies. We performed exhaustive, detailed simulation of two hierarchies, using two different traces from real applications, and specifications from real technologies. The first simulation is of an AFS server with memory, disk, and optical disk; it uses a month of network file requests as the trace. The second is of a memory hierarchy with on-chip cache, off-chip cache, and main memory; it uses an application-level, virtual address trace as the workload.
A. Simulator Description
Our simulator connects together device modules such as CPU caches, main memory, disk drives, optical disks, and cartridge tape robots. The device modules simulate the various caches and keep track of usage statistics. At each level, if the item requested is not present it is requested from the next level down. The request time at each level is the time to first access plus the amount of data to transfer divided by the transfer rate.
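A minimal sketch of this request-time rule is given below. It is not the simulator used in the paper: the class and function names are hypothetical, the device parameters are invented, and it assumes that every level visited on the way down is charged its own request time, which the description above does not state.

```python
from dataclasses import dataclass, field


@dataclass
class Level:
    name: str
    first_access_s: float          # time to first access, in seconds
    transfer_rate: float           # bytes per second
    contents: set = field(default_factory=set)

    def request_time(self, nbytes):
        # Time to first access plus transfer time for the requested data.
        return self.first_access_s + nbytes / self.transfer_rate


def access(levels, item, nbytes):
    """Return (name of the level that supplied the item, total time in seconds).

    Assumption: every level visited on the way down is charged its request
    time. The last level is treated as the backing store and always hits.
    """
    total = 0.0
    for i, level in enumerate(levels):
        total += level.request_time(nbytes)
        if item in level.contents or i == len(levels) - 1:
            return level.name, total
    raise ValueError("hierarchy has no levels")


# Invented device parameters, chosen only to make the arithmetic easy to follow.
hierarchy = [
    Level("RAM", first_access_s=0.25, transfer_rate=4000.0, contents={"a"}),
    Level("disk", first_access_s=0.5, transfer_rate=2000.0, contents={"a", "b"}),
    Level("optical", first_access_s=2.0, transfer_rate=1000.0),
]
```

For a 1000-byte request, item "a" is served from RAM, "b" falls through to disk, and any other item reaches the optical backing store. The sketch does not fill the caches on a miss or track usage statistics, which a full device module would need to do.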
In the storage hierarchy simulations, all caches were modeled as write-through with fetch-on-write [17]. In the processor-cache hierarchy simulations, the caches were modeled as writeback with fetch-on-write. None of the simulated caches are fully associative. The I/O caches are set-associative; the RAM file cache is 256-way set-associative, and the disk cache is 1024-way set-associative. The off-chip cache is 4-way set-associative, and the on-chip cache is modeled as direct-mapped.
We simulated different ways of allocating money across the levels, in quanta of $256 for the storage hierarchy simulation and $64 for the processor-cache hierarchy simulation.
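The allocation space that such an exhaustive search covers can be enumerated as in the sketch below. It assumes that the whole budget is always spent and that the budget is a multiple of the quantum; the function name is hypothetical.

```python
def allocations(budget, quantum, levels):
    """Yield every split of `budget` dollars across `levels` cache levels.

    Each level receives a multiple of `quantum` dollars and the shares sum to
    the budget (any remainder below one quantum is dropped).
    """
    units = budget // quantum

    def split(remaining, k):
        # Distribute `remaining` quanta over k levels.
        if k == 1:
            yield (remaining,)
            return
        for first in range(remaining + 1):
            for rest in split(remaining - first, k - 1):
                yield (first,) + rest

    for combo in split(units, levels):
        yield tuple(u * quantum for u in combo)


# A $1024 budget split between two cache levels in $256 quanta.
two_level = list(allocations(1024, 256, 2))
```

Each tuple would then be handed to the simulator, and the allocation with the lowest average time per reference kept. For two levels the number of candidates grows linearly with the budget, but every candidate requires a full trace simulation, which is what makes the exhaustive approach slow.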
B. Storage Hierarchy Simulations
We simulated a storage hierarchy similar to that of Plan 9 [11, 13], where the file system lives entirely on an optical disk jukebox and is cached by DRAM and magnetic disk. Table I describes the specifications used in the simulator and analytical calculations for the constant values of the various c_i and t_i (specifications taken from [6, 14]).
B.1 Workload
The workload for the I/O hierarchy simulations was collected by a logging AFS server [1]. The server sees all requests not serviced from the client's local AFS disk cache [6]. We used one month's worth of trace ... The rest of the commands do not read and write data; for example, AFS uses fetchstatus to synchronize the local cache with the server.
The traces were analyzed to measure their temporal locality, producing the stack distance curves in Fig 7. By using a standard sum-of-squares technique to fit Eq 5 to the data, we found that α = 2.91 and β = 439.68.
Fig 8a shows the optimal sizes of the hierarchy levels as predicted by our analysis (Eqs 9 and 10). The crossover budget for the disk level is zero since it is the lowest cache level in the hierarchy; the crossover budget for DRAM is around $3800, implying that with less than $3800 to spend on the cache levels (disk and RAM), all the money should be allocated to disk.
Fig 8b shows the optimal sizes of the hierarchy levels as determined by the simulator. The simulator was written to take into account effects that the model ignores, and the results are noticeably different. Instead of two regions there are really three. The model predicts that the size of the topmost level in the hierarchy will remain zero until the crossover budget; in the simulated results, the crossover clearly appears earlier. The model also predicts that after the crossover budget the slope of the RAM line will be constant; this is not the case in the simulated results. Instead, the slope is steady through the middle region, and the RAM curve really takes off near the point where the model predicts the crossover budget. The region between $1500 and $4000 shows that locality can be exploited by a small amount of RAM, due to an effect that the model ignores.
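The stack distance curves used in the workload analysis above can be computed from a trace along the following lines. This is a sketch of the standard LRU stack distance calculation, not the authors' analysis tool; it measures distance in distinct items rather than bytes, and it drops first references (compulsory misses) from the cumulative curve, on the assumption that this matches the model's treatment of them.

```python
def stack_distances(trace):
    """Return the LRU stack distance of every reference in `trace`.

    A distance of 1 means the item was the most recently used one. First
    references have no finite distance and are reported as None.
    """
    stack = []          # least recently used first, most recently used last
    distances = []
    for item in trace:
        if item in stack:
            depth = len(stack) - stack.index(item)
            stack.remove(item)
            distances.append(depth)
        else:
            distances.append(None)   # compulsory miss
        stack.append(item)
    return distances


def cumulative_curve(distances):
    """Fraction of re-references with stack distance <= x, for each x."""
    finite = [d for d in distances if d is not None]
    if not finite:
        return {}
    return {x: sum(d <= x for d in finite) / len(finite)
            for x in range(1, max(finite) + 1)}


trace = ["a", "b", "c", "a", "b", "b", "d", "a"]
dists = stack_distances(trace)
curve = cumulative_curve(dists)
```

A fully associative LRU cache holding C items hits exactly those references whose distance is at most C, so the cumulative curve gives the hit rate as a function of cache size. The list-based stack is quadratic in the worst case; a month-long trace would need a tree-based or hashed structure.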
B.2 Optimal Configurations
As is evident, the analysis predicts values that are similar to, but not the same as, the optimal values determined by the simulator. However, this comparison is not enough; what matters more is the performance lost by using the configuration recommended by the analysis instead of the true optimum. Fig 9 shows the performance of four configurations: the optimal configuration found by simulation, the configuration recommended by the analysis, an all-DRAM configuration, and a configuration with 512 MB of disk with the remaining funds devoted to DRAM (the worst observed case). As Fig 9 shows, the predicted optimal configuration never performs more than 5% off the real optimal configuration. In contrast, two reasonable configurations (all DRAM with no disk; 512 MB disk with the remainder going to DRAM) perform as much as 50% worse. Thus, though the configurations recommended by the analysis differ from the optimal configurations, little performance is lost by using the analytic model. Moreover, the time saved by not needing the simulations was substantial: the simulations ran for many months of computer time, while applying the analysis took about 30 minutes to write an appropriate Maple script and less than a second to execute it.
Fig. 9. Performance comparisons for the storage hierarchy. The x-axis represents system budget; the y-axis represents the average time per simulated reference. The times include compulsory misses. The hierarchy consists of two levels, RAM and magnetic disk; backing store is an optical jukebox. The comparison is between the measured optimal running times and the running times of the predicted optimal configurations. Also shown are the running times of an all-RAM system and the worst observed configuration, one with a half-gigabyte cache on disk and the rest of the budget devoted entirely to RAM. The amounts of RAM in the last two configurations are not insignificant; they approached 200 MB at high budget values.
Fig. 8. ... of system budget. The x-axis represents the amount of money available and the y-axis represents the optimal size of each level in the hierarchy. The optimal configurations measured by simulation are more accurate, as the simulator takes into account a number of things ignored by the model: writes, block size, access time variance, and real traces.
C. Processor-Cache Hierarchy Simulations
We also simulated a typical three-level virtual memory hierarchy consisting of on-chip cache (L1), off-chip cache (L2), and main memory. We used pixie-generated traces of several programs from the SPEC92 benchmark suite: dnasa7, espresso, hydro2d, mdljdp2, mdljsp2, su2cor, and wave5. These programs were chosen above others because of their large cache footprints, necessary for making multilevel cache simulations run in a reasonable amount of time. Table II describes the specifications used in the simulator and analytical calculations for the constant values of the various c_i and t_i.
The cost of on-chip cache needs a bit of explanation. In one sense, on-chip cache is completely free; it is a necessary part of the design and cache appears on nearly all microprocessors. On the other hand, it is infinitely expensive because you cannot arbitrarily increase its size. In an attempt to find a feasible analytical medium, we assumed that a CPU costs around $1K, roughly half its space is devoted to cache, and a typical cache these days is around 32 KBytes. Half of $1K, or about $500, thus buys 32 KBytes, which gives the $16K per megabyte number.
C.1 Workload
The traces were analyzed to measure their temporal locality, producing the stack distance curves in Fig 10. By using a standard sum-of-squares technique to fit Eq 5 to the data, we found that α = 2.05 and β = 24.52.
C.2 Optimal Configurations
Fig 11a shows the optimal sizes of the hierarchy levels as predicted by our analysis. The crossover budget for the L2 cache is zero since it is the lowest cache level in the hierarchy; the crossover budget for the L1 cache is around $500. Since the curve fit is so inaccurate, we applied our analytical approach to the raw data instead of a polynomial and obtained the graphs in Fig 11b. The configurations that were optimal according to the simulations are shown in Fig 11c, with error bars marking the configurations that will perform within 10% of optimal (any appropriation of system budget within the error bars would result in performance within 10% of the simulated optimal). Note that the granularity of the analytical graph is finer than that of the simulations; the simulations allocate funds across the hierarchy in $64 quanta while the analytical scripts allocate funds in $16 quanta.
Again, the simulator was written to take into account effects that the model ignores, and the results in Fig 11c are slightly different from the predictions. Compared to the results taken from the polynomial curve fit, the simulator results are much like the I/O hierarchy results: the model predicts two regions where the slopes will be constant, whereas the simulated results have several regions and the slopes are not constant. The L1 cache appears earlier, but its size does not really take off until after our predicted crossover budget. For small budgets, the data suggest that L1 cache is more effective at reducing access time than L2 cache; this is consistent with the probability density curve in Fig 10.
Most of the references lie within a 125KB stack distance, and very little lies in the region between 125KB and 500KB, which suggests that the L2 cache will not be truly effective until the budget allows a cache size of 500KB. This compares well with the sharp jump in the L2 cache size and sharp decline of the L1 cache size at the $500 budget point.
Fig. 12. The x-axis represents system budget; the y-axis represents the average time per simulated reference. Note that the times do include compulsory misses. The hierarchy consists of two levels: on-chip cache (L1) and off-chip cache (L2); backing store is main memory (DRAM). The comparison is between the measured optimal running times and the running times of the predicted optimal configurations. Also shown are the running times of an all-L1 system and an all-L2 system. The worst observed configuration is exactly the all-L1 configuration.
If we compare the simulated results to the analytical results taken from the raw data, there is much more similarity; both show that L1 should be present at small budget values, and its size should take off around a budget of $1000. However, this is less informative than the actual system performance. As before, we compare the performance of the recommended configurations to that of the simulated optimal configurations. Fig 12 shows the performance of five configurations: the optimal configuration found by simulation, the configuration recommended by the analysis applied to the polynomial curve fit, the configuration recommended by the analysis applied to the raw locality data, an all-L2 cache configuration, and the worst observed case, which happens to be an all-L1 configuration. As Fig 12 shows, the configuration predicted via the raw data performs within 5% of the real optimal configuration, except for the one place where it predicts that the L1 cache size should be 0. Here, the performance is several times worse than that of the simulated optimal configuration. The analytical results using the polynomial fitted curve are equal to the all-L2 configuration until the crossover point at $500; from there they drop to within 5% of the optimal curve and remain within 5% from that point on.
In contrast, two perfectly reasonable configurations (all L1 with no L2; all L2 with no L1) perform as much as 800% worse. Again, though the configurations recommended by the analysis differ from the optimal configuration, little performance is lost.
Fig. 10. Stack distance curves for SPEC traces. The cumulative probability curve is shown with its curve fit on the left; the probability density is shown on the right. Collected data is shown with dotted lines; the curves fit to the data are shown in solid lines. The region between 125KB and 600KB of the probability curve is non-zero; the scale required to show the other data obscures this.
VI. Conclusions
In this paper, we have derived a simple model for determining the optimal size of each cache level in an n-level cache hierarchy. The model is based on the access time t_i and unit cost c_i for each level, the total system budget for the cache levels B, and a two-parameter characterization of workload locality (α and β). By using a specific form for the stack distance curves, we were able to derive closed-form solutions for the size of each cache level s_i on each interval between consecutive crossover budgets.
Our model led to four observations about configuring cache hierarchies. First, it is common to focus erroneously on hit time rather than miss time in designing hierarchies.
In contrast, the model shows that miss time is more important until the system budget is large enough to achieve high hit ratios in the lowest cache level. This implies that the first place to spend money when designing a cache hierarchy is the cheapest level, rather than the fastest level. As specific applications of this principle, CPU cache designers should be aware that a larger off-chip cache may yield better performance than a faster, but smaller, on-chip cache. Also, when tertiary storage becomes more common, storage hierarchy designers should be aware that having enough disk will be much more important than having enough RAM.
As a corollary to our first observation, the model recommends increasing the size of the slower cache levels faster than the size of the faster cache levels, even when it makes sense to include those faster cache levels.
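The first observation can be illustrated with a small brute-force search over a toy two-level hierarchy. Everything in this sketch is an assumption made for illustration: the miss curve (x/β + 1)^(-α) is an assumed two-parameter form rather than a restatement of Eq 5, the average-time expression assumes fully associative, inclusive LRU levels rather than reproducing Eq 2, and all costs, times, and locality values are invented.

```python
def miss_fraction(size, alpha, beta):
    """Fraction of references whose stack distance exceeds `size`.

    ASSUMED curve, for illustration only: (size / beta + 1) ** (-alpha).
    """
    return (size / beta + 1.0) ** (-alpha)


def average_time(fast_size, cheap_size, alpha, beta, t_fast, t_cheap, t_backing):
    """Average access time of a fast level, a cheap level, and a backing store."""
    miss_fast = miss_fraction(fast_size, alpha, beta)
    # The cheap level only helps with references beyond the fast level's reach.
    miss_cheap = miss_fraction(max(fast_size, cheap_size), alpha, beta)
    return (t_fast * (1.0 - miss_fast)
            + t_cheap * (miss_fast - miss_cheap)
            + t_backing * miss_cheap)


def best_split(budget, quantum, c_fast, c_cheap, **model):
    """Try every split of the budget in `quantum` steps; return the best one
    as (dollars on fast level, dollars on cheap level, average time)."""
    best = None
    for fast_dollars in range(0, budget + 1, quantum):
        cheap_dollars = budget - fast_dollars
        t = average_time(fast_dollars / c_fast, cheap_dollars / c_cheap, **model)
        if best is None or t < best[2]:
            best = (fast_dollars, cheap_dollars, t)
    return best


# Invented parameters: sizes in MB, costs in $/MB, times in ms.
toy = dict(alpha=2, beta=100.0, t_fast=0.1, t_cheap=10.0, t_backing=1000.0)
small_budget = best_split(100, 25, c_fast=100.0, c_cheap=1.0, **toy)
large_budget = best_split(10000, 2500, c_fast=100.0, c_cheap=1.0, **toy)
```

Under these invented parameters the $100 budget is best spent entirely on the cheap level, because avoiding the slow backing store dominates the average time. At the $10000 budget the best split found does fund the fast level. The numbers carry no weight beyond showing the qualitative effect.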
Third, we saw from the model that within each crossover budget interval (the range of budgets between two consecutive crossover budgets), the size of each level increases linearly with system budget.
Fourth, we observed that the workload locality had an interesting effect on the shape of the memory hierarchy. For workloads with good locality (large α and small β), the optimal memory hierarchy is narrow. That is, the difference in size between cache levels is small. Inversely, for workloads with poor locality (small α and large β), the optimal memory hierarchy is wide; that is, the difference in size between cache levels is large.
Fig. 11. Optimal configurations of the processor-cache hierarchy. The optimal configurations, both predicted (a, b) and measured (c), as functions of system budget. The x-axis represents the amount of money available and the y-axis represents the optimal size of each level in the hierarchy. The optimal configurations determined by simulation are more accurate, as the simulator takes into account items ignored by the model, including writes, block size, access time variance, and real traces. The data in (a) were obtained from the polynomial curve fit, while the data in (b) were obtained from the raw cumulative probability data itself. The error bars in (c) indicate the configuration ranges of L1 and L2 that will give performances within 10% of the optimal configuration.
His research interests include operating systems, distributed systems, computer architecture, and algorithmic composition.
languages, VLSI design, and computer vision, and he holds a patent in computer-aided design of VLSI circuits.
Peter
His current research includes the design and construction of a high-performance microsupercomputer, computer architecture, computer-aided design, and operating systems. In addition to his position as a faculty member, he is a consultant for several computer companies in the areas of architecture and CAD. He is also Associate Editor for ACM Computing Surveys. Trevor Mudge is a Fellow of the IEEE, a member of the ACM, the IEE, and the British Computer Society.
