Private Bag 92019, Auckland, New Zealand cthombor@cs.auckland.ac.nz
Introduction
The cost of a high-performance computer system is, in many cases, determined more by its memory subsystem (caches, RAM, disk, and the connecting buses) than by its instruction-processing subsystem (CPUs, FPUs, ALUs).
When designing efficient software, memory usage and sharing patterns can be more important considerations than minimizing instruction counts, as noted by many authors, recently including [5, 13, 9]. Traditional algorithmic analysis, however, is based on counting instructions under an increasingly misleading assumption of "unit-cost" memory accesses. Users have become accustomed to paying for their computations by the CPU-second, even though CPU resources are available at incredibly low prices on desktop systems, albeit with much less memory support than on supercomputers or mainframes.
Our analyses are based on, and serve to justify, the novel idea that computational "work" W can best be measured as the product of the total number of references R multiplied by the size S of the memory space in which these memory references are made:
$$W = RS$$
Our R is a count of references to appropriately-sized blocks, with no temporal locality between these block-references. By "no temporal locality," we mean that the block addresses are uniformly distributed in the space of size S. By "appropriately-sized blocks," we mean that we would increment our reference count R by ⌈n/B⌉ every time a user's code reads or writes n consecutive bytes in the address space. The block size B must, of course, be carefully chosen so that it is appropriate for any contemporary memory technology that could be used to build a system of capacity S.
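The block-counting rule can be illustrated with a minimal sketch. The function name and the 4096-byte blocksize in the usage line are hypothetical choices made for illustration only; the rule itself (charge ⌈n/B⌉ references for n consecutive bytes) is the one stated above, and alignment of the access is ignored.

```python
def block_references(n_bytes: int, block_size: int) -> int:
    """Increment to R for one access to n consecutive bytes: ceil(n / B)."""
    if n_bytes < 0 or block_size <= 0:
        raise ValueError("n_bytes must be >= 0 and block_size must be > 0")
    # Integer ceiling division; alignment of the access is ignored here.
    return -(-n_bytes // block_size)


# Hypothetical 4096-byte block: a 10,000-byte sequential read spans 3 blocks.
print(block_references(10_000, 4096))
```

Integer arithmetic is used so that the count stays exact for very large transfers.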
Well-designed user codes exhibit temporal locality. We capture this effect in our analysis by requiring that memory systems be "hierarchical" in their performance characteristics, in the following sense. A memory of capacity S must be supported by a smaller, faster memory which in turn is supported by smaller, faster memories. The recursion ends at the CPU registers. The supporting memories should service enough of the temporally-local references into S that the computer system is latency-bottlenecked on the large, slow (size-S) memory rather than the smaller, faster layers.
Our analyses are based on the belief that most computational problems of economic importance can best be solved on latency-bottlenecked computational systems. We do not analyze CPU bottlenecks and memory-bandwidth bottlenecks, because processor and memory parallelism is relatively easy to increase in contemporary, scalable computer systems. Latency reduction is, in comparison, an expensive proposition; as persuasively argued in [18], latency forms a fundamental "wall" on system performance.
A narrower reading of our work is independent of a belief in the fundamental nature of latency. In this reading, our economic analyses apply only to codes that are latency-bottlenecked on computer systems that are well-modeled by the method we describe in Section 3. For CPU-bottlenecked codes, traditional measures of algorithmic work (in computational steps or CPU-seconds) are more appropriate. For bandwidth-bottlenecked codes, it would be appropriate to charge users on the basis of their contribution to bus contention or memory saturation.
Yet another valid, but still narrower, reading of our work would largely ignore our analytic model of Section 3. Such a reading would consider only the immediate practical implications of our characterizations of Work and Quality on computer hardware system design, operating system design, and benchmark design. Our metrics would be viewed, in this reading, as an incidental addition to the standard measures of system performance, where each measure is to be considered on its own merits depending on the situation. We have written Sections 1 and 2 to be independent of our analytic model, to aid in such a narrow reading.
For latency-bottlenecked codes, our measure of computational work W = RS is the natural choice. A convenient unit for W on contemporary systems is the megareference·gigabyte, or equivalently the petareference·byte. We will abbreviate this unit as a "preb". For example, a million random accesses into a one-gigabyte database is 1 preb of work. This is the same amount of work as a thousand references into a one-terabyte database, or a billion references into a one-megabyte database.
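The three equivalent examples above can be checked with a few lines of arithmetic. This is only a sketch; the function and constant names are ours, and decimal units (1 GB = 10^9 bytes) are assumed.

```python
PREB = 10**15  # one preb = 10^15 reference-bytes (megareference x gigabyte)


def work_prebs(references: int, space_bytes: int) -> float:
    """Work W = R * S, expressed in prebs."""
    return references * space_bytes / PREB


examples = {
    "million refs into 1 GB": (10**6, 10**9),
    "thousand refs into 1 TB": (10**3, 10**12),
    "billion refs into 1 MB": (10**9, 10**6),
}
for name, (r, s) in examples.items():
    print(name, work_prebs(r, s))  # each case is 1 preb
```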
To convert prebs to dollars, we must consider the cost and latency of the memory fabric. For example, a large high-performance disk memory (and its associated DRAM and CPU subsystems) might have a total purchase cost of $0.10 per megabyte. We might plan to amortize this purchase cost over a million seconds of continuous use, or a few weeks, bearing in mind its rapid obsolescence, the impossibility of attaining 100% utilization, and the costs of capital, maintenance and operation. We should thus "rent" disk memory at $0.10 per megabyte per 10^6 seconds, that is, at 1E-13 dollars per byte·second. If a random access into this memory (to an appropriately-sized block) takes ten milliseconds, our charge should be 1E-15 dollars per byte·reference, or more succinctly, $1/preb.
We formalize the analysis of the previous paragraph by defining the "quality" Q of a memory system as
$$Q = \frac{1}{LC'}$$
where L is its random-access latency (in seconds), and C' is its per-byte purchase cost C amortized over a million seconds. The constant of proportionality in our conversion from C to C' could be adjusted to achieve any desired amortization rate other than our assumed million-second writeoff. Note that we apostrophize C' to indicate that it is a time-derivative of the purchase-cost C.
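As a rough check of this definition, the disk example (about $0.10 per megabyte, ten-millisecond random access, million-second writeoff) can be evaluated directly. The sketch below assumes decimal megabytes and reports Q in prebs per dollar; all names are illustrative.

```python
AMORTIZATION_SECONDS = 1e6  # the million-second writeoff assumed in the text
BYTES_PER_MB = 1e6          # ASSUMPTION: decimal megabytes
PREB = 1e15                 # reference-bytes per preb


def quality_quals(latency_s: float, cost_per_mb: float) -> float:
    """Q = 1 / (L * C'), expressed in prebs per dollar."""
    c_prime = cost_per_mb / BYTES_PER_MB / AMORTIZATION_SECONDS  # $/(byte*second)
    return 1.0 / (latency_s * c_prime) / PREB


disk_q = quality_quals(latency_s=10e-3, cost_per_mb=0.10)
print(disk_q)  # approximately 1 preb per dollar, i.e. about $1/preb
```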
A convenient unit for expressing Q is prebs per dollar. We call this unit a "qual." Our analysis thus far has neglected one important aspect of modern computing, namely parallelism. A well-designed code will mitigate a latency bottleneck by running many memory accesses simultaneously. We capture this effect in our model with the notion of "effective quality," described below.
We say that a multithreaded (or otherwise parallelised) code has an effective quality of
$$Q_{\mathrm{eff}} = PQ = \frac{P}{LC'}$$
where P is the effective degree of memory parallelism, and Q is the quality of the memory system. Our notation is meant to suggest that the multithreaded code could be charged as though it is running in single-threaded mode on an imaginary memory system of quality Qeff. Another, equally valid, interpretation is that users who multithread their code appropriately will receive a "discount" of Qeff/Q = P.
For example, a user who rents a hundred gigabytes of space, distributed over ten disk devices, should be allowed to run P = 10 disk operations "for the price of one" if these operations are issued concurrently to separate devices. If the operations were not issued concurrently, or not issued to separate devices, then the user's effective parallelism P would of course be less than 10.
We are now able to make some rough calculations of the effective quality of typical large-memory architectures available today. For example, a 1024-processor SGI Origin 2000 (S2MP) has a latency L of approximately 2 microseconds into its shared DRAM of at most 2 TB. The effective parallelism into shared DRAM may be as large as Pmax = 4096, because up to four cache misses can be outstanding per processor at any time [5]. The cost C of this DRAM (and its supporting buses, CPUs, etc.) is, we would imagine, about $50/MB in large quantities; that is, a 2 TB machine might cost $100 million. These figures give Q = 1/(LC') = 10 quals for single-threaded (that is, untuned) 2 TB computations. The maximum effective quality Qmax for this architecture is some 4096 times larger, or 40960 quals. The largest actually-achievable quality Qeff will, of course, depend critically on code design and on architectural constraints that may limit effective parallelism into DRAM. Still, if a computation obtains an effective parallelism of P ≥ 200 on this 10-qual shared-memory architecture (Qeff = PQ ≥ 2000), it is economical in comparison to a workstation with 4-way interleaved (P ≤ 4; Q = 500; PQ ≤ 2000) DRAM.
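The Origin 2000 figures quoted above can be reproduced with a short calculation. This is a sketch under the stated estimates (2 microseconds, $50/MB, Pmax = 4096, million-second writeoff) plus our own assumption of decimal megabytes; the function names are illustrative.

```python
def quality_quals(latency_s: float, cost_per_mb: float,
                  amortization_s: float = 1e6) -> float:
    """Q = 1 / (L * C') in prebs per dollar, assuming 1 MB = 10^6 bytes."""
    c_prime = cost_per_mb / 1e6 / amortization_s  # $/(byte*second)
    return 1.0 / (latency_s * c_prime) / 1e15


def effective_quality(parallelism: float, quality: float) -> float:
    """Qeff = P * Q."""
    return parallelism * quality


origin_q = quality_quals(latency_s=2e-6, cost_per_mb=50.0)   # about 10 quals
origin_q_max = effective_quality(4096, origin_q)             # about 40960 quals
workstation_pq = effective_quality(4, 500.0)                 # at most 2000 quals
breakeven_p = workstation_pq / origin_q                      # about 200

print(origin_q, origin_q_max, breakeven_p)
```

The break-even parallelism is simply the ratio of the workstation's best-case PQ to the single-threaded Q of the shared-memory machine.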
In Table 1, we have listed our best current estimates for the maximum capacity S in gigabytes, latency L in seconds, system cost C (divided by the maximum capacity) in $/MB, and the guaranteed not-to-be-exceeded "speed-of-light" effective parallelism Pmax for several architectures including a "network of workstations" connected by a low-latency ATM switch. We have calculated the quality Q = 1/(LC') for these architectures: this can be interpreted as a lower bound on the effective price-performance for latency-bound, single-threaded computations of size at most S. We have also calculated an upper-bounding Qmax that would be observed by any code achieving an effective memory parallelism of Pmax at latency L on that architecture.
Unless our estimates are wildly inaccurate, one implication is clear: cost-sensitive users with problems requiring at least 1 GB of memory should consider developing parallelizable code for a supercomputer rather than a PoPC or a NOW. This is an illustration of a general rule: highly-parallel systems are cost-effective if the parallelism is actually used [17]. The wide gap between Q and Qmax for supercomputers should of course be a matter of concern. Depending on the memory parallelism P actually achieved, a supercomputer might yield either excellent or poor cost-performance in comparison with, say, workstation DRAM.
Our economic bias should be clear by now: we would choose a computer system on the basis of price-performance Qeff, subject to feasibility constraints on S ≤ Smax and perhaps on latency L. Less cost-sensitive users would choose on the basis of L, with feasibility constraints on S and Qeff. Our economic theory would cover either case; however, for simplicity in exposition we will assume that the reader of this paper shares our economic bias.
In Section 2, we make more careful definitions, and explore some of the implications of our charging model on system design and operation. Section 3 examines a wide range of contemporary architectures, analyzing their latency, bandwidth, and parallelism characteristics under our model of "well-formed" memory hierarchies. Section 4 contains a summary of our contributions, and an outline of the research frontier in this area.
2
Definition and Implications of Work and Quality
We define latency L using the "back-to-back load" notion of the lmbench benchmark. That is, our L is the multiplicative inverse of the rate at which a chain of pointers can be followed through memory. We do not subtract a CPU clock period from the back-to-back load time, however, as lmbench's output routines do. Also, we do not follow lmbench's fixed-stride assumption. Instead, we assume that the addresses of the cells in the chain are random variables, unpredictable by the CPU or the memory system, and uniformly distributed over the entire address range of the memory layer. This suggests that our L is somewhat larger than the unit-stride latency measured by lmbench. However, our L, since it is averaged over all strides, is smaller than the worst-case latencies measurable by lmbench.
The definitional confusion over L, described briefly above, arises because we must make some appropriate assumption about temporal and spatial locality of addressing sequences, in order to make a meaningful definition of latency. Our definition is based on the theoretical model of Section 3, although it may also be justified informally as follows. If there is spatial locality in a reference stream, then it should be accommodated efficiently by choosing an appropriate blocksize. If there is temporal locality in a reference stream, then it should be accommodated efficiently by caching in smaller, faster layers. Thus, to a good first approximation, there should be very little spatial or temporal locality in the reference streams seen in each layer of large-scale, hierarchical memory systems, beyond that captured by the blocksize and caching mechanisms.
We are not particularly concerned with modelling stride-k vector-memory operations, for fixed k > 1, as a phenomenon that is somehow distinct from other forms of memory parallelism. For example, we would count a stride-k load of vector length 100 as 100 independent one-word loads. Most high-performance memory systems will exhibit large effective parallelism P under such conditions. We would assume P = 100 in this case, for memory layers with parallelism of at least 100 under our analytic model of Section 3.
A final, and fundamental, difference between our latency L and that measured by lmbench is that we would measure a block-load latency, where the blocksize B is suitable to the memory layer in question. We will return to this point later in this section, for now noting only that in Section 3 we will define the appropriate blocksize as B = L/G, where G is the average "gap" (measured in seconds/byte) in the data stream resulting from the memory operation.
We suggest the use of a fully-depreciated cost including an allowance for the buses, backplanes, etc. that are required for memory upgrades. For example, a commodity DRAM chip might cost $10/megabyte at the present. In an efficiently-managed computer center installation, it might be appropriate to amortise each DRAM chip over 10^7 seconds of operation. For DRAM in a lightly-utilised workstation, amortisation over 10^6 seconds of use might be more appropriate. In either case, we would expect the purchase price of the DRAM chip to be recovered within three to six months (ten to twenty million seconds). In another six to twelve months, the cost of the motherboard, system maintenance, operation, and other fixed and variable costs might be recovered. The resulting payback period is, we believe, typical of modern corporate investment decisions. In this paper we generally assume, when illustrating our analytic techniques, that a million-second amortisation is appropriate for all memory devices. We recommend that all our readers examine this assumption carefully, adjusting our calculations as necessary to fit any specific installation.
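Readers adjusting the amortisation assumption may find it useful to see how the rental rate C' responds. The sketch below evaluates the $10/megabyte DRAM example under the two writeoff periods mentioned above; the function name is ours and decimal megabytes are assumed.

```python
def rental_rate(cost_per_mb: float, amortization_s: float,
                bytes_per_mb: float = 1e6) -> float:
    """Amortized cost C' in dollars per byte-second."""
    return cost_per_mb / bytes_per_mb / amortization_s


busy_center = rental_rate(10.0, 1e7)         # heavily used: about 1e-12 $/(byte*s)
light_workstation = rental_rate(10.0, 1e6)   # lightly used: about 1e-11 $/(byte*s)
print(busy_center, light_workstation)
```

Since Q = 1/(LC'), a tenfold longer writeoff period raises the computed quality of the same hardware by a factor of ten.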
Side constraints. In this section we will assume that contemporary architectures are reasonably well-designed, and thus that side-constraints are unnecessary. However, if in the future our Qeff = P/(LC') metric were used to design or market systems, then side-constraints and/or other performance metrics would become necessary. Otherwise we may find ourselves purchasing a system with huge amounts of low-latency, highly-parallel memory, but with insufficient memory bandwidth and CPU resources for any useful computation! Arguably the best approach to defining and enforcing side-constraints on "quality" memory systems, other than the analytic technique suggested in the next section, or a "balance" formula [10], would be to develop a benchmark suite of problems with known work W = RS. As a simplistic example, we could model a large, batched, database-query problem as a million 1000-way gather operations (R = 10^9) on a 100 gigabyte dataset (S = 10^11) composed of a hundred million 1000-byte records. The addresses in these operations should obey some reasonable (perhaps Zipfian) locality-of-reference rule, to permit some (small) speedup from caching the results of prior gathers. Also, the addresses for each gather operation should be computed as a function (perhaps XOR) of the result of the previous gather, to prevent more than 1000-fold memory parallelism. This benchmark would require at least W = RS = 10^9 · 10^11 = 10^20 reference-bytes, that is, 10^5 prebs (petareference-bytes), of work on any computer, as long as we all agree that a "reasonable" blocksize for a 100 GB memory must be at least 1000 bytes. Somewhat more work W may be required on inefficient systems, as discussed below.
A system costing C dollars that completes 10^5 prebs of work in T seconds must be delivering 10^5/(CT/10^6) prebs/dollar (amortizing C over a million seconds), that is, its effective quality Qeff is 10^11/(CT) quals.
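A small sketch of this benchmark arithmetic follows. It assumes that C is the total purchase cost of the system, written off over a million seconds; the $1 million system and 10,000-second run time in the usage lines are hypothetical numbers chosen only to exercise the formula.

```python
PREB = 10**15  # reference-bytes per preb


def benchmark_work_prebs(gathers: int = 10**6, fan_in: int = 1000,
                         dataset_bytes: int = 10**11) -> float:
    """W = R * S for the batched gather benchmark, in prebs."""
    references = gathers * fan_in          # R = 10^9 block references
    return references * dataset_bytes / PREB


def measured_quality_quals(work_prebs: float, system_cost_dollars: float,
                           elapsed_s: float, amortization_s: float = 1e6) -> float:
    """Effective quality: work delivered per amortized dollar consumed."""
    dollars_consumed = system_cost_dollars * elapsed_s / amortization_s
    return work_prebs / dollars_consumed


w = benchmark_work_prebs()                        # 10^5 prebs
q_eff = measured_quality_quals(w, 1e6, 1e4)       # hypothetical system and run time
print(w, q_eff)                                   # q_eff equals 1e11 / (C * T)
```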
If bottlenecks such as inefficient operating-system paging policies, memory bandwidth, synchronization, or address-translation impede the workflow, then this will be reflected in the Qeff calculated for the known-W benchmark suite. These bottlenecks could be discovered and perhaps repaired more easily, we believe, if the W measure were reported by work-sensitive system accounting routines.
A work-sensitive accounting scheme. Our work-sensitive vision of system accounting is easily described. Each user should be notified of the work W = RS done on their behalf by the operating system during each reporting interval, which might conveniently be a second, an hour or a day. In order to allow such a report to be made, the computer system must count each user's memory references on each layer of the memory hierarchy, and estimate their memory occupancy S_i at the time of each reference at memory layer i. The work-update rule is thus W_i = W_i + S_i, triggered whenever a reference occurs on layer i. Ideally, this statistic would also be gathered for each of the user's process groups, processes and threads to aid in performance tuning.
Users could be charged by the actual work W_i done on their behalf. Under this scheme, users with large memory allocations will be charged more for each cache miss than are users with small memory allocations. The cost-sensitive user would certainly notice, and complain about, the charges under this scheme if their job were run with an inefficiently-large memory allocation! Similarly, if their job were run with an inefficiently-small memory allocation, they would incur extra charges under this scheme due to excessive page-fault activity. To put it another way, charging by work-done would put pressure on designers to deliver systems capable of work-efficient operation. It would also put pressure on users to run codes that are capable of being run efficiently, and to choose systems that can run their codes efficiently.
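The per-layer work counter and the resulting charge might be sketched as follows. This is a user-level illustration, not an operating-system implementation; the class and method names are hypothetical, and the charge assumes a fixed quality (in quals) for each layer.

```python
from collections import defaultdict

PREB = 1e15  # reference-bytes per preb


class WorkAccount:
    """Accumulates W_i per memory layer for one user (illustrative only)."""

    def __init__(self) -> None:
        self.work = defaultdict(float)  # layer name -> W_i in reference-bytes

    def record_reference(self, layer: str, occupancy_bytes: float) -> None:
        # Work-update rule: W_i = W_i + S_i on every reference to layer i.
        self.work[layer] += occupancy_bytes

    def charge_dollars(self, quality_quals: dict) -> float:
        # Dollars = prebs of work / (prebs per dollar), summed over layers.
        return sum(w / PREB / quality_quals[layer]
                   for layer, w in self.work.items())


account = WorkAccount()
for _ in range(1000):                       # 1000 page faults...
    account.record_reference("disk", 1e9)   # ...while occupying 1 GB
print(account.charge_dollars({"disk": 1.0}))  # about $0.001 at 1 qual
```

Doubling the occupancy at the time of each fault doubles the charge, which is the pressure on memory allocation described above.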
Our favourite accounting scheme. Our favourite proposal for system charges is somewhat simpler to explain and to implement. Each user should "rent" their memory capacity at a fixed charge per megabyte-second. This rental charge should be set high enough to cover an appropriate amount of "active usage" of this memory. Roughly speaking, someone who is renting half of a memory layer (say, occupying half of the capacity of a paging device) should get half of the total service (e.g. page faults/second) available from that layer. In our economic model, it is easy to establish an appropriate rental charge. Someone renting space S should be charged C' dollars per byte-second. This user should be assured of the possibility of obtaining up to P/L memory references per second, at a latency of L, if their code is properly tuned to take advantage of the available parallelism. Exact values of P and L may be difficult to obtain, but roughly appropriate values could be obtained by analytic means (see our Section 3), by an appropriate benchmark, or by observation of the effective quality being delivered on this system in some prior period.
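A minimal sketch of the rental calculation follows, reusing the disk figures from the Introduction (C' of 1E-13 dollars per byte-second, ten-millisecond latency, a hundred gigabytes spread over ten devices). The function names are ours, and the one-hour rental period is an arbitrary illustration.

```python
def rental_charge(space_bytes: float, c_prime: float, duration_s: float) -> float:
    """Rent in dollars for holding S bytes for a given time at rate C'."""
    return space_bytes * c_prime * duration_s


def max_reference_rate(parallelism: float, latency_s: float) -> float:
    """Service the renter should be able to obtain: up to P / L references/second."""
    return parallelism / latency_s


rent_per_hour = rental_charge(space_bytes=1e11, c_prime=1e-13, duration_s=3600.0)
service_rate = max_reference_rate(parallelism=10, latency_s=10e-3)
print(rent_per_hour, service_rate)  # about $36 per hour, about 1000 references/second
```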
Our "rental" scheme for charging has the virtue of simplicity, but it does pose some risks. The user must trust the computer center to provide systems capable of work-efficient operation, and to operate these systems in a work-efficient manner. The computer center must trust the user, or better, their resource scheduler, not to allow CPU-bound and bandwidth-bound jobs to impede the progress of latency-bound ones. Only the latter "pay the rent." The others will get a "free ride" under this accounting scheme, implying that our rental scheme will only be attractive to computer centers if and when most jobs of economic importance are latency-bottlenecked.
Accounting for quality. A work-sensitive operating system should, we believe, collect statistics on the effective memory quality Qeff = P/(LC') being delivered on each layer of memory to each of a user's threads, processes, and process groups. This will help knowledgeable users understand the charges they incur, and perhaps give them enough information to tune their codes. The value C' is a constant, depending only on the layer in question. The value P is an estimate of the effective parallelism (queue length, or number of outstanding requests) for that thread, process or group on that memory layer. Ideally, this would be evaluated at the time of each reference-completion, along with the latency L incurred by that reference and the time t since the previous reference-completion. A suitable update rule would then be
$$Q_{\mathrm{eff}} = (1 - t/T)\,Q_{\mathrm{eff}} + \frac{tP}{TLC'},$$
where T is an appropriately-large constant for sporadic time-averaged reporting. Each time this report is printed, the knowledgeable user will also want to see the current total W and the difference in W since the last report. These last two statistics are the analogues, in the work-accounting scheme, of total CPU-seconds and %CPU in traditional system accounting.
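The update rule is an exponentially-weighted moving average of the instantaneous quality P/(LC'), with each sample weighted by the fraction t/T of the averaging window it covers. A minimal sketch is given below; the function name is ours, and clamping the weight to at most 1 (for t > T) is our own addition rather than part of the rule.

```python
def update_effective_quality(q_eff: float, p: float, latency_s: float,
                             c_prime: float, t: float, T: float) -> float:
    """One step of Qeff = (1 - t/T) * Qeff + (t/T) * P / (L * C')."""
    weight = min(t / T, 1.0)               # ASSUMPTION: clamp when t exceeds T
    sample = p / (latency_s * c_prime)     # instantaneous quality of this reference
    return (1.0 - weight) * q_eff + weight * sample


# Disk-like layer: L = 10 ms, C' = 1e-13 $/(byte*s), two requests outstanding.
q = 0.0
q = update_effective_quality(q, p=2, latency_s=10e-3, c_prime=1e-13, t=10.0, T=100.0)
print(q / 1e15)  # running estimate in quals after one update
```

A steady stream of references at constant P and L converges to P/(LC'), as expected of a time average.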
To aid in code-tuning, quality Qeff = P/(LC') values might be averaged over distinct, fixed-length time intervals and reported as a time-series to a user interested in profiling a task's memory performance. If work W = RS values were also averaged and reported over the same intervals, then a knowledgeable user would be able to spot time periods in which their code was not running efficiently. Additional statistics, especially the corresponding time series for P, L and S, would help to diagnose the problem.
These measurements of Q and W seem to us to be feasible for all existing systems, with the possible exception of the fastest (processor cache) layers of the memory hierarchy. If measurements on the fast layers are infeasible, we would recommend estimating them by inference from the usual CPU-second measurement, under the assumption that the CPU is fully saturating its caches at all times. We can now state our third, and most complex, proposal for a rational charging structure. Each user could be assessed S/Q dollars each time a memory reference is made on their behalf, on each layer of memory for which S and Q statistics are being collected.
Choosing a scheme. From a theoretician's perspective, there is only a "constant factor difference" in our charging schemes. As long as there are only a constant number of chargeable memory layers, and as long as the most-expensive layer has a non-zero load average, then it doesn't matter "to within a constant factor" whether we charge only for usage on the most-expensive layer, or for usage on all layers.
We would thus suggest that market analysts and perhaps systems designers, rather than theoreticians, be employed to choose an accounting scheme. We have suggested several alternatives, each of which has some "theoretical" virtue and some "rational" basis. Our favourite space-rental scheme is quite simple. Our work-charging scheme (under an assumed constant Q) is arguably inappropriate when running custom codes, because the effective Q for these codes will be difficult to estimate accurately. It might, however, be appropriate for users running standardized, well-tuned codes with predictable Q: perhaps linear-program solvers might fit in this category. Our final proposal, that of charging by S/Q with both S and Q being runtime estimates, has many nonlinearities and hence will not be easily comprehended. It has the signal virtue, from the users' perspective, of putting responsibility squarely on the shoulders of the computer center to find some economic way to run any code, no matter how wildly variable its memory demands may be.
3
Due to space constraints, we present here only the definitions of our analytic model. Anyone interested in justifications or explanations is referred to our other writings on the subject [14, 15] and to related works by other authors.
The key variables in our performance model are space S, latency L, and effective parallelism P. We treat S and L as independent variables in our model, placing analytic constraints on parallelism P, gap G and blocksize B. As noted earlier, single-stream memory bandwidth is 1/G and blocksize B = L/G. Note that S_h = S. For example, the network of workstations in Table 1 has S = 140 GB, so h = 5 and δ = 1.358.
Latency L_i is the latency for a transfer from layer i to layer i + 1:
$$L_i = L_0 (S_i/S_0)^{\alpha} \qquad (8)$$
where an appropriate value of α can be determined by noting that
$$L_{h-1} = L \quad\text{and}\quad L_0 = 10^{-8}\ \text{seconds} \qquad (9)$$
for contemporary high-performance microprocessors, yielding
$$\alpha = \frac{\log(L/L_0)}{\log(S_{h-1}/S_0)}.$$
For example, the network of workstations in Table 1 has L_4 = 20 microseconds, so α = 0.57.
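The two worked examples (δ = 1.358 and α = 0.57) can be approximately reproduced numerically, but only under assumptions that are not stated in the surviving text. The sketch below assumes that layer sizes follow S_i = S_0^(δ^i), that S_0 = 256 bytes, and that L_0 = 10^-8 seconds. These three choices are our guesses, selected because they come close to the quoted values; they should not be read as the model's actual definitions.

```python
import math

S0_BYTES = 256      # ASSUMPTION: size of the smallest layer, not given in the text
L0_SECONDS = 1e-8   # ASSUMPTION: latency of the smallest layer


def delta(total_bytes: float, layers: int, s0: float = S0_BYTES) -> float:
    """Solve S_h = s0 ** (delta ** h) for delta (assumed layer-size rule)."""
    return (math.log(total_bytes) / math.log(s0)) ** (1.0 / layers)


def layer_size(i: int, d: float, s0: float = S0_BYTES) -> float:
    """Assumed size of layer i in bytes: S_i = s0 ** (d ** i)."""
    return s0 ** (d ** i)


def alpha(top_latency_s: float, s_h_minus_1: float,
          l0: float = L0_SECONDS, s0: float = S0_BYTES) -> float:
    """Solve L_{h-1} = L_0 * (S_{h-1} / S_0) ** alpha for alpha."""
    return math.log(top_latency_s / l0) / math.log(s_h_minus_1 / s0)


# Network of workstations: S = 140 GB, h = 5, L_4 = 20 microseconds.
d = delta(140e9, 5)
a = alpha(20e-6, layer_size(4, d))
print(round(d, 4), round(a, 3))  # close to the quoted 1.358 and 0.57
```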
Similarly, we define blocksizes B_i as power functions on a current technological parameter B_0 (the size, in bytes, of a minimally-efficient transfer to register from L1 cache) and some as-yet-unknown β:
This constraint implies that parallelism grows as the square root of the total number of blocks in a layer. We believe that our analytic model is accurate to within a factor of ten or so on all current architectures based on high-performance microprocessors. We hope to increase its accuracy, perhaps to within a factor of five, in future work.
Summary and Open Questions
We have defined a novel concept of work W = RS as the product of the number R of random-access block-references and the size S of the space into which they are made. Strictly speaking, this concept is only applicable to latency-bottlenecked tasks. If appropriate side-constraints on CPU and memory bandwidth are enforced, however, work W would be a fully-general measure. Alternatively, work W should be seen as the analogue (for latency-bottlenecked tasks) of CPU-seconds and total memory bandwidth (for CPU-bottlenecked and bandwidth-bottlenecked tasks, respectively).
We have defined memory quality Q = 1/(LC') where L is the latency and C' is the fully-amortized per-byte cost of memory. We further defined Qeff = P/(LC') as the effective quality of a computational task achieving (memory) parallelism P. This notion of cost-performance is generally appropriate for latency-bottlenecked systems, that is, for systems with sufficiently-inexpensive memory bandwidth and CPU resources. In other settings, Q and Qeff should be viewed as analogues of traditional measures of cost-performance, for example MFLOPS/$, MOPS/$ and MBW/$ for FPU-, CPU- and memory bandwidth-bottlenecked systems respectively.
We made rough estimates of the quality of several current architectures, including piles of PCs (PoPCs), networks of workstations (NOWs), shared-memory multiprocessors, etc. These estimates could be refined by someone with more knowledge of the actual parameters of existing systems. We plan to develop benchmarks to measure quality directly.
Our notion of memory quality supports many economic analyses. We briefly outlined its implications for charging schemes at computation centers. In particular, we offer a rationale for the idea that computation centers should "rent" memory space at a fixed charge per byte-second, making no extra charge for standard "computational services" on this space (references, bandwidth, CPU cycles, etc.).
We sketched some performance monitoring techniques that would, we believe, be useful to anyone interested in developing work-efficient codes. We intend to write user-level libraries to provide some primitive support for these techniques on standard-issue workstations and, perhaps, on an SGI Power Challenge.
We defined an analytic model of performance that seems capable of predicting hierarchical bandwidths, parallelism, latencies and sizes, given top-level memory size S and latency L, to within a factor of ten, for existing systems based on high-performance microprocessors. Although our model is still somewhat sketchy in its details, it gives concrete support to the notion that there exists an "appropriate blocksize" for a random reference. Lacking such a model, or some other (benchmark-based?) agreement on blocksizes, our definitions of work W and quality Q would be incomplete.
