I. INTRODUCTION
Contemporary large computer systems frequently employ a memory hierarchy such as that in figure 1, where we show cache memory, main memory, drums, disks and tape. The efficient use of this memory system is crucial to the operation of the whole computer system.
In this paper, we shall examine the memory hierarchy both overall and with respect to its components in an attempt to identify research problems and project future directions for development.
The effects of the design of the memory hierarchy can be considered to fall into two (overlapping) areas: performance and logical view. Performance denotes those aspects of the hierarchy design which affect the measures of performance of the computer system, such as throughput, speed, response time, turnaround time and cost effectiveness. Logical view refers to the view of the memory system given to the user: how is the memory addressed and named, where is the information (virtual vs. real location), how is this information protected, etc. These two aspects interact, since performance is affected by the logical view, and the cleanliness or uniformity of the logical view is often impaired by attempts to let the user easily tune the system to improve its performance.
By far the most fertile direction for new results (research or development) is in the study and design of memory hierarchies of the future, rather than in the optimization of current systems.
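[Figure 1: Contemporary memory hierarchy (CPU, cache memory, main memory, drums, disks, tape). Labels recoverable from the figure: 3 MIPS, 32K bytes, 4 megabytes, 5-10 gigabytes, 1000 to 50000 reels.]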
In figure 2, we show what we believe represents the type of large computer system memory hierarchy that will become common in the early 1980's. Comparing figure 2 with figure 1, one will note that we have added two levels: gap filler technology and mass store. Currently, there are orders of magnitude difference (the access gap) in both cost and performance between random access (MOS) memory and mechanical storage devices such as drums or disks. Much time and effort is expended in most computer systems in finding efficient ways to accomplish the necessary transfers of information across the access gap. A computer system using a level of storage whose technology occupies the access gap could benefit significantly in both improved performance and decreased system complexity.
Even though the cache memory in high end machines is very fast, the processor logic is even faster; thus, large high speed computer systems are memory speed limited. Because of this requirement for high speed, the implementation details of the cache are at least as important as the more general algorithmic features of the design. Our discussion will trend toward the latter, but the importance of the former should not be neglected.
Cache memories will become larger and faster, and will appear on more and more machines. The speed of the cache is dependent on circuit technology, which is improving, and on physical size, which sets a lower bound on propagation time.
The capacity of cache memories is also limited by two factors: cost and physical size (cabinet and board space). Projected increases in density and circuit speed should aid in solving all of these problems. The largest cache memory to be found in an IBM compatible machine is the 64K byte cache in the 3033 processor, first delivered this year. If recent trends continue, this maximum capacity can be expected to double about every 3 years.
Simple cache memories are now appearing in small minicomputers. Some micros already have some of their addressable memory located on the same chip, which makes it more quickly accessible and much like a cache memory, although it is not architecturally transparent. It seems clear that as soon as circuit technology permits (1980?), small hardware managed cache memories will appear on high end microprocessors (off chip access is slower, even for the same technology, than on chip access). This represents a dramatic change from as recently as 10 years ago, when the introduction of a cache memory on the IBM 360/85 [25] was a major advance in computer architecture.
The performance issues in cache memories concern two goals: maximizing the probability of finding needed information in the cache and minimizing the time to access it if it is there. Most of the published research concerns the former. The work on cache mapping algorithms [6,35] is concerned with the first problem. A subset of the cache memory is always searched (in some sense associatively), and the problem is to select the extent of the search. If the address can map to a large number of locations in the cache, there is a higher probability of finding it, but looking takes longer. This is a well understood problem (see [22] for some data) and set sizes of 2 to 8 are commonly chosen.
Selecting the size of the information transfer unit (line size) is also a well understood problem: line sizes of the order of 32 bytes (8-64 bytes) seem to be standard. Prefetching information before it is needed [36] is quite useful for cache memories, although it is not generally implemented.
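The trade-off between set size and the chance of finding a line can be illustrated with a small simulation. The following is a minimal sketch, not a description of any particular machine: the class name, the LRU replacement within each set, and the example parameters are all assumptions made for illustration.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy set-associative cache; LRU replacement within a set is assumed."""

    def __init__(self, num_sets, set_size, line_size):
        self.num_sets = num_sets      # number of sets an address can select
        self.set_size = set_size      # lines searched per access
        self.line_size = line_size    # bytes per line (transfer unit)
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Reference one byte address; return True on a hit."""
        line = address // self.line_size
        index = line % self.num_sets   # which set to search
        tag = line // self.num_sets    # identifies the line within the set
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)     # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.set_size:
            lines.popitem(last=False)  # evict the least recently used line
        lines[tag] = True
        return False

    def miss_ratio(self):
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


# Two caches of equal capacity (64 lines of 32 bytes), hypothetical sizes.
direct_mapped = SetAssociativeCache(num_sets=64, set_size=1, line_size=32)
two_way = SetAssociativeCache(num_sets=32, set_size=2, line_size=32)

# Two addresses that collide in both organizations, referenced alternately.
for address in [0, 2048] * 4:
    direct_mapped.access(address)
    two_way.access(address)

print(direct_mapped.miss_ratio())  # 1.0: the two lines keep evicting each other
print(two_way.miss_ratio())        # 0.25: only the two initial misses
```

The larger set holds both conflicting lines at once, at the price of searching two locations on every access.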
The access time issue, mentioned above, leads to two possible changes in cache architecture, neither of which has been fully evaluated in the published literature. The cache is generally used for both instructions and data. Instructions are accessed by the instruction fetch and decode (I) unit of the CPU, whereas the data is used by the execution (E) unit. The I and E units are relatively separate, can both be simultaneously active, and are usually physically removed from each other in the CPU. If each of the I and E units had its own cache, access time could be decreased and bandwidth to the cache increased. The problem is that the same piece of information may be in both the instruction and data caches (especially in current architectures, where instructions can be modified), and thus consistency is a problem. This consistency problem is the same one that occurs for multiple CPU's, each with its own cache, and can be dealt with in the same way as discussed below. Software strategies can also be used.
Two computers, the S-1 [26] and IBM's 801 [13], both of which represent brand new architectures, have implemented a split cache. The effectiveness of this idea has been studied only once in the published literature [iJO]; both that work and work by the author show that there is a very significant penalty in such an organization in terms of increased miss ratio. Further work is required, though, to see if such an organization is desirable because of its access time advantages. In particular, the miss ratio increase (vs. total bytes of available cache) needs to be quantified, the consistency problem needs to be looked at and the relative size of the two (I/E) caches needs to be determined.
Most large computer systems have virtual memory, by which the (virtual) addresses used by the process are mapped into real physical main memory locations. This is done conceptually by page and segment tables, but to speed access, a Translation Lookaside Buffer (TLB) is employed. The TLB maintains the correspondence between recently used virtual and real memory addresses, so that the segment and page tables seldom have to be referenced. The cache memory in current machines is accessed using a real address, which implies that every cache memory access requires prior virtual to real translation through the TLB. To a large extent the translation and lookup can occur in parallel (all of the relevant lines of the cache are read out initially using the virtual address, and then a selection is made among them using the now available real address [1]) but some time is still wasted. A possibility that has not yet been carefully evaluated in the published literature is that of a virtual address cache, in which virtual addresses are used to access the cache and real addresses are used only to access main memory. There are problems with this approach (e.g. multiple virtual addresses which map to the same real address), so further research is called for.
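The role of the TLB can be sketched in a few lines. This is only an illustration under stated assumptions: the page size, the TLB capacity, the LRU replacement, and the single-level page table (segment tables are omitted) are all hypothetical choices, not those of any particular machine.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TLB:
    """Toy translation lookaside buffer in front of a one-level page table."""

    def __init__(self, page_table, capacity=8):
        self.page_table = page_table   # virtual page number -> real page frame
        self.capacity = capacity       # number of translations retained
        self.entries = OrderedDict()   # recently used translations
        self.table_references = 0      # how often the page table was consulted

    def translate(self, virtual_address):
        """Return the real address for a virtual address."""
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page in self.entries:
            self.entries.move_to_end(page)     # recently used again
        else:
            self.table_references += 1
            frame = self.page_table[page]      # KeyError stands in for a page fault
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # drop least recently used
            self.entries[page] = frame
        return self.entries[page] * PAGE_SIZE + offset


tlb = TLB(page_table={0: 5, 1: 9})
real = tlb.translate(10)        # first touch of page 0: page table referenced
real_again = tlb.translate(20)  # same page: served from the TLB
print(real, real_again, tlb.table_references)
```

Repeated references to the same page are translated without consulting the page table, which is the behavior the TLB exists to provide.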
When a write operation occurs, information is changed and this change must eventually be reflected in main memory. This can be accomplished by (a) writing to cache memory and later copying back to main memory, (b) writing simultaneously to both cache and main memory or (c) writing to main memory only and destroying whatever copy may exist in the cache. The first strategy seems to be the most efficient [32,37], but it results in two or more different copies of the same information. This wouldn't be a problem if all references to that information used the same cache, but that is not necessarily the case in systems with channels or multiple processors. An important problem, yet to be definitively dealt with, is to design a scheme for maintaining memory consistency in a multiprocessor system where each processor has its own cache. IBM solves this problem by sending all stores to both CPU caches [20], which is only feasible for two processor systems; the bandwidth of each cache is insufficient for a larger system. The S-1 computer [26] avoids this problem by enforcing correct operation through the software. Tang [45] has proposed a scheme whereby the main memory forces consistency by keeping track of which caches contain what information. Other methods are also possible, such as one in which each cache maintains a record of what information is potentially shared. There is no generally accepted solution to this problem and further work needs to be done.
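The three write strategies (a), (b) and (c) can be contrasted with a small model. This is a sketch under simplifying assumptions: the cache is unbounded, one address stands for one line, and the policy names "copy_back", "write_through" and "invalidate" are labels chosen here for illustration.

```python
class WriteModel:
    """Toy cache plus main memory, counting main memory writes per policy."""

    def __init__(self, policy):
        if policy not in ("copy_back", "write_through", "invalidate"):
            raise ValueError("unknown policy: " + policy)
        self.policy = policy
        self.cache = {}          # address -> value held in the cache
        self.dirty = set()       # cached addresses not yet copied back
        self.memory = {}         # address -> value held in main memory
        self.memory_writes = 0

    def _write_memory(self, address, value):
        self.memory[address] = value
        self.memory_writes += 1

    def store(self, address, value):
        if self.policy == "copy_back":        # (a) cache now, memory later
            self.cache[address] = value
            self.dirty.add(address)
        elif self.policy == "write_through":  # (b) cache and memory together
            self.cache[address] = value
            self._write_memory(address, value)
        else:                                 # (c) memory only, destroy cached copy
            self.cache.pop(address, None)
            self._write_memory(address, value)

    def copy_back_all(self):
        """Copy every modified line back to main memory."""
        for address in sorted(self.dirty):
            self._write_memory(address, self.cache[address])
        self.dirty.clear()


for policy in ("copy_back", "write_through", "invalidate"):
    model = WriteModel(policy)
    for value in (1, 2, 3):
        model.store(0x100, value)             # three stores to one location
    stale = model.memory.get(0x100) != 3      # does memory lag behind the cache?
    model.copy_back_all()
    print(policy, model.memory_writes, stale)
```

Under copy back, the three stores cost a single main memory write, but until the copy back occurs main memory holds a stale value, which is exactly the consistency exposure described above.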
A major problem with cache memories is address space swapping during task switching. When the processor switches processes, the locality of reference generally changes abruptly and completely, with the result that the information currently in the cache is no longer in use and memory locations accessed by the new process will not be cache resident. This is the problem considered by Easton and Bennett [12], who discuss the difference between warm start (full cache) and cold start (empty cache) miss ratios. Very little has been done to see what the effect of this problem is in real systems or to study how to minimize it. One possibility is two caches, one for user state and one for supervisor state.
In cache memories, therefore, most of the "hit ... [30]). Overall, the difficulty with all of these schemes is that, while main memory is becoming relatively inexpensive, none of these alternatives would be at all cheap; there is therefore a need to show significant benefits from some sort of intelligent memory before a commercially viable design will appear.
There are other minor changes that will occur in main memory design.
It is likely that some hardware will eventually be added to aid in paging, as has been suggested by Denning [10] and Morris [29]. Main memory may become more complex in order to solve the cache consistency problem that was discussed in the last section. These last points are minor items, though, and represent implementation decisions, not research problems. The access time gap has existed since the earliest computers, and multiprogramming currently serves quite adequately to keep the CPU busy while transfers of information across the access gap occur.
IV. GAP FILLER TECHNOLOGY USE
Research in progress by the author indicates that a gap filler device will be very useful in computer systems by the early to mid 1980's and will become necessary by the late 1980's. If one assumes that disk access times remain fixed (see below), it can be shown (using queueing network models or simulations) that as the CPU becomes faster, the degree of multiprogramming has to be increased to maintain full CPU utilization. This process of increasing the degree of multiprogramming runs into two limits as the CPU becomes very fast: (a) the size of the disk system may not be sufficient to permit enough I/O operations to occur in parallel (this problem can be lessened in the case of sequential files by increasing the block size) and (b) a high degree of multiprogramming implies a great deal of main memory, which although it may not be very expensive, is still not free. Further, the number of disk spindles may decrease with increasing disk densities [17].
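The dependence of the required degree of multiprogramming on CPU speed can be illustrated with a simple closed queueing network. The sketch below is not the author's model: it assumes a product-form network of single-server stations (one CPU and several disk spindles) solved by exact Mean Value Analysis, and the service demands are invented for illustration.

```python
def cpu_utilization(n_jobs, cpu_demand, disk_demands):
    """CPU utilization of a closed network with n_jobs circulating jobs.

    Each demand is the total service time (ms) a job needs at that device
    per CPU-I/O cycle. Exact Mean Value Analysis, single-server stations.
    """
    demands = [cpu_demand] + list(disk_demands)
    queue = [0.0] * len(demands)           # mean queue lengths with no jobs
    throughput = 0.0
    for n in range(1, n_jobs + 1):
        # Residence time: own service plus the queue found on arrival.
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        throughput = n / sum(residence)    # cycles completed per ms
        queue = [throughput * r for r in residence]   # Little's law
    return throughput * cpu_demand


def min_degree(target, cpu_demand, disk_demands, limit=200):
    """Smallest degree of multiprogramming that reaches the target utilization."""
    for n in range(1, limit + 1):
        if cpu_utilization(n, cpu_demand, disk_demands) >= target:
            return n
    return None                            # target not reached within the limit


# Assumed workload: one 30 ms disk access per cycle, spread over 4 spindles.
disks = [7.5] * 4
for cpu_ms in (30.0, 10.0, 5.0):           # progressively faster CPU
    print(cpu_ms, min_degree(0.9, cpu_ms, disks))
```

With these assumed demands, the degree of multiprogramming needed for 90% CPU utilization grows as the CPU time per cycle shrinks. Once the CPU demand falls below the per-spindle disk demand, the target cannot be reached at any degree, which corresponds to limit (a) above.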
The result is that while gap filler technology isn't necessary in current computer systems, it will serve to improve cost/performance in very fast future systems for two reasons: (a) it will allow the degree of multiprogramming to be decreased, with a consequent saving in main memory cost and (b) it will relieve the bottlenecks in the disk system that will occur when the degree of multiprogramming approaches or exceeds the number of disk spindles. If gap filler technology (GFT) is to be used in a computer system, the questions of "how" and "where" arise. There seem to be four types of "how":
(1) A separate, stand alone device (e.g. a drum replacement) could be built and used as would any other device.
(2) A GFT device (GFTD) could be used, either with hardware or software management, as a dynamic file migration device: as files were opened, they would be moved to this device.
(3) GFT could be used as an extension of main memory. The memory address generated would refer to the GFT and what we call main memory currently would be the lower (slower) of two levels of cache.
(4) GFT could be used as a cache for I/O streams. Tracks or cylinders of a disk could be buffered dynamically.
Measurements available to the author (to be published eventually) show that alternative #1 above performs relatively poorly and alternative #4 very well. Measurements will soon be available for #2, thus completing the currently possible comparisons.
There doesn't seem to be any way to directly evaluate the performance of #3, since there isn't any obvious way to determine how a system would behave if addressable main memory (i.e. the address space) increased by an order of magnitude or more, while "fast" memory didn't.
One possible direction for disk development is to make disks "intelligent", since the cost of logic is declining rapidly with respect to the cost of the mechanical components of the disk. This is being done with drums, as noted earlier, in the RAP (Relational Associative Processor) system at the University of Toronto [30]. In that case, the storage device is programmed to search for the desired record.
Similarly, the disk could be allowed to do its own error correction [It], a task which is currently allocated to one or more of the controller, channel, or CPU.
Overall, therefore, we see only three real research problems having to do with disks: intelligent disks, disks associated with gap filler technology, and disk (or I/O system) strategy routines. The remaining issues are ones of either straightforward development or more extensive implementation and publicity. The directions for development are much the same as those for research: more logic (intelligent or not) will appear in the disk spindle or controller and this logic will serve to operate, correct errors and buffer the disk.
VI. MASS STORAGE
As noted earlier, mass storage devices (storage on the order of a trillion bits or more) have finally achieved commercial acceptance. Associated with this new capability are of course some new and interesting research problems.
All current and projected mass storage devices have long access delays; thus it is important to keep frequently used data sets resident on faster devices. We call the problem of deciding when to move information from mass store to disk, and later from disk back to mass store, the file migration problem.
Over the next few years, the most visible changes that will occur in the mass store area will be in the development of larger and faster mass store devices. Less visible but no less important will be improved operation for these devices in terms of algorithms for migration, file placement, reliability and recovery. Also, we can expect to see somewhat more intelligent system software, which isolates the user from the details of the device to a greater extent than at present.
VII. LOGICAL AND USER VIEW PROBLEMS
One of the most important problems in a memory hierarchy, and one which is not associated with any one type of technology or level, is the user's view of the memory hierarchy.
It was proposed many years ago (e.g. [7]) that the user be given a very large virtual address space, sufficient to encompass not only main memory, but the entire program and data space of his process. This virtual address would be mapped dynamically by the system to the physical storage, and the user would be encouraged to remain unaware of the physical location and attributes of the data. There is nothing new in this idea, yet despite its obvious (to the author) advantages, it has been implemented to only a very limited extent in most systems. An important and pressing development problem is the introduction of such logical/physical independence for the memory hierarchy into new or existing operating systems.
Despite the comments in the above paragraph, there is some question as to how extensive the address space should actually be. Is it reasonable, for example, to make mass storage byte addressable, with the consequent cost of 40 or more bits per address? There is clearly the need for a means of dealing with mountable volumes (tapes, disks). It is probably desirable to allow for dynamic mapping from one set of logical names (directory and file names) to another (binary virtual addresses). Synonyms (many virtual addresses mapping into the same physical location) will probably occur, and there are problems involved in determining how and when synonyms should exist and how they should be handled because of problems with consistency. Structures for name spaces, such as directory structures, have been studied but without definite conclusions; there is room for further work.
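The address width implied by byte addressing can be estimated directly from the capacity. The following is a rough sketch: the function name is hypothetical, 8-bit bytes are assumed, and the two capacities are illustrative (a trillion bits, the order of magnitude mentioned earlier for mass storage, and ten times that).

```python
def byte_address_bits(capacity_in_bits):
    """Bits needed to give every 8-bit byte of the store a distinct address."""
    num_bytes = capacity_in_bits // 8
    # The largest address is num_bytes - 1; its binary length is the width.
    return max(1, (num_bytes - 1).bit_length())


print(byte_address_bits(10**12))  # 37 bits for a trillion-bit store
print(byte_address_bits(10**13))  # 41 bits for a store ten times larger
```

Each tenfold growth in capacity adds a little over three bits, so addresses of 40 or more bits follow once the store grows by roughly an order of magnitude beyond a trillion bits.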
Another aspect of memory hierarchies that has not been fully developed is the interface between operating systems and data base systems. A data base system can be considered to be a powerful command (query) language on top of a very sophisticated file system. Both command languages and file systems are part of operating systems;
