In this paper we evaluate the memory system behavior of two distinctly different implementations of the UNIX operating system: DEC's Ultrix, a monolithic system, and Mach 3.0 with CMU's UNIX server, a microkernel-based system. In our evaluation we use combined system and user memory reference traces of thirteen industry-standard workloads.
Introduction
In this paper we quantitatively evaluate the memory system behavior of two different implementations of the UNIX operating system. One system, DEC's Ultrix, has a monolithic structure. The other, Mach 3.0 with CMU's UNIX server [1, 21], has a microkernel structure. Both systems are derived from 4.2 BSD UNIX and share a nearly identical application programming interface, as well as large amounts of code. We explore these two systems within the framework of seven popular assertions about the memory reference behavior of modern operating systems. These assertions, listed in Table 1-1, arise from past experiences [16], extrapolated microbenchmarks [9, 31], and extensive measurements of real systems running real programs [3, 4, 5, 14, 15, 28, 35, 36]. Our evaluation relies on combined system and user memory reference traces generated through software instrumentation of the systems running a broad selection of workloads.
Previous trace-based studies have focused on variations in memory system structure [2, 3, 4, 10, 13, 28, 32], multiprocessors and multiprocessor workloads [35, 36], or subcomponents of the memory system [30]. In contrast, our goal is to explore the impact of operating system structure on the performance of a complete uniprocessor memory system. The rest of this paper is structured as follows.
In Section 2 we describe our trace methodology and present a broad summary of our measurements. In Section 3 we discuss the major differences in system behavior and performance between Mach and Ultrix. In Section 4 we evaluate the monolithic and microkernel implementations in the context of the assertions in Table 1-1. Finally, in Section 5 we summarize our results.
Trace overview
We measured the behavior of Ultrix and Mach running the thirteen industry-standard workloads described in Table 2-2. Each program and operating system was instrumented with epoxie [10, 37], which is a program that rewrites assembly code to record a complete address trace.

Table 1-1: Assertions and implications

1. Assertion: The operating system has less instruction and data locality than user programs [14, 15].
   Implication: The operating system isn't getting faster as fast as user programs.
2. Assertion: System execution is more dependent on instruction cache behavior than is user execution [35].
   Implication: A balanced cache system for user programs may not be balanced for the system.
3. Assertion: Collisions between user and system references lead to significant performance degradation in the memory system (cache and TLB) [30, 35, 36].
   Implication: A split user/system cache could improve performance.
4. Assertion: Self-interference is a problem in system instruction reference streams [28, 35].
   Implication: Increased cache associativity and/or the use of text placement tools could improve performance.
5. Assertion: System block memory operations are responsible for a large percentage of memory system reference costs [31, 35].
   Implication: Programs that incur many block memory operations will run more slowly than expected.
6. Assertion: Write buffers are less effective for system (as opposed to user) reference streams [5, 18].
   Implication: A write buffer adequate for user code may not be adequate for system code.
7. Assertion: Virtual page mapping strategies can have significant impact on cache performance [25, 29].
   Implication: Systems should support a flexible page mapping interface, and should avoid default strategies that are prone to pathological behavior.

We collected our traces on a DECstation 5000/200 system running Ultrix and Mach 3.0 with CMU's UNIX server. We ran the programs one at a time in single-user mode. The only activity was due to the program itself, the kernel, and in the case of Mach, the user-level operating system server. Some of the experiments described in this paper required that we run the same program several times against different simulated memory systems. To ensure consistency from run to run, the system was rebooted before every experiment.
The trace is fed into a simulation of the DS5000/200 memory system using the parameters shown in [...]

Ultrix is a monolithic system in which all operating system code is implemented in the kernel. A program running on Ultrix invokes the operating system through a system call interface. In contrast, Mach 3.0 is a microkernel that exports and implements a small number of orthogonal abstractions including interprocess communication (IPC), threads, and virtual memory.
Higher-level operating system services are implemented in a user-level process called the UNIX server. A program running on Mach 3.0 contacts the UNIX server through the Mach kernel's IPC interface [19] [...] from the original BSD abstractions [6], and is relatively machine-dependent. Mach uses a more flexible and aggressive virtual memory system which is partitioned into a machine-dependent and a machine-independent layer [33].
For our cross-system comparisons, the major UNIX components of Ultrix and CMU's server are similar but not identical.
Although both systems are derived from the same code base, they have matured in different environments. We have nevertheless attempted to eliminate obvious superficial differences between the two systems. For example, both systems are compiled at the same optimization level with DEC's Ultrix compiler from MIPS Computer Systems, and both systems use a large file buffer cache (12 MB).
Sources of distortion
A traced program is both larger (about a factor of two) and slower (about a factor of 15) than its untraced counterpart. An address trace contains virtual addresses, yet the actual and simulated caches are indexed by physical addresses. A third source of distortion is therefore due to the simulator's model of the virtual memory system's page mapping strategy. The Ultrix mapping strategy, similar to the one described in Table 2-1, is deterministic. In contrast, Mach's strategy is random (a virtual page is bound to the first free physical page on the free list). This difference can have a measurable impact on system behavior. (We isolate the effects of the page-mapping strategies in a later section.) Our simulator uses the deterministic mapping strategy described in Table 2 [...] Table 2-4. For Ultrix, system behavior is confined to the kernel.
For Mach, system behavior includes the kernel, the UNIX server, and the emulation library.
For a given workload, Ultrix issues more disk requests than Mach, resulting in greater idle instruction counts and delays. Ultrix is more conservative than Mach's UNIX server in forcing meta-data updates to disk. Additionally, programs under Mach are demand-paged, whereas under Ultrix they are loaded entirely at program startup, sometimes leading to unnecessary disk accesses.
On average, we saw about 1.4 times more I/O requests for workloads under Ultrix than for Mach. Because this difference in I/O behavior is orthogonal to the issue of kernel architecture, we exclude idle references from our remaining discussion.
For workloads that rely heavily on UNIX services, the combined Mach system components (microkernel, UNIX server, and emulation library) execute more instructions and generally require more data references than Ultrix.
Memory cycles per instruction
We use our simulation results to calculate memory cycles per instruction (MCPI), which is the number of CPU stall cycles due to the memory system divided by the number of instructions executed. MCPI is one of several components of cycles per instruction (CPI), which is a metric commonly used to evaluate computer systems [22]. Other components of CPI, such as one cycle per instruction for instruction execution, interlocks during multiply, divide, and floating point operations, and no-ops inserted by the compiler for load and branch delays, remain relatively constant even as processor cycle time decreases. In contrast, MCPI is a function of the ratio of memory speed to processor speed, is less dependent on processor architecture, and will dominate overall CPI if current trends in processor and memory speed continue.

  5876    1270     51 96     149 98     1515 17     2015 38
  2495     914     43 93     140 98     8965  3     9425  8
 13220    2809     69 89     166 96     6893  6     7283 11
 63684   27027    485 42     999 71     8257 12     9941 27
  5555    2225     70 96     215 97     4778 12     5452 23
689324  247969  15612 52   28619 73   223588 26   295572 44
 21601    8069    187 45     344 70    32034  2    32652  3
  1005       0    222 61    2004 54   468287  2   467668  2
 10632       0    126 88     254 97   296691  1   297901  1
 17102    5667   4135 21    3735 19   139172  2   139036  2
 18474    4983   6239  5    6292  7   122009  0   123020  1
  1585     639     21 93      72 98     7968  2     8176  4
 10823     134    138 82     326 84   967474  0   968074  0

Table 2 [...]

As mentioned, we have excluded idle-loop activity from our MCPI calculations. The idle loop rarely misses in the cache, so a system could achieve an artificially low MCPI by executing an arbitrarily large number of idle instructions. Assuming an idle loop CPI of 1, the cycle count is then equal to the sum of the non-idle [...] instruction count. As an example, consider gcc with a cycle count of (29318000 [...]
Dividing the cycle count by the clock speed (25 MHz), we compute a runtime of 4.22 seconds, which is close to the actual runtime (Table 2-2) of 4.20 seconds. Using MCPI to compute execution times for the Mach workloads is less accurate as Mach's page mapping strategy is non-deterministic.
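The relationship between MCPI, cycle count, and runtime can be sketched as follows. This is a minimal illustration rather than the tooling used for the measurements: the function names are hypothetical, and the base CPI of 1 for non-idle instructions is an assumption that ignores the other CPI components discussed above. Only the 25 MHz clock and the idle-loop CPI of 1 come from the text.

```python
CLOCK_HZ = 25_000_000  # clock speed stated in the text (25 MHz)


def mcpi(stall_cycles, instructions):
    """Memory cycles per instruction: memory stall cycles / instructions executed."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return stall_cycles / instructions


def cycle_count(non_idle_instructions, stall_cycles, idle_instructions, base_cpi=1.0):
    """Total cycles, assuming (hypothetically) base_cpi cycles per non-idle
    instruction, plus memory stall cycles, plus an idle loop running at CPI 1."""
    return non_idle_instructions * base_cpi + stall_cycles + idle_instructions


def runtime_seconds(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count into seconds at the given clock speed."""
    return cycles / clock_hz
```

For instance, a runtime of 4.22 seconds at 25 MHz corresponds to roughly 105.5 million cycles, so `runtime_seconds(105_500_000)` evaluates to about 4.22.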
Comparative system behavior
As shown in the previous section, the most significant difference between Mach and Ultrix is the number and cost of non-idle instructions required to run an application. In this section we discuss the influence that major system components have on system performance.
In Figure 3-1 [...] Second, Mach's virtual memory system executes more instructions than the one implemented in Ultrix, which has been flattened into a single machine-dependent layer. Mach has an additional machine-independent layer that is more costly than either system's machine-dependent layer.
4. S-MCPI is computed as system cycles / system instructions, and differs from MCPI due to the system in that it includes only system instructions.
Seven assertions
In this section we evaluate the strength of the seven assertions enumerated in the introduction. Our basic strategy is to address each assertion in the context of our traces. In several cases we present the results of additional simulations in which we vary the base architecture to determine the sensitivity of system performance to the assertion in question.
System and user locality
As cache behavior is an indication of locality, Table 2-3 supports the first assertion: the operating system has less instruction and data locality than user programs.
The system can contribute up to 51% of non-idle instruction cache references, but in most cases (17 of 26) the system contribution is less than 10%. Given this, a disproportionately large number of instruction cache misses are due to the system (greater than 70% for two-thirds of the workload/system pairs).
In terms of data references, the system contributes a larger percentage of misses than references, again supporting the assertion that the system's data locality is worse than the user's.
Even so, in only five of the workload/system combinations does the system contribute more than 90% of data misses, and only twelve if the threshold is lowered to 50%. Although the system's contribution of instruction and data references are com[...]

[...] cache behavior between Ultrix and Mach is small. As Mach incurs a larger number of cache misses than Ultrix, and as nearly every additional cache miss is due to the system, the percentage of misses due to the system is larger.

[...] Mach and Ultrix that was represented visually in Figure 2-1. Memory penalties due to system instruction and system data references are larger for Mach than for Ultrix, while user memory penalties are similar.
Increased system activity in Mach, as is shown in Figure 3-[...]

Although our separate user and system caches double the effective cache size, the general dominance of the two leftmost components in Figure 4-1 indicates that they do not significantly reduce miss rates relative to a smaller unified cache. The largest interference effects (for example, lisp) occur when the cache miss rate is low, such that a few interference misses can result in a large relative change. The absolute contribution of competition misses to MCPI is shown in [...]
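One way to estimate user/system competition from a combined trace is to compare a unified cache against separate user and system caches of the same size. The sketch below assumes a direct-mapped cache and a trace of hypothetical `(domain, address)` pairs; the function names and the cache parameters used in the example are illustrative and are not taken from the simulator described here.

```python
def direct_mapped_misses(addresses, cache_bytes, line_bytes):
    """Count misses for an address stream in a direct-mapped cache."""
    num_lines = cache_bytes // line_bytes
    resident = [None] * num_lines      # line number currently held in each slot
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        slot = line % num_lines
        if resident[slot] != line:
            misses += 1
            resident[slot] = line
    return misses


def competition_misses(trace, cache_bytes, line_bytes):
    """Misses in a unified cache beyond those seen when user and system
    references each get a private cache of the same size.
    trace: iterable of (domain, address), domain being 'user' or 'system'."""
    trace = list(trace)
    unified = direct_mapped_misses([a for _, a in trace], cache_bytes, line_bytes)
    user = direct_mapped_misses(
        [a for d, a in trace if d == "user"], cache_bytes, line_bytes)
    system = direct_mapped_misses(
        [a for d, a in trace if d == "system"], cache_bytes, line_bytes)
    return unified - (user + system)
```

With a toy 64-byte cache and 16-byte lines, a trace that alternates between user address 0 and system address 64 misses on every reference in the unified cache but only once per domain in the split caches, so four of its six unified-cache misses are attributed to competition.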
Voluntary context switches
In a client-server system such as Mach, voluntary context switches can occur every time the client and server interact through IPC. For the workloads we consider, the cache miss penalty following a voluntary context switch is not significant. On the client side, where the instruction cache miss rates are generally low but data cache miss rates are high, the cost of reloading the cache after a context switch is amortized over a large number of instructions. On the server side, instruction and data locality are already poor, limiting the impact of interleaved user references. This behavior is consistent with earlier results on competition in client-server systems [28]. However, the penalty from competition clearly depends on the client-server system in question. Recent studies of the X11 window server, for example, have shown that larger programs and more frequent voluntary context switches create more severe penalties [12].
TLB behavior
The Ultrix kernel binary runs in unmapped kernel memory, largely isolating it from the TLB.
In contrast, only Mach's microkernel component runs unmapped; the UNIX server and emulator run in mapped memory.
Earlier research has shown that this structure can cause a significant increase in TLB activity [5, 30]. Table 4-3 [...] In terms of MCPI, though, the absolute contribution of system TLB misses to performance is generally not large, as shown by the last four columns of Table 4-3. Moreover, high TLB MCPI is an indication of poor locality, which is also reflected in more severe cache penalties. Even in runs with the most extreme behavior, TLB penalties are consistently dominated by cache penalties (Table 4-1) for both Ultrix and Mach.
System self-interference
Self-interference occurs when insufficient cache associativity results in cache misses. The impact of self-interference in user code is well understood [23]. To evaluate the impact of system self-interference, we simulated a two-way LRU set-associative cache of the same size as our direct-mapped cache. As in the previous section, user references are isolated from the system-only cache, although they continue to generate TLB misses and subsequent system activity. Self-interference has the largest relative impact when MCPI is low, and the smallest relative impact when MCPI is high. A high MCPI implies that the cache is full, which is a situation that cannot be helped by increased associativity [...]

Figure [...]: For each workload/system pair this figure shows system self-interference effects, as indicated by miss rates from direct-mapped and two-way associative caches of the same size. Each bar is composed of two regions. The darker region represents misses eliminated by associativity (those due to self-interference). The lighter region represents misses that associativity does not eliminate. The number on the left end of the bar is MCPI for the system-only direct-mapped cache.
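The comparison between a direct-mapped cache and a two-way LRU set-associative cache of the same size can be sketched as below. This is an illustrative model under stated assumptions, not the simulator used in this study: the function names, the address stream, and the cache and line sizes in the example are hypothetical.

```python
from collections import OrderedDict


def count_misses(addresses, cache_bytes, line_bytes, ways):
    """Count misses in a set-associative LRU cache (ways=1 is direct-mapped)."""
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        cache_set = sets[line % num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict the least recently used line
            cache_set[line] = True
    return misses


def self_interference(addresses, cache_bytes, line_bytes):
    """Return (direct-mapped misses, two-way misses, misses removed by associativity)."""
    addresses = list(addresses)
    direct = count_misses(addresses, cache_bytes, line_bytes, ways=1)
    two_way = count_misses(addresses, cache_bytes, line_bytes, ways=2)
    return direct, two_way, direct - two_way
```

In a toy 64-byte cache with 16-byte lines, alternating references to addresses 0 and 64 miss every time when direct-mapped, but only twice with two-way associativity. The difference corresponds to the darker bar region described in the figure caption.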
Block operations
Operating systems perform block memory operations to transfer data between I/O devices and memory, and to copy data between address spaces. Table 4 [...] In terms of MCPI, Table 4-4 shows that block operations incur a larger absolute overhead for programs running on Mach than on Ultrix. Table 4-5 shows that Mach generally references more data than Ultrix in block operations, and that more of those references go through to memory. Block operations in Mach occur within the kernel as part of the VM and IPC systems, and within the UNIX server as part of the file system. In contrast, Ultrix block operations, which occur entirely within the kernel, are due mostly to VM and file system operations.

Table [...]: [...] (Figure 2-1). The table also shows the number of data reads from cacheable and uncacheable memory that are due to block operations, the number of those reads that go to memory resulting in a CPU read stall, and the percentage of overall CPU memory stalls due to block operations. Reads from uncacheable memory are due primarily to I/O operations and always go through to memory. All counts are in thousands.
Page mapping strategy
The system's virtual page mapping strategy can affect the performance of a physically indexed cache, as it determines the placement and overlap of virtual pages in the cache. As an example, the operating system can reduce self-interference misses for small applications by using a virtual-to-physical mapping that uniformly distributes consecutive virtual pages throughout the cache. For localities smaller than the cache size, such a strategy prevents collisions in the cache. This strategy also makes possible tools that rearrange the layout of text and data in memory to improve cache performance [27, 17] .
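The effect of a page mapping strategy on cache placement can be illustrated with a small model of cache "colors" (the number of page-sized regions in a physically indexed cache). The sketch below is an assumption-laden illustration, not the strategy of Table 2-1: the modulo assignment stands in for a deterministic strategy that spreads consecutive virtual pages across the cache, random frame selection stands in for a free-list strategy, and all names and sizes are hypothetical.

```python
import random
from collections import Counter


def num_colors(cache_bytes, page_bytes):
    """Number of distinct page-sized regions in a physically indexed cache."""
    return cache_bytes // page_bytes


def deterministic_mapping(virtual_pages, colors):
    """Assumed deterministic strategy: consecutive virtual pages get consecutive colors."""
    return {vp: vp % colors for vp in virtual_pages}


def random_mapping(virtual_pages, num_physical_pages, colors, seed=0):
    """Assumed random strategy: each virtual page gets a distinct random physical frame."""
    virtual_pages = list(virtual_pages)
    rng = random.Random(seed)
    frames = rng.sample(range(num_physical_pages), len(virtual_pages))
    return {vp: frame % colors for vp, frame in zip(virtual_pages, frames)}


def overlapping_pages(page_to_color):
    """Number of pages that share a cache color with at least one other page."""
    counts = Counter(page_to_color.values())
    return sum(n for n in counts.values() if n > 1)
```

Under these assumptions, a 64 KB cache with 4 KB pages has 16 colors. Sixteen consecutive virtual pages map without overlap, whereas pages 0, 16, and 32 all land on the same color, which is the kind of pathological placement a deterministic strategy can produce for strided data.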
In our discussion so far, we have simulated a deterministic strategy for both the Ultrix and Mach reference streams. As previously mentioned, Ultrix uses a deterministic strategy, while Mach's strategy is random (a virtual page is assigned to the next physical page on the free list). To isolate the effect of the page mapping strategy, we modified our simulator to use random mappings, and to maintain page tables so that page mappings do not change during a given run. [...] virtual-to-physical mapping induced by the deterministic strategy causes frequent collisions between corresponding matrix elements during computation.
In some cases, the deterministic strategy yields a page mapping with low user cache miss rates. Specific examples are sed and lisp under Ultrix, and egrep and liv for both systems. In these cases the deterministic strategy leads to good behavior, and the random strategy can perform significantly worse. Our results, though, suggest that such cases are infrequent in the absence of program reordering.
Overall, these observations confirm the seventh assertion: virtual to physical page mapping strategy can have significant impact on cache performance.
Moreover, a deterministic strategy can have a negative impact on performance for a direct-mapped cache when program reordering tools are not used. In such cases, a random strategy is less likely to induce consistently poor behavior.
Conclusions
For the majority of workloads we consider, the number and cost of non-idle instructions executed is substantially higher for Mach than for Ultrix. Six of the assertions about operating systems and memory system behavior are true, although two have little or no impact on system performance. One is false. Several are sensitive to the operating system architecture. Specifically:
• System and user locality. System locality is measurably worse than user locality, and the performance impact can be significant. The Mach microkernel-based system has poorer system locality than Ultrix.
• System instruction locality. Relative to user behavior, system text shows less locality than system data. However, user workloads such as gcc with large text can have instruction cache penalties that rival those of the operating system.
• User/system competition. User/system competition is a measurable component of cache and TLB miss rates. For these workloads, though, system performance is not affected by user/system competition. The impact of Mach's microkernel structure on competition is not significant.
• System self-interference. Self-interference accounts for a significant number of system misses, particularly in system text. However, the cases with the worst overall behavior are also those that benefit least from associativity.
[...] The page mapping strategy is independent of operating system architecture.
