Computer systems are rapidly changing. Over the next few years, we will see wide-scale deployment of dynamically-scheduled processors that can issue multiple instructions every clock cycle, execute instructions out of order, and overlap computation and cache misses. We also expect clock-rates to increase, caches to grow, and multiprocessors to replace uniprocessors. Using SimOS, a complete machine simulation environment, this paper explores the impact of the above architectural trends on operating system performance. We present results based on the execution of large and realistic workloads (program development, transaction processing, and engineering compute-server) running on the IRIX 5.3 operating system from Silicon Graphics Inc.
Introduction
Users of modern computer systems expect the operating system to manage system resources and provide useful services with minimal overhead. In reality, however, modern operating systems are large and complex programs with memory and CPU requirements that dwarf many of the application programs that run on them. Consequently, complaints from users and application developers about operating system overheads have become commonplace.
The operating system developer's response to these complaints has been an attempt to tune the system to reduce the overheads. The key to this task is to identify the performance problems and to direct the tuning effort to correct them; a modern operating system is far too large to aggressively optimize each component, and misplaced optimizations can increase the complexity of the system without improving end-user performance. The optimization task is further complicated by the fact that the underlying hardware is constantly changing. As a result, optimizations that make sense on today's machines may be ineffective on tomorrow's machines.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. SIGOPS '95 12/95 CO, USA © 1995 ACM 0-89791-715-4/95/0012 ...$3.50
In this paper we present a detailed characterization of a modern Unix operating system (Silicon Graphics IRIX 5.3), clearly identifying the areas that present key performance challenges. Our characterization has several unique aspects: (i) we present results based on the execution of large and realistic workloads (program development, transaction processing, and engineering compute-server), some with code and data segments larger than the operating system itself; (ii) we present results for multiple generations of computer systems, including machines that will likely become available two to three years from now; (iii) we present results for both uniprocessor and multiprocessor configurations, comparing their relative performance; and finally (iv) we present detailed performance data of specific operating system services (e.g. file I/O, process creation, page fault handling, etc.).
The technology used to gather these results is SimOS [11], a comprehensive machine and operating system simulation environment. SimOS simulates the hardware of modern uniprocessor and multiprocessor computer systems in enough detail to boot and run a commercial operating system. SimOS also contains features which enable non-intrusive yet highly detailed study of kernel execution. When running IRIX, SimOS supports application binaries that run on Silicon Graphics' machines. We exploit this capability to construct large, realistic workloads.
Focusing first on uniprocessor results, our data show that for both current and future systems the storage hierarchy (disk and memory system) is the key determinant of overall system performance. Given technology trends, we find that I/O is the first-order bottleneck for workloads such as program development and transaction processing. Consequently, any changes in the operating system which result in more efficient use of the I/O capacity would offer the most performance benefits.
After I/O, it is the memory system which has the most significant performance impact on the kernel. Contrary to expectations, we find that future memory systems will not be more of a bottleneck than they are today. Although memory speeds will not grow as rapidly as instruction-processing rates, the use of larger caches and dynamically-scheduled processors will compensate.
We find that extra memory stall corresponding to communication between the processors (coherency cache misses), combined with synchronization overheads, results in most multiprocessor operating system services consuming 30% to 70% more computational resources than their uniprocessor counterparts. Because larger caches do not reduce coherence misses, the performance gap between uniprocessor and multiprocessor performance will increase unless operating system developers focus on kernel restructuring to reduce unnecessary communication.
The rest of the paper is organized as follows. Section 2 presents our experimental environment, including SimOS, workloads, and data collection methodologies. Section 3 describes the current and future machine models used in this study. Sections 4 and 5 present the experimental results for the uniprocessor and multiprocessor models. Finally, Section 6 discusses related work and Section 7 concludes.
Experimental Environment
In this section, we present the SimOS environment, describe our data collection methodology, and present the workloads used throughout this study.
The SimOS Simulation Environment
SimOS [11] is a machine simulation environment that simulates the hardware of uniprocessor and multiprocessor computer systems in enough detail to boot, run, and study a commercial operating system. Specifically, SimOS provides simulators of CPUs, caches, memory systems, and a number of different I/O devices including SCSI disks, ethernet interfaces, and a console.
The version of SimOS used in this study models the hardware of machines from Silicon Graphics. As a result, we use Silicon Graphics' IRIX 5.3 operating system, an enhanced version of SVR4 Unix. This version of IRIX has been the subject of much performance tuning on uniprocessors and on multiprocessors with as many as 36 processors. Although the exact characterization that we provide is specific to IRIX 5.3, we believe that many of our observations are applicable to other well-tuned operating systems.
Although many machine simulation environments have been built and used to run complex workloads, there are a number of unique features in SimOS that make detailed workload and kernel studies possible:
Multiple CPU simulators.
In addition to configurable cache and memory system parameters typically found in simulation environments, SimOS supports a range of compatible CPU simulators. Each simulator has its own speed-detail trade-off. For this study, we use an extremely fast binary-to-binary translation simulator for booting the operating system, warming up the file caches, and positioning a workload for detailed study. This fast mode is capable of executing workloads less than 10 times slower than the underlying host machine. The study presented in this paper uses two more detailed CPU simulators that are orders of magnitude slower than the fastest one. Without the fastest simulator, positioning the workloads would have taken an inordinate amount of time. For example, booting and configuring the commercial database system took several tens of billions of instructions, which would have taken several months of simulation time on the slowest CPU simulator.
Checkpoints.
SimOS can save the entire state of its simulated hardware at any time during a simulation. This saved state, which includes the contents of all registers, main memory, and I/O devices, can then be restored at a later time. A single checkpoint can be restored to several different machine configurations, allowing the workload to be examined running on different cache and CPU parameters. Checkpoints allow us to start each workload at the point of interest without wasting time rebooting the operating system and positioning the applications.
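The checkpoint idea can be illustrated with a minimal sketch. It is not the SimOS implementation: the class names, the configuration fields, and the sample values are all hypothetical, and the sketch only assumes what the text states, namely that a checkpoint holds registers, main memory, and device state, and that one checkpoint can be restored under several machine configurations.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class MachineState:
    """Hardware state a checkpoint captures (hypothetical layout)."""
    registers: dict = field(default_factory=dict)
    memory: dict = field(default_factory=dict)    # address -> value
    devices: dict = field(default_factory=dict)   # device name -> device state


@dataclass
class MachineConfig:
    """Timing-model parameters that may differ between restores (hypothetical)."""
    cpu_model: str
    l2_cache_kbytes: int


def save_checkpoint(state):
    # Deep copy so that continuing the simulation cannot alter the saved image.
    return copy.deepcopy(state)


def restore_checkpoint(checkpoint, config):
    # Every restore gets a private copy of the state paired with its own config.
    return copy.deepcopy(checkpoint), config


state = MachineState(registers={"pc": 0x80001000},
                     memory={0x1000: 42},
                     devices={"disk0": {"pending_requests": 3}})
ckpt = save_checkpoint(state)
state.registers["pc"] = 0          # the live machine moves on; ckpt does not

run_a = restore_checkpoint(ckpt, MachineConfig("simple", 1024))
run_b = restore_checkpoint(ckpt, MachineConfig("dynamic", 4096))
```

Both runs start from the same workload position but under different CPU and cache parameters, which is the property the study relies on when comparing machine models.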
Annotations.
To better observe workload execution, SimOS supports a mechanism called annotations in which a user-specified routine is invoked whenever a particular event occurs. Most annotations are set like debugger breakpoints so they trigger when the workload execution reaches a specified program counter address. Annotations are non-intrusive. They do not affect workload execution or timing, but have access to the entire hardware state of the simulated machine.
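A toy model may make the annotation mechanism concrete. The sketch below is an assumption-laden illustration, not SimOS code: the class, its methods, and the addresses are invented, and the "simulator" merely walks a list of program counter values. What it preserves from the description above is that callbacks are keyed by program counter address, can read simulator state, and are not charged any simulated time.

```python
from collections import defaultdict


class AnnotatedSimulator:
    """Toy simulator that walks a program-counter trace and fires annotations."""

    def __init__(self):
        self.cycle = 0
        self.registers = {}
        self._annotations = defaultdict(list)   # pc -> list of callbacks

    def annotate(self, pc, callback):
        """Register a callback to run whenever execution reaches pc."""
        self._annotations[pc].append(callback)

    def run(self, pc_trace):
        for pc in pc_trace:
            self.registers["pc"] = pc
            # Annotations run outside simulated time: they may inspect the
            # simulator, but no cycles are charged for running them.
            for callback in self._annotations.get(pc, ()):
                callback(self)
            self.cycle += 1   # one simulated cycle per instruction


IDLE_LOOP_ENTRY = 0x80002000   # hypothetical kernel address
idle_entries = []

sim = AnnotatedSimulator()
sim.annotate(IDLE_LOOP_ENTRY, lambda s: idle_entries.append(s.cycle))
sim.run([0x80001000, IDLE_LOOP_ENTRY, 0x80001004, IDLE_LOOP_ENTRY])
```

After the run, `idle_entries` holds the simulated cycle at which each annotated address was reached, and the cycle count is the same as it would be with no annotations installed.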
Data Collection
Because SimOS simulates all the hardware of the system, a variety of hardware-related statistics can be kept accurately and non-intrusively. These statistics cover instruction execution, cache misses, memory stall, interrupts, and exceptions. The simulator is also aware of the current execution mode of the processors and the current program counter. However, this does not provide information on important aspects of the operating system such as the current process id or the service currently being executed.
To further track operating system execution, we implement a set of state machines (one per processor and one per process) and one pushdown automaton per processor to keep track of interrupts. These automata are driven by a total of 67 annotations. For example, annotations set at the beginning and end of the kernel idle loop separate idle time from kernel execution time. Annotations in the context switch, process creation, and process exit code keep track of the current running process. Since they have access to all registers and memory of the machine, they can non-intrusively determine the current running process id and its name. Additional annotations are set in the page fault routines, interrupt handlers, disk driver, and at all hardware exceptions. These are used to attribute kernel execution time to the service performed. Annotations at the entry and exit points of the routines that acquire and release spin locks determine the synchronization time for the system, and for each individual spin lock.
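The attribution scheme described above can be sketched with a single stack per processor. This is a simplification made for illustration: the paper uses separate state machines plus a pushdown automaton for interrupts, whereas the hypothetical `ProcessorTracker` below folds both into one stack, and the activity names and cycle values are invented.

```python
from collections import defaultdict


class ProcessorTracker:
    """Attributes cycles on one processor to the activity currently in progress."""

    def __init__(self):
        self.time_by_activity = defaultdict(int)
        self._stack = ["user"]     # top of the stack is the current activity
        self._last_cycle = 0

    def _charge(self, cycle):
        # Charge the cycles elapsed since the last event to the current activity.
        self.time_by_activity[self._stack[-1]] += cycle - self._last_cycle
        self._last_cycle = cycle

    def enter(self, cycle, activity):
        """Called by an annotation at the entry of a service, handler, or idle loop."""
        self._charge(cycle)
        self._stack.append(activity)

    def leave(self, cycle):
        """Called by the matching exit annotation; resumes the interrupted activity."""
        self._charge(cycle)
        self._stack.pop()


tracker = ProcessorTracker()
tracker.enter(100, "page_fault")
tracker.enter(130, "disk_interrupt")   # interrupt nested inside the page fault
tracker.leave(150)                     # back in the page fault handler
tracker.leave(200)                     # back in user code
tracker.enter(250, "idle")
tracker.leave(400)
```

The stack is what lets time spent in a nested interrupt be charged to the interrupt rather than to the service it interrupted, while the interrupted service resumes accumulating time afterwards.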
Additionally, we maintain a state machine per line of memory to track cache misses. These state machines allow us to report the types of cache misses (e.g. cold, capacity, invalidation, etc.) and whether the miss was due to interference between the kernel and user applications. We also track cache misses and stall time by the program counter generating the misses and by the virtual address of the misses. This allows us to categorize memory stall both by the routine and the data structure that caused it.
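A per-line state machine of the kind described above might look like the following sketch. The class and method names are hypothetical, and the classification is deliberately coarser than the paper's: it lumps capacity and conflict misses together as "replacement" misses, since separating them would need extra machinery that the text does not describe.

```python
class MissClassifier:
    """Tracks, for each memory line, why the line is currently not cached."""

    def __init__(self):
        # line address -> (state, mode that displaced the line, or None)
        self._lines = {}

    def on_replace(self, addr, by_mode):
        """The line was evicted to make room for a line fetched in by_mode."""
        self._lines[addr] = ("replaced", by_mode)

    def on_invalidate(self, addr):
        """The line was invalidated, for example by another processor's write."""
        self._lines[addr] = ("invalidated", None)

    def on_miss(self, addr, mode):
        """Classify a miss taken in mode ('kernel' or 'user'), then cache the line."""
        state, displaced_by = self._lines.get(addr, ("never_cached", None))
        if state == "cached":
            raise ValueError("a cached line cannot miss")
        kind = {"never_cached": "cold",
                "replaced": "replacement",
                "invalidated": "invalidation"}[state]
        # Interference: the line was pushed out by the other execution mode.
        interference = state == "replaced" and displaced_by != mode
        self._lines[addr] = ("cached", None)
        return kind, interference


classifier = MissClassifier()
first = classifier.on_miss(0x40, "user")      # first touch of the line
classifier.on_replace(0x40, "kernel")         # kernel fetch evicts it
second = classifier.on_miss(0x40, "user")     # user misses on it again
classifier.on_invalidate(0x40)
third = classifier.on_miss(0x40, "user")
```

Here `second` is flagged as kernel/user interference because the user's line was displaced by a kernel fetch.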
Simulator Validation
One concern that needs to be addressed by any simulation-based study is the validity of the simulator. For an environment such as SimOS, we must address two potential sources of error. First, we must ensure that when moving the workloads into the simulation environment we do not change their execution behavior. Additionally, we must ensure that the timings and reported statistics are correct. Establishing that SimOS correctly executes the workload is fairly straightforward.
First, the code running on the real machine and on SimOS is basically identical. The few differences between the IRIX kernel and its SimOS port are mostly due to the I/O device drivers that communicate with SimOS' timer chip, SCSI bus, and ethernet interface. This code is not performance critical and tends to be different on each generation of computer anyway. All user-level code is unmodified. Because SimOS simulates the entire machine, it is difficult to imagine these complex workloads completing correctly without performing the same execution as on the real machine. As further validation of correct execution, we compare workloads running on a Silicon Graphics POWER Series multiprocessor and a similarly configured SimOS. At the level of the system call and other traps recorded by IRIX, the counts were nearly identical, and the differences are easily accounted for.
A second potential source of error is in the environment's timing and the statistics collection. This kind of error is more difficult to detect since it is likely the workload will continue to run correctly. To validate the timings and statistic reporting, we configure SimOS to look like the one-cluster DASH multiprocessor used in a previous operating system characterization study [2] and examine the cache and profile statistics of a parallel compilation workload. Statistics in [2] were obtained with a bus monitor, and are presented in Table 2.
Each workload has a uniprocessor and an eight-CPU multiprocessor configuration. For each workload, we first boot the operating system and then log onto the simulated machine. Because operating systems frequently have significant internal state that accumulates over time, running the workloads directly after booting would expose numerous transient effects that do not occur in operating systems under standard conditions. To avoid these transient effects, we ensure in our experiments that kernel-resident data structures, such as the file cache and file system name translation cache, are warmed up and in a state typical of normal operation. We accomplish this either by running the entire workload once and then taking our measurements on the second run, or by starting our measurements once the workload has run long enough to initialize the kernel data structures on its own.
Program Development Workload.
A common use of today's machines is as a platform for program development. This type of workload typically includes many small, short-lived processes that rely significantly on operating system services. We use a variant of the compile phase of the Modified Andrew Benchmark [10]. The Modified Andrew Benchmark uses the gcc compiler to compile 17 files with an average length of 427 lines each. Our variant reduces the final serial portion of the make to a single invocation of the archival maintainer (we removed another invocation of ar as well as the cleanup phase where object files are deleted).
For the uniprocessor case, we use a parallel make utility configured to allow at most two compilation processes to run at any given time. For the eight-CPU multiprocessor case, we launch four parallel makes, and each allows up to four concurrent compilations. Each make performs the same task as the uniprocessor version, and on average, we still maintain two processes per processor. To reduce the I/O bottleneck on the /tmp directory, we assign separate temporary directories (each on a separate disk device) to each make.
Database Workload.
As our second workload, we examine the performance impact of a Sybase SQL Server (version 10 for SGI IRIX) supporting a transaction processing workload. This workload is a bank/customer transaction suite modeled after the TPC-B transaction processing benchmark [4]. The database consists of 63 Mbytes of data and 570 Kbytes of indexes. The data and the transaction logs are stored on separate disk devices. This workload makes heavy use of the operating system, specifically inter-process communication.
In the uniprocessor version of this workload, we launch 20 client processes that request a total of 1000 transactions from a single server. For the multiprocessor workload, we increase the number of server engines to 6 and drive these with 60 clients requesting a total of 1000 transactions. The database log is kept on a separate disk from the database itself. The multiprocessor database is striped across 4 disks to improve throughput.
Engineering Workload.
The final workload we use represents an engineering development environment. Our workload combines instances of a large memory system simulation (we simulate the memory system of the Stanford FLASH machine [7] using the FlashLite simulator) along with Verilog simulation runs (we simulate the Verilog of the FLASH MAGIC chip using the Chronologic VCS simulator). These applications are not operating system intensive because they do few system calls and require few disk accesses, but their large text segments and working sets stress the virtual memory system of the machine. This workload is extremely stable, and so we examine just over four seconds of execution.
The uniprocessor version runs one copy of FlashLite and one copy of the VCS simulator. The multiprocessor version runs six copies of each simulator.
Architectural Models
One of the primary advantages of running an operating system on top of a machine simulator is that it is possible to examine the effects of hardware changes. In this paper we use the capabilities of SimOS to model several different hardware platforms. This section describes three different configurations which correspond to processor chips that first shipped in 1994, and chips that are likely to ship in 1996 and 1998. Additionally, we describe the parameters used in our multiprocessor investigations.
Common Machine Parameters.
While we vary several machine parameters, there are others that remain constant. All simulated machines contain 128 Mbytes of main memory, support multiple disks, and have a single console device. The timing of the disk device is modeled using a validated simulator of the HP 97560 disk¹ [6]. Data from the disk is transferred to memory using cache-coherent DMA. No input is given to the console and the ethernet controller during the measurement runs. The CPU models support the MIPS-2 instruction set. The memory management and trap architecture of the CPU models are that of the MIPS R3000. Memory management is handled by a software-reload TLB configured with 64 fully-associative entries and a 4 kilobyte page size.
¹ We found that the performance of the database workload was completely I/O bound using the standard disk model incorporated into SimOS. Given that these disks do not represent the latest technology, we scale them to be four times faster in the database workload.
Table 3.1: 1994, 1996, and 1998 machine model parameters. The peak performance is achieved in the absence of memory or pipeline stalls. The timings are the latency of the miss as observed by the processor. All 2-way set associative caches use an LRU replacement policy.
1994 Model
We base the 1994 model on the Indigo line of workstations from Silicon Graphics which contain the MIPS R4400 processor. The R4400 uses a fairly simple pipeline model that is capable of executing most instructions in a single clock cycle. It has a two level cache hierarchy with separate level-1 instruction and data caches on chip, and an off-chip unified level-2 cache. The MIPS R4400 has blocking caches. When a cache miss occurs the processor stalls until the miss is satisfied by the second level cache or memory system.
To model the R4400, we use a simple simulator which executes all instructions in a single cycle. Cache misses in this simulator stall the CPU for the duration of the cache miss. Cache size, organization, and miss penalties were chosen based on the SGI workstation parameters.¹
¹ The R4400 level-1 caches are direct mapped, but the newer R4600 has two-way set associative level-1 caches. We conservatively choose to model two-way set associativity in our level-1 caches.
1996 Model
... it is also possible to speculatively execute past branches whose outcome is not yet known. Finally, non-blocking caches allow multiple loads and stores that miss in the cache to be serviced by the memory system simultaneously. Non-blocking caches, coupled with dynamic scheduling, allow the execution of any available instructions while cache misses are satisfied. This ability to hide cache miss latency is potentially a large performance win for programs with poor memory system locality, a characteristic frequently attributed to operating system kernels.
We model these next-generation processors using the MXS CPU simulator [1]. We configure the MXS pipeline and caches to model the MIPS R10000, the successor to the MIPS R4400 due out in early 1996.
The MXS simulator models a processor built out of decoupled fetch, execution, and graduation units. The fetch unit retrieves up to 4 instructions per cycle from the instruction cache into a buffer called the instruction window. To avoid waiting for conditional branches to be executed, the fetch unit implements a branch prediction algorithm that allows it to fetch through up to 4 unresolved conditional branches and register indirect jumps. As the fetch unit is filling the instruction window, the execution unit is scanning it looking for instructions that are ready to execute. The execution unit can begin the execution of up to 4 instructions per cycle. Once the instruction execution has completed, the graduation unit removes the finished instruction from the instruction window and makes the instruction's changes permanent (i.e., they are committed to the register file or to the cache). The graduation unit graduates up to 4 instructions per cycle. To support precise exceptions, instructions are always graduated in the order in which they were fetched.
Both the level-1 and level-2 caches are non-blocking and support up to four outstanding misses. The level-1 caches support up to two cache accesses per cycle even with misses outstanding.
With cache miss stalls being overlapped with instruction execution and other stalls, it is difficult to precisely define a memory stall. When the graduation unit cannot graduate its full load of four instructions, we record the wasted cycles as stall time. We further decompose this stall time based on the state of the graduation unit. If the graduation unit cannot proceed due to a load or store instruction that missed in the data cache, we record this as data cache stall. If the entire instruction window is empty and the fetch unit is stalled on an instruction cache miss, we record an instruction cache stall. Finally, any other condition is attributed to pipeline stall because it is normally caused by pipeline dependencies.
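The stall decomposition above can be written as a small decision procedure. The sketch below is one possible reading, not the MXS code: the function and parameter names are invented, and charging a partially filled cycle as a fraction of a cycle (unused graduation slots divided by the graduation width) is an assumption about how "wasted cycles" are counted.

```python
def classify_cycle(graduated, blocked_on_dcache_miss, window_empty,
                   fetch_stalled_on_icache, width=4):
    """Return (category, stall_cycles) for one simulated cycle.

    graduated is the number of instructions graduated this cycle; the boolean
    flags describe the state of the graduation and fetch units.
    """
    if not 0 <= graduated <= width:
        raise ValueError("graduated must be between 0 and width")
    wasted = (width - graduated) / width   # assumption: fractional accounting
    if wasted == 0:
        return "none", 0.0
    if blocked_on_dcache_miss:
        return "data_cache", wasted
    if window_empty and fetch_stalled_on_icache:
        return "instruction_cache", wasted
    return "pipeline", wasted              # everything else: dependencies etc.


full = classify_cycle(4, False, False, False)
dstall = classify_cycle(0, True, False, False)
istall = classify_cycle(0, False, True, True)
partial = classify_cycle(2, False, False, False)
```

The order of the tests mirrors the order of the rules in the text: data cache stalls are checked first, instruction cache stalls require an empty window, and the pipeline category is the fall-through.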
Although MXS models the latencies of the R10000 instructions, it has some performance advantages over the real R10000. Its internal queues and tables are slightly more aggressive than the R10000. The reorder buffer can hold 64 instructions, the load/store queue can hold 32 instructions, and the branch prediction table has 1024 entries. Furthermore, it does not contain any of the execution-unit restrictions that are present in most of the next-generation processors. For example, the R10000 has only one shifter functional unit, so it can execute only one shift instruction per cycle. MXS can execute any four instructions per cycle, including four shift instructions. We use this slightly more aggressive model in order to avoid the specifics of the R10000 implementation and provide results that are more generally applicable. Additional parameters of the 1996 model are presented in Table 3.1.
1998 Model
It is difficult to predict the architecture and speeds of the processors that will appear in 1998 since they haven't been announced yet. Processors like the MIPS R10000 have significantly increased the complexity of the design while holding the clock rate relatively constant. The next challenge appears to be increasing the clock rate without sacrificing advanced processor features [5]. We assume that a 1998 microprocessor will contain the latency-tolerating features of the 1996 model, but will run at a 500 MHz clock rate and contain larger caches. We also allow for small improvements in cache and memory system miss times. The exact machine parameters are again shown in Table 3.1.
Our multiprocessor studies are based on an 8-CPU system with uniform-memory-access-time shared memory. We use 1994 model processors; multiprocessor studies with the MXS simulator were prohibitively time consuming. Each CPU has its own caches. Cache coherency is maintained by a 3-state (invalid, shared, dirty) invalidation-based protocol. The cache access times and main memory latency are modeled to be the same as those in the 1994 model.
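For readers unfamiliar with 3-state invalidation-based coherence, the following sketch tracks one memory line across the private caches of an 8-CPU machine. It is a generic textbook-style model, not the simulated hardware: the class name is hypothetical, and details such as a dirty copy being downgraded to shared on a remote read are assumptions that the text does not spell out.

```python
INVALID, SHARED, DIRTY = "invalid", "shared", "dirty"


class CoherentLine:
    """State of one memory line in each of n_cpus private caches."""

    def __init__(self, n_cpus):
        self.state = [INVALID] * n_cpus
        self.invalidations = 0

    def read(self, cpu):
        """Perform a read; return True if it misses in this CPU's cache."""
        if self.state[cpu] != INVALID:
            return False                      # hit on a shared or dirty copy
        for other, s in enumerate(self.state):
            if s == DIRTY:
                self.state[other] = SHARED    # assumed: owner downgrades
        self.state[cpu] = SHARED
        return True

    def write(self, cpu):
        """Perform a write; return True if this cache held no valid copy."""
        missed = self.state[cpu] == INVALID
        for other in range(len(self.state)):
            if other != cpu and self.state[other] != INVALID:
                self.state[other] = INVALID   # invalidate every other copy
                self.invalidations += 1
        self.state[cpu] = DIRTY
        return missed


line = CoherentLine(n_cpus=8)
r0 = line.read(0)      # cold miss on CPU 0
r1 = line.read(1)      # cold miss on CPU 1; both copies are now shared
w0 = line.write(0)     # upgrade on CPU 0 invalidates CPU 1's copy
r1_again = line.read(1)  # coherence miss caused by communication
```

The last read is the kind of miss that a larger cache cannot remove: CPU 1 still has room for the line, but its copy was invalidated by CPU 0's write.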
Uniprocessor Results
The vast majority of machines on the market today are uniprocessors, and this is where we start our examination of operating system performance. In this section we present a detailed characterization of the workloads running on the three machine configurations.
In Section 4.1, we begin by describing the performance on the 1994 machine model. We then show in Section 4.2 how the 1996 and 1998 models improve performance. In Section 4.3 and Section 4.4, we show the specific impact of two architectural trends: latency hiding mechanisms and increases in cache size. Finally, in Section 4.5 we present a detailed examination of the relative call frequency, computation time, memory system behavior, and scaling of specific kernel services.
Table 4.2 describes the operating system and hardware event rates for the workloads. In Figure 4.1, we provide a time-based profile of the execution of the workloads.
Base Characterization
The program development workload makes heavy but erratic use of the kernel services, resulting in 16% of the non-idle execution time being spent in the kernel. The frequent creation and deletion of processes result in the large spikes of kernel activity found in the profile. The workload also generates a steady stream of disk I/Os, but contains enough concurrency to overlap most of the disk waits. As a result, the workload shows only a small amount of idle time.
The database workload makes heavy use of a number of kernel services. Inter-process communication occurs between the clients and the database server and between the database server and its asynchronous I/O processes. The result of this communication is both a high system call and context-switching rate. These effects, combined with a high TLB miss rate, result in the kernel occupying 38% of the non-idle execution time. The database workload also makes heavy use of the disks. Data is constantly read from the database's data disk and log entries are written to a separate disk. Although the server is very good at overlapping computation with the disk operations, the workload is nevertheless idle for 36% of the execution time.
The engineering workload uses very few system services. Only the process scheduler and TLB miss handler are heavily used, and the kernel accounts for just 5% of the total workload execution time. The comb-like profile is due to the workload switching between the VCS and FlashLite processes, each of which has very different memory system behavior. Also visible in Figure 4.1 is the large amount of memory stall time present in all of the workloads. Memory stall time is particularly prevalent in the database and engineering workloads, the two workloads that consist of large applications.
Impact of Next-Generation Processors
In this section we examine the effect of future architectures on the three workloads. Figure 4.3 shows the normalized execution time of the workloads as the machine model is changed from the 1994 to the 1996 and 1998 models. The speedups for the 1998 model range from a fairly impressive factor of 8 for the engineering workload to a modest 27% for the database. The primary cause of the poor speedup is delays introduced by disk accesses. This is the classic I/O bottleneck problem and can be seen in the large increases in idle time for the workloads with significant disk I/O rates. For the database system, the fraction of the execution time spent in the idle loop increases from 36% of the workload on the 1994 model to 75% of the time in the 1996 model and over 90% of the 1998 model. The program development workload also suffers from this problem, with the 1998 model spending over 66% of the time waiting in the idle loop for disk requests to complete.
The implications of this I/O bottleneck on the operating system are different for the database and program development workloads. In the database workload, almost all of the disk accesses are made by the database server using the Unix "raw" disk device interface. This interface bypasses the file system, allowing the data server to directly launch disk read and write requests. Given this usage, there is little that the operating system can do to reduce the I/O time. Possible solutions include striping the data across multiple disks or switching to RAIDs and other higher performance disk subsystems.
In contrast, the kernel is directly responsible for the I/O-incurred idle time present in the program development workload. Like many other Unix file systems, the IRIX extent-based file system uses synchronous writes to update file system meta-data structures whenever files are created or deleted. The frequent creation and deletion of compiler temporary files results in most of the disk traffic being writes to the meta-data associated with the temporary file directory. Almost half of the workload's disk requests are writes to the single disk sector containing the /tmp meta-data! There have been a number of proposed and implemented solutions to the meta-data update problems. These solutions range from special-casing the /tmp directory and making it a memory-based file system to adding write-ahead logging to file systems [12].¹
First, the overall performance gains of the future machine models appear to apply equally to both user and kernel code. This implies that the relative importance of kernel execution time will likely remain the same on next-generation machines. While this means that the kernel time will remain significant, it is certainly preferable to increased kernel overhead. Second, the fraction of execution time spent in memory stalls does not increase on the significantly faster CPUs. This is a surprising result given the increase in peak processor performance. Figure 4.4 shows the memory stall time on future machines expressed as memory stall cycles per instruction (MCPI). We see that next-generation machines have a significantly smaller amount of memory stall time than the 1994 model. This is indeed fortunate since the 1996 and 1998 models can execute up to 4 instructions per cycle, making them much more sensitive to large stall times. If the 1996 and 1998 models had the 1994 model's MCPI, they would spend 80% to 90% of their time stalled.
Figure 4.4 also decomposes the MCPI into instruction and data cache stalls and into level-1 and level-2 cache stalls. Although instruction cache stalls account for a large portion of the kernel stall time on the 1994 model, the effect is less prominent on the 1996 and 1998 models. For the program development workload, the instruction cache stall time is reduced from 45% of the kernel stall time in the 1994 model to only 11% of the kernel stall time in the 1998 model.
Figure 4.4 emphasizes the different memory system behavior of the workloads. The relatively small processes that comprise the program development workload easily fit into the caches of future ... when compared to the kernel's memory system behavior. In contrast, the engineering and database workloads consist of very large programs which continue to suffer from significant memory stall time. In fact, their memory system behavior is quite comparable to that of the kernel. Other studies have concluded that the memory system behavior of the kernel was worse than that of application programs [3]. We find that this is true for smaller programs, but does not hold for large applications. The implication is that processor improvements targeted at large applications will likely benefit kernel performance as well.
¹ SGI's new file system, XFS, contains write-ahead logging of meta-data. Unfortunately, XFS was not available for this performance study.
² To ensure that this omission does not compromise accuracy, we examined the program development and database workloads with disks that were 100 times faster. We found little difference in the non-idle memory system behavior.
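The remark that a 4-issue machine with the 1994 model's MCPI would spend 80% to 90% of its time stalled can be checked with a back-of-the-envelope calculation. The sketch below assumes that stall cycles never overlap with execution and that the processor otherwise runs at its peak rate. The MCPI values 1.0 and 2.25 are hypothetical, chosen only because they reproduce the quoted range; the measured values are those of Figure 4.4.

```python
def stalled_fraction(mcpi, issue_width):
    """Fraction of time spent stalled if stalls never overlap with execution."""
    busy_cpi = 1.0 / issue_width       # peak rate: issue_width instructions/cycle
    return mcpi / (busy_cpi + mcpi)


for mcpi in (1.0, 2.25):               # hypothetical MCPI values
    print(f"MCPI {mcpi}: {stalled_fraction(mcpi, 4):.0%} of time stalled")

# The same MCPI on a single-issue machine costs a much smaller share of time.
single_issue = stalled_fraction(1.0, 1)
```

With a peak of 0.25 cycles per instruction, an MCPI of 1.0 already means four of every five cycles are stall cycles, whereas on a single-issue processor the same MCPI accounts for only half of the time.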
The improved memory system behavior of the 1996 and 1998 models is due to two features: latency tolerance and larger caches, In Section 4.3 and Section 4.4 we examine separately the benefits of these features.
Latency Tolerance
Dynamically scheduled processors can hide portions of cache miss latencies. Figure 4.5 presents the amount of memory stall time observed in the 1996 and 1998 models and compares it with the memory stall time of the comparable statically-scheduled models. The numbers for this figure were computed by running the 1994 model configured with the same caches and memory system as the next-generation models and comparing the amount of memory stall seen by the processor.
The figure emphasizes two results. First, dynamically scheduled processors are more effective at hiding the shorter latencies of level-1 misses than those of level-2 misses. Dynamically scheduled processors hide approximately half of the latency of kernel level-1 misses. The notable exception is the engineering workload, which spends most of its limited kernel time in the UTLB miss handler. We discuss the special behavior of this routine in Section 4.5.
Unfortunately, level-2 caches do not benefit from latency hiding as much as level-1 caches. The longer latency of a level-2 miss makes it more difficult for dynamic scheduling to overlap significant portions of the stall time with execution. Level-2 miss costs are equivalent to the cost of executing hundreds of instructions. There is simply not enough instruction window capacity to hold the number of instructions needed to overlap this cost. Although it is possible to overlap level-2 stalls with other memory system stalls, we did not observe multiple outstanding level-2 misses frequently enough to significantly reduce the stall time.
A second and somewhat surprising result from Figure 4.5 is that the future processor models are particularly effective at hiding the latency of instruction cache misses. This is non-intuitive because when the instruction fetch unit of the processor stalls on a cache miss, it can no longer feed the execution unit with instructions. The effectiveness is due to the decoupling of the fetch unit from the execution unit. The execution unit can continue executing the instructions already in the window while the fetch unit is stalled on an instruction cache miss. Frequent data cache misses cause both the executing instruction and dependent instructions to stall, and give the instruction unit time to get ahead of the execution unit. Thus, next-generation processors overlap instruction cache miss latency with the latency of data cache misses, while statically-scheduled processors must suffer these misses serially.
Larger Cache Sizes
Future processors will not only have latency tolerating features, but will also have room for larger caches. The sizes of the caches are controlled by a number of factors including semiconductor technology as well as target cycle time. We first examine sizes likely to be found in on-chip level-1 caches and then explore the sizes likely to be found in off-chip level-2 caches. Figure 4.6 presents the average number of cache misses per instruction for each of the workloads. We explore a range of sizes that could appear in level-1 caches of future processors. We model separate data and instruction caches.
Level-1 Cache
One of the key questions is whether increasing cache sizes will reduce memory stall time to the point where operating system developers do not need to worry about it. The miss rates in Figure 4.6 translate into different amounts of memory stalls on different processor models. For example, the maximum cost of a level-1 cache miss which hits in the level-2 cache on the 1998 model is 60 instructions. A miss rate of just 0.4% on both instruction and data level-1 caches means that the processor could spend half as much time stalled as executing instructions. This can be seen in the memory stall time on the 1998 model.
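As a rough check of this arithmetic, the sketch below treats the miss rates as misses per instruction (the metric of Figure 4.6) and charges every miss the maximum 60-instruction cost quoted above. The function name and the assumption that every miss pays the full cost are ours, introduced only for illustration.

```python
def stall_to_execution_ratio(i_misses_per_instr, d_misses_per_instr, miss_cost):
    """Return memory stall time divided by instruction execution time.

    miss_cost is expressed in instruction-execution times, so the result
    is a ratio of stalled time to useful execution time (assumed model).
    """
    return (i_misses_per_instr + d_misses_per_instr) * miss_cost


# 0.4% misses per instruction on each level-1 cache, 60 instructions per miss.
ratio = stall_to_execution_ratio(0.004, 0.004, 60)
print(f"stall time / execution time: {ratio:.2f}")
print(f"fraction of total time stalled: {ratio / (1 + ratio):.2f}")
```

Under these assumptions the ratio comes out just under one half, which is consistent with the statement that the processor could spend half as much time stalled as executing.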
Since larger caches will not avoid all misses, we next classify them into 5 categories based on the cause of a line's replacement. Cold misses occur on the first reference to a cache line. KERN-self misses occur when the kernel knocks its own lines out of the cache and
self-inflicted misses in both the user applications (USER-self) and the kernel (KERN-self) than they are at reducing interference misses (USER-other and KERN-other). This is most striking in the database workload, where USER-self and KERN-self misses steadily decline with larger caches while the USER-other and KERN-other misses remain relatively constant.
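The classification by cause of replacement can be illustrated with a small direct-mapped cache model. This is a minimal sketch, not the SimOS implementation: the trace format, the function name, and the rule that a miss counts as "self" when the requester was also the one that displaced the line are assumptions made for the example.

```python
from collections import Counter


def classify_misses(refs, num_lines=4, line_size=32):
    """Classify misses in a direct-mapped cache by who displaced the line.

    refs is an iterable of (owner, address) pairs, where owner is "kernel"
    or a user process name (hypothetical trace format).
    """
    cache = [None] * num_lines   # per index: (line_number, owner) or None
    evicted_by = {}              # line_number -> owner that last displaced it
    counts = Counter()
    for owner, addr in refs:
        line = addr // line_size
        index = line % num_lines
        resident = cache[index]
        if resident is not None and resident[0] == line:
            continue             # cache hit, nothing to classify
        mode = "KERN" if owner == "kernel" else "USER"
        if line not in evicted_by:
            counts["cold"] += 1              # first reference to this line
        elif evicted_by[line] == owner:
            counts[mode + "-self"] += 1      # requester displaced its own line
        else:
            counts[mode + "-other"] += 1     # someone else displaced the line
        if resident is not None:
            evicted_by[resident[0]] = owner  # remember who caused the eviction
        cache[index] = (line, owner)
    return counts


# All three lines below map to the same cache index and therefore conflict.
trace = [("kernel", 0), ("kernel", 128), ("kernel", 0),
         ("user1", 256), ("kernel", 0), ("user1", 256), ("user1", 256)]
print(dict(classify_misses(trace)))
```

In this toy trace the kernel first evicts its own line (a KERN-self miss), then the kernel and the user process displace each other's lines, producing one KERN-other and one USER-other miss.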
Apart from techniques such as virtual memory page coloring, the operating system has no control over USER-self cache misses. Any improvements will necessarily have to come from improved memory systems. However, operating system designers can address KERN-self cache misses. For example, recent work has shown how code re-organization can reduce these instruction misses [14].
Reducing the KERN-other and the USER-other misses is more problematic. Most of the USER-other misses are due to interference between the user and kernel rather than interference between two user processes. In fact, these two miss types are quite complementary. When the kernel knocks a user line from the cache, the user often returns the favor, knocking a kernel line out of the cache. Although re-organizing the kernel to reduce its cache footprint could decrease the amount of interference, the problem will remain. As long as two large code segments are trying to share the same level-1 instruction cache, there will be potential for conflicts.
We also explored the impact of cache associativity by looking at both direct-mapped and 4-way associative caches. Like cache size increases, higher associativities reduce self-induced misses significantly more than interference misses. We hypothesize that the entire state in the relatively small level-1 caches is quickly replaced after each transition between kernel and user mode. As long as this is the case, associativity will not significantly reduce interference misses.
Level-2 Caches
Figure 4.7 presents the miss rates for cache sizes likely to be found in off-chip, level-2 caches. These caches typically contain both instructions and data, have larger line sizes, and incur significantly higher miss costs than on-chip caches. For example, the latency of
Similarly to level-1 caches, larger level-2 caches reduce misses substantially, but still do not totally eliminate them. For the program development and database workloads, a 4MB level-2 cache eliminates most capacity misses. The remaining misses are cold misses.¹
Characterization of Operating System Services
In previous sections we have looked at kernel behavior at a coarse level, focusing on average memory stall time. In order to identify the specific kernel services responsible for this behavior, we use SimOS annotations to decompose the kernel time into the services that the operating system provides to user processes. Table 4.8 decomposes operating system execution time into the most significant operating system services.
One common characteristic of these services is that the execution time breakdown does not change significantly when moving from the 1994 to the 1996 and 1998 models. This is encouraging since optimizations intended to speed up specific services today will likely be applicable on future systems. We now examine separately the detailed operating system behavior of the three workloads.
Program development workload. On the 1994 model, this workload spends about 50% of its kernel time performing services related to the virtual memory system calls, 30% in file system related services, and most of the remaining time in process management services. Memory stall time is concentrated in routines that access large blocks of memory. These include DEMAND-ZERO and copy-on-write (COW) fault processing as well as the read and write system calls that transfer data between the application's address space and the kernel's file cache.
The larger caches of the 1998 model remove most of the level-2 capacity misses of the workload. The remaining stall time is due mostly to level-1 cache misses and to level-2 cold misses.
1. Most of the cold misses are really misclassified capacity misses. The reason that they are misclassified is that the initial accesses to the memory preceded the detailed simulation start. Had the detailed examination started at boot time, these cold misses would have been classified as capacity misses.
The most significant services of the uniprocessor workloads are presented in order of their fraction of kernel computation time on the 1994 machine model. Lower-case services denote UNIX system calls, and upper-case services denote traps and interrupts (INT). UTLB reloads the TLB for user addresses. A DBL_FAULT occurs when the UTLB takes a TLB miss. A QUICK FAULT is a page fault where the page is already in memory, and PFAULT denotes protection violation traps used for modify bit emulation and other purposes. We compare the memory system behavior of the 1994 and 1998 models. The cycles per instruction (CPI) as well as the instruction and data memory CPI (i-MCPI, d-MCPI) are indicators of processor performance. The concentration of instruction and data memory stall (%stall (I) and %stall (D)) is expressed as a percentage of all kernel memory stalls. Latency numbers can be compared to computation time to determine the importance of I/O and scheduling on the latency of various services.
The DEMAND-ZERO page fault generates more than 50% of all cold misses. This service amounts to 18% of kernel execution time in the 1994 model and 25% of kernel execution time in the 1998 model. In general, the services with large concentrations of cold misses, including the copy-on-write fault and the write system call, increase in importance. Fortunately, the operating system can address this problem. Many of the cold misses are due to the virtual memory page allocation policy, which does not re-use recently freed pages that would potentially still reside in the level-2 cache. The workload touches a total of 33 megabytes of memory. Given the large number of process creations and deletions, it is likely that this memory footprint could be reduced by modifying the page allocator to return recently freed pages (the current allocation policy delays page reuse to deal with instruction cache coherency issues).
In contrast, the latency tolerating features work well on the block copy/zero routines. Non-blocking caches allow the processor to overlap cache miss stall with the block movement instructions. Additionally, they permit multiple outstanding cache misses, allowing stall times to overlap. These features effectively hide all level-1 block-copy stall time that is present in the 1994 model. This effect can be seen in the relatively large speedups obtained by the read and write system calls, and the large level-1 data cache stall time hidden in the workload (see Prog L1-D in Figure 4.5).
The operating system can still potentially improve the performance of level-2 cache behavior in the block copy/zero routines. The current routines fill the load/store buffer of the processor before they are able to generate multiple outstanding level-2 cache misses. By re-coding the functions to force cache misses out earlier, we can overlap the stall time of multiple misses, substantially reducing the total level-2 cache stall time. Re-coding this type of routine to take advantage of next-generation processors can reduce the performance impact of the block-copy routines.
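The benefit of forcing misses out earlier can be sketched with a deliberately simple timing model. Everything here is an assumption made for illustration: the function name, the idea that overlapped misses complete in fully parallel groups, and the example numbers (128 misses, 100-cycle latency), none of which are measurements from this study.

```python
import math


def block_copy_stall_cycles(num_misses, miss_latency, max_outstanding):
    """Estimate level-2 stall cycles for a block copy (idealized model).

    Misses are issued in groups of max_outstanding; each group is assumed
    to overlap perfectly and so costs a single miss latency.
    """
    if max_outstanding < 1:
        raise ValueError("max_outstanding must be at least 1")
    groups = math.ceil(num_misses / max_outstanding)
    return groups * miss_latency


# Hypothetical numbers: 128 level-2 misses, each costing 100 cycles.
serialized = block_copy_stall_cycles(128, 100, max_outstanding=1)
overlapped = block_copy_stall_cycles(128, 100, max_outstanding=4)
print(serialized, overlapped)
```

In this model, a routine whose structure keeps only one level-2 miss in flight pays the full latency for every miss, while one that exposes four misses at a time pays roughly a quarter of that. Real processors will fall somewhere between these idealized extremes.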
Database workload. This workload spends over half of its kernel time in routines supporting inter-process communication between the client and the database server and about one third of its time performing disk I/Os for the database server. Additionally, the 1994 model spends 13% of the kernel time servicing UTLB faults. Unlike the program development workload, the memory stall of this workload is evenly spread among the kernel services. Kernel instruction cache performance is particularly bad in the 1994 model, with instruction MCPIs of over 2.0 for several of the major services.
Most services encounter impressive speedups on the 1996 and 1998 models. The improvement in the cache hierarchy dramatically reduces the instruction and data MCPI of the main services. Unfortunately, one of the major services only shows a moderate speedup: the UTLB miss handler is only 1.4x (1996) and 3.8x (1998) faster than the 1994 model. Because of this lack of speedup, the time spent in the UTLB handler increases to a quarter of the kernel time (10% of the non-idle execution time) in the 1998 model. The UTLB handler is a highly-tuned sequence of 8 instructions that are dependent on each other. They do not benefit from the multiple issue capabilities of the 1996 and 1998 models. Performance improvements will need to come from a reduction of the TLB miss rate. This can be achieved by improved TLB hardware or through the use of larger or variable-sized pages.
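The growth of the UTLB handler's share follows from simple proportional reasoning, sketched below. The function name is ours, and the result is only a back-of-the-envelope figure derived from the rounded percentages quoted above (13% in 1994, about a quarter in 1998, 3.8x handler speedup), not a number reported by the study.

```python
def implied_speedup_of_rest(old_share, new_share, service_speedup):
    """Infer how much faster the rest of the kernel must have become.

    old_share: fraction of kernel time spent in the service before.
    new_share: fraction of kernel time spent in the service after.
    service_speedup: speedup of the service itself.
    Kernel time before is normalized to 1.0.
    """
    new_service_time = old_share / service_speedup
    new_total_time = new_service_time / new_share
    new_rest_time = new_total_time - new_service_time
    return (1.0 - old_share) / new_rest_time


speedup = implied_speedup_of_rest(old_share=0.13, new_share=0.25,
                                  service_speedup=3.8)
print(f"implied speedup of the remaining kernel services: {speedup:.1f}x")
```

With these rounded inputs, the remaining kernel services would have to run a little over 8 times faster on the 1998 model for the handler's share to roughly double, which illustrates why a service that cannot exploit multiple issue becomes more prominent.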
Engineering workload. This workload makes relatively few direct requests of the operating system, and the kernel time is dominated by the UTLB handler and by clock interrupt processing.
Latency effects. We have discussed the impact of architectural trends on the computation time of specific services. We now focus on their latency. Table 4.8 also contains the request rate and the average request latency of the services. The average latency is an interesting metric as user processes are blocked during the processing of that service. For most services, the computation time is equivalent to the latency. However, the computation time is only a small fraction of the latency of some services. These services are either involved in I/O, blocked waiting for an event, or rescheduled.
Services such as the open, close, and unlink system calls of the program development workload and the read and write system calls of the database workload frequently result in disk I/Os. These system calls are limited by the speed of the disks on all processor models, resulting in long delays for the calling process and very little, if any, speedup. Only changes in the file system will reduce these latencies. System calls which block on certain events, such as select, also experience longer latencies.
The long latency of the fork system call results from the child process getting scheduled before the forking parent is allowed to continue.
Multiprocessor Effects
The move to small-scale shared-memory multiprocessors appears to be another architectural trend, with all major computer vendors developing and offering such products. To evaluate the effects of this trend, we compare the behavior of the kernel on a uniprocessor to its behavior on an 8-CPU multiprocessor.
Both configurations designed to run efficiently on the SGI Challenge series of multiprocessors, which supports up to 36 R4400 processors.
Base characterization
We begin our investigation with a high-level characterization of the multiprocessor workloads. Table 5.2 presents hardware and software event rates for the workloads (aggregated over all processors), and Figure 5.1 presents the execution breakdown over time. While the uniprocessor and multiprocessor workloads have different compositions (also see Figure 4.1 and Table 4.2), the workloads are scaled to represent a realistic load of the same application mix running on the two configurations. The workloads drive the operating system in similar ways, and thus provide a reasonable basis for performance comparisons.
Compared to Figure 4.1, there is an increase in idle time. This idle time is due to load imbalance towards the end of the program development and engineering workloads, and due to I/O bottlenecks for the database workload. Two of the three workloads show an increase in the relative importance of kernel time. The kernel component increases due to worse memory system behavior and synchronization overheads. The portion of non-idle execution time spent in the kernel in the program development workload rises from 16% to 24%, and in the engineering workload it rises from 4.6% to 5.5%.
The database workload interestingly shows the opposite behavior. Although the fraction of time spent in the kernel decreases from 38.2% to 24.9% on the multiprocessor, this reduction is not due to improved kernel behavior. Rather, the multiprocessor version of the database server is a parallel application and requires more computation per transaction.
Multiprocessor Overheads
There are a number of overheads found in multiprocessor systems that are not present in uniprocessors. This section examines two of these overheads: synchronization and additional memory stalls.
Synchronization. The multiprocessor IRIX kernel uses spinlocks to synchronize access to shared data structures. Overheads include the time to grab and release locks, as well as the time spent waiting for contended locks. Spinlocks are not used in the uniprocessor version of the kernel.
A coherence miss occurs because the cache line was invalidated from the requestor's cache by a write from another processor.¹ An upgrade stall occurs when a processor writes to a cache line for which it does not have exclusive ownership. The upgrade requires communication to notify other processors to invalidate their copy of the cache line.
Figure 5.3 compares the uniprocessor and multiprocessor kernel miss rates for a range of level-2 cache sizes. In contrast to uniprocessors, larger caches do not reduce the miss rate as dramatically. The reason is simple: coherence misses do not decrease with increasing cache size. Coherence misses correspond to communication between processors and are oblivious to changes in cache size.
The implications of this observation are quite serious. In uniprocessors, larger caches significantly reduce the miss rates, allowing large performance gains in the 1996 and 1998 models. In multiprocessors, larger caches do not reduce the miss-rate as effectively, and we will see a much higher stall time in future machines.
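The rough estimate that this section goes on to make can be sketched numerically. The sketch assumes that stall time scales linearly with the miss rate and that a multiprocessor miss costs the same as the uniprocessor miss behind the 0.1% reference point; both assumptions, like the function name, are ours. The inputs are the figures quoted in the text for the 1998 processor model and for Figure 5.3.

```python
def stalled_fraction(misses_per_instr, ref_miss_rate=0.001, ref_stall_ratio=0.5):
    """Estimate the fraction of kernel time spent stalled on level-2 misses.

    ref_miss_rate and ref_stall_ratio encode the reference point: a 0.1%
    miss rate stalls the processor for half as long as it executes.
    """
    # Cost of one miss, measured in instruction-execution times.
    miss_cost = ref_stall_ratio / ref_miss_rate
    stall_ratio = misses_per_instr * miss_cost  # stall time / execution time
    return stall_ratio / (1.0 + stall_ratio)


# At least 0.8 misses per 100 kernel instructions on the multiprocessor.
fraction = stalled_fraction(0.8 / 100)
print(f"estimated fraction of kernel time stalled: {fraction:.0%}")
```

Under these assumptions the stall time would be about four times the execution time, so the kernel would spend roughly four fifths of its time stalled. This is a lower-bound style estimate, since the text gives 0.8 misses per 100 instructions as a minimum.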
Although we did not simulate a multiprocessor machine with the next-generation CPUs, it is possible to make rough estimates regarding the magnitude of the problem. As mentioned in Section 4.4.2, a level-2 cache miss rate of just 0.1% stalls the 1998 processor for half as much time as it spends executing instructions. Figure 5.3 shows that for the program development and database workloads we will have at least 0.8 misses per 100 kernel instructions. Thus, although the memory stall times for the 1994 multiprocessor do not look too bad, the stall times for future machines will be much worse. In the next section we examine specific operating system services and suggest possible improvements for
1. Cache-coherent DMA causes coherence misses in the uniprocessor workloads, but they constitute a very small fraction of the total misses.
adjacently. This line alone accounts for 18% of all coherence misses in the kernel. As the relative cost of coherence misses increases, programmers and compilers will have to pay much more attention to this type of data layout problem.
Table 5.4 also compares the latency of the services on both platforms. Unlike the comparison of computation time, which always reports a slowdown, some services actually have a shorter latency on multiprocessors. The fork system calls return in half the uniprocessor time because of the presence of alternate CPUs to run both the forking parent and the child concurrently. On a uniprocessor, the parent gets preempted by the newly created child process. System calls that perform I/O such as open, close, and unlink also show speedups of 15% to 25% over the uniprocessor run. This is not due to a reduction in I/O latency but again due to the increased probability that a CPU is available when a disk I/O finishes. More specifically, the IRIX scheduler does not preempt the currently running process to reschedule the process for which an I/O finishes, and this causes the uniprocessor latency to be longer than simply the disk I/O latency. This scheduling policy also increases the latency of functions that synchronize with blocking locks. This can be seen in the 32-fold slowdown of the PFAULT exception.
Database workload. The general trends for the database workload look similar to those in the program development workload. The fraction of computation time taken by key system calls remains the same across uniprocessor and multiprocessor implementations. However, several aspects are unique to this workload. The database workload heavily utilizes inter-process communication, which is implemented differently by the uniprocessor and multiprocessor kernels. The uniprocessor kernel implements the socket send system call by setting a software interrupt (SW INT in Table 4.8) to handle the reception of the message. The multiprocessor version hands off the same processing to a kernel daemon process (rtnetd). The advantage of this latter approach is that the daemon process can be scheduled on another idle processor. As Table 5.4 shows, this reduces the latency of a send system call by 59% on the multiprocessor version.
Another significant difference is the increased importance of the END_IDLE state, which takes 0.6% of kernel time on the uniprocessor but 5.0% of the time on the multiprocessor. This state captures the time spent between the end of the idle loop and the resumption of the process in its normal context. Two factors explain this difference. First, in the multiprocessor, all idle processors detect the addition of a process to the global run queue, but only one ends up running it. The rest (approximately one quarter of the processors in this workload) return to the idle loop, having spent time in the END_IDLE state. Second, a process that gets rescheduled on a different processor than the one it last ran on must pull several data structures to the cache of its new host processor before starting to run. This explains the large amount of communication measured during this transition, which amounts to 45% (coherence plus upgrade time) of the execution time for END_IDLE.
The frequent rescheduling of processes on different processors increases the coherence traffic. Three data structures closely associated with processes (the process table, user areas, and kernel stacks) are responsible for 33% of the kernel's coherence misses.
To hide part of the coherence miss latency, the operating system could prefetch all or part of these data structures when a process is rescheduled on another processor. The operating system may also benefit by using affinity scheduling to limit the movement of processes between processors.
between these studies and ours is the methodology used to observe system behavior. Previous studies were based on the analysis of traces either using hardware monitors [2][8][13] or through software instrumentation [3]. To these traditional methodologies, we add the use of complete machine simulation for operating system characterization. We believe that our approach has several advantages over the previous techniques.
First, SimOS has an inherent advantage over trace-based simulation since it can accurately model the effects of hardware changes on the system. The interaction of an operating system with the interrupt timer and other devices makes its execution timing sensitive. Changes in hardware configurations impact the timing of the workload and result in a different execution path. However, these changes are not captured by trace-based simulations as they are limited to the ordering of events recorded when the trace was generated.
When compared to studies that use hardware trace generation to capture operating system events of interest, SimOS provides better visibility into the system being studied. For example, operating system studies using the DASH bus monitor [2][13] observe only level-2 cache misses and hence are blind to performance effects due to the level-1 caches, write buffer, and processor pipeline. Furthermore, because the caches filter memory references seen by the monitor, only a limited set of cache configurations can be examined. SimOS simulates all of the hardware and no events are hidden from the simulator. SimOS can model any hardware platform that would successfully execute the workloads.
Studies often use software instrumentation to annotate a workload's code and to improve the visibility of hardware monitors. Unfortunately, this instrumentation is intrusive. For example, Chen [3] had to deal with both time and memory dilations. Although this was feasible for a uniprocessor memory system behavior study, it becomes significantly more difficult on multiprocessors. The SimOS annotation mechanism allows non-intrusive system observation at levels of detail not previously obtainable.
[13], who used hardware-based traces, we were able to examine large applications and confirm their results. Using SimOS, however, we were able to examine the operating system in more detail and explore the effects of technology trends.
Concluding Remarks
We have examined the impact of architectural trends on operating system performance. These trends include transition from simple single-instruction-issue processors to multiple-instruction-issue 
