In this paper we present Solemn, a new user-level simulation 
Introduction
Execution-driven simulation is an increasingly important tool in architecture performance analysis. Execution-driven simulators can be split into two main types: user-level simulators and complete machine simulators.
Complete machine simulators model the entire computer system, including the operating system kernel and devices.
User-level simulators model the execution of a single user process, and emulate any system calls made on the target (simulated) computer, possibly using equivalent system calls available on the host computer.
As a result of their detail, complete machine simulators can be quite accurate, but are extremely difficult to implement correctly. User-level simulators are excellent for investigating the behaviour of compute-bound algorithms, but are not as good at modelling IO-dominant programs, nor can they model the effects of cache and TLB pollution caused by operating system activity and other processes.
This paper presents a user-level simulator built on top of a complete machine simulator. This approach yields advantages over both traditional user-level simulation and complete machine simulation.
Sparc Sulima [1] is a complete machine simulator for the SPARC V9 architecture, in particular for the UltraSPARC [8] family of processors from Sun Microsystems. Solemn is a new user-level simulation mode for Sparc Sulima, allowing it to simulate unmodified Solaris executables (including dynamically linked executables).
Solemn emulates some of the work a real kernel has to do, such as file-related system calls. In particular it has a memory management subsystem, allowing the simulated program to use mmap, and in the future, threads.
The novel part of Solemn is that it emulates true virtual memory: different virtual addresses at different stages of the simulated program's execution could be mapped to the same physical address, and in some stages of the execution a virtual address may not be mapped at all.
All user-level instructions and most exceptions are simulated by Sparc Sulima. Solemn intercepts the trap instructions a program uses to communicate with the operating system kernel. The effect of the trap instruction (usually a system call) is emulated by Solemn before returning control back to Sparc Sulima.
Some of the reasons for extending Sparc Sulima to support the emulation of the Solaris ABI are:
• Sparc Sulima currently does not boot a full operating system, due to some device support issues. Booting a full OS in a complete machine simulator without vendor support is extremely difficult; only SimOS [6] and SimICS [3] have published success in this. Another emulation mode will help in testing more thoroughly the components of the Sparc Sulima system without requiring full OS boot.
• We can support a broader range of programs than tools like RSIM [5] , since we can support more system calls (such as mmap) and we can simulate unmodified and dynamically linked programs.
• We can observe more interesting effects than those obtained in a traditional user-level simulator like RSIM such as paging, since the basis is a complete machine simulator. This means, for example, we can examine the effect of running a program in a system with limited RAM.
• We can simulate the effects of small changes to the architecture quite easily. Within a complete machine simulator, such changes would require changes to the operating system, which would be difficult (if the source is available) if not impossible. With an almost entirely user-level simulator like Solemn such changes would be simple. Such changes, if required, would occur in a small assembler nucleus.
This paper is organised as follows: §2 provides background, including related work. The overall structure of Solemn is presented in §3, with detailed descriptions of significant parts therein. §4 details the system call handlers while §5 discusses the development of Solemn and its current status. Finally, conclusions and future work are in §6.
Background
This section discusses some related work, and provides background about the Sparc Sulima complete machine simulator and the Solaris operating system and ABI.
Related work
RSIM [5] is a user-level execution-driven SPARC V8 simulator. It has detailed CPU modelling (including pipelines and branch prediction) and detailed modelling of the memory system. SMP simulation can be used on programs using a restricted threads library, provided their source code is available.
RSIM can only emulate a quite restrictive set of system calls, and the program to be simulated must be statically linked with RSIM's own C library.
Shade [2] is an address trace generator / user-level simulator for unmodified SPARC binaries (32 or 64-bit, statically or dynamically linked). There is nothing in the literature explaining how it supports dynamically linked executables. It uses advanced techniques including binary translation to speed up its simulation.
Shade relies on the same host platform as the program it is simulating to simplify its system call emulation. Solemn uses some of the same techniques used by Shade, such as file descriptor wrapping. A feature of Solemn is that it does not require the same host platform for system call emulation.
SimICS [3] is a commercial complete machine simulator for various architectures, including SPARC V9. SimICS had a Solaris emulation mode for 32-bit binaries, but it is no longer maintained.
SimOS [6] is a complete machine simulator developed at Stanford, and originally simulated the MIPS architecture, but has since been extended by others to simulate other architectures including Alpha and PowerPC. We are not aware of any user-level simulator that uses the SimOS platform.
Sparc Sulima
Sparc Sulima models the UltraSPARC CPU and memory system "as is", using an object-oriented design implemented in C++. This modular approach, with modules corresponding closely to the components of the real system, aids in the readability and understandability of the simulator source.
Sparc Sulima explicitly models the UltraSPARC CPU, along with its MMU and caches, as well as a shared bus, with attached devices including RAM and ROM.
When simulating a multiple CPU system, Sparc Sulima gives each simulated CPU a time-slice of execution (usually something like 50 simulated cycles).
Each CPU interprets each instruction to be executed in a standard fetch-decode-execute cycle. After fetching and decoding the next instruction, the simulator evaluates the instruction.
For more details about the implementation of Sparc Sulima, see [1] .
The Solaris operating system architecture
Sun's Solaris operating environment is an implementation of UNIX. Solaris is available on other platforms such as Intel x86 and Itanium, but we are not interested in those, so where we refer to Solaris it is presumed to be SPARC Solaris. We are also only interested in more recent versions of Solaris (as far back as Solaris 2.6) and only on SPARC V9 based machines (e.g., UltraSPARC I, II and III).
For more detail on the SPARC ABI and Solaris internals see [7, 4, 10].
Each segment of a process's virtual address space is a memory map of a file. That file may be part of a program file (in the case of the text segment) or an anonymous file created as required by the kernel (in the case of the heap or stack segments).
A memory map has some associated permissions: readable, writable, and executable. Memory maps can be created by the programmer using the system call mmap.
The typical virtual address spaces of 32 and 64-bit processes in Solaris are shown in Figure 1 . The stack segment is automatically grown as required. The heap segment is explicitly grown using the sbrk or brk system calls.
System calls.
A UNIX program communicates with the kernel via system calls. There are various levels of viewing system calls. At the top is the function level which is the level viewed by the program when it requires a particular service. The function is implemented in a library which translates the system call into an implementation-defined message to the kernel for the service. To save confusion, we will use the term system call only when referring to the implementation-defined message to and from the kernel; we will use the term C library function to refer to the system call wrapper.
In SPARC Solaris, the system call message is transmitted via the ta (trap-always) instruction. The ta instruction generates an immediate exception which causes a transfer of control to a trap-table in the kernel address space. The kernel can then perform the system call request. It eventually transfers control back to the program via the done instruction (which signifies a return from an exception and skips the trapping instruction).
The ta instruction includes a trap number, which is a 7-bit number (0 to 127). About 30 trap numbers are recognised by the Solaris kernel.
Some trap numbers are used for special or fast requests. Non system-call traps include such things as user breakpoints and flush/clean register window requests.
Trap numbers 0, 8 and 64 are system call trap numbers (respectively used in SunOS 4.x, Solaris 32-bit and Solaris 64-bit executables). The value of the %g1 register is used to determine which system call is being called.
The registers %o0 to %o5 contain the parameters to the system call, like a normal function call in the SPARC ABI [7]. After the system call returns, %o0 and possibly %o1 contain the return value(s). The kernel communicates errors by setting the carry bit in the condition code register and returning the error number in %o0.
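The calling convention above can be sketched in a few lines. This is a minimal illustration, not Solemn's implementation: the Regs class, the handle_trap function and the handler table are hypothetical names, and the system call numbers used with it are placeholders rather than real Solaris numbers.

```python
from dataclasses import dataclass, field

# Trap numbers from the text: 0 (SunOS 4.x), 8 (Solaris 32-bit), 64 (Solaris 64-bit).
SYSCALL_TRAPS = {0: "SunOS 4.x", 8: "Solaris 32-bit", 64: "Solaris 64-bit"}

@dataclass
class Regs:
    """Hypothetical register file holding only what the convention uses."""
    g1: int = 0                                         # system call number
    o: list = field(default_factory=lambda: [0] * 6)    # %o0 to %o5
    carry: bool = False                                 # carry bit of the condition codes

def handle_trap(trap_number, regs, handlers):
    """Decode a ta trap and, if it is a system call, run its handler.

    handlers maps a system call number to a function that takes the
    argument list and returns ("ok", value) or ("err", error_number).
    Returns True if the trap was a system call trap.
    """
    if not 0 <= trap_number <= 127:
        raise ValueError("trap number must fit in 7 bits")
    if trap_number not in SYSCALL_TRAPS:
        return False                       # some other kind of trap
    status, value = handlers[regs.g1](list(regs.o))
    regs.carry = (status == "err")         # carry set signals an error
    regs.o[0] = value                      # return value or error number
    return True
```

A caller would then test the carry flag after the trap returns to decide whether %o0 holds a result or an error number.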
Solaris 9 defines 231 system calls. Most programs only use a small subset of the system calls, so we do not need to emulate them all. We can implement most system calls on demand as we find a test program that requires an unimplemented system call.
The structure of Solemn
The structure of Solemn is shown in Figure 2. The left half of the figure is an example of a standard Sparc Sulima SMP simulation, with 3 CPUs. The right half of the figure consists entirely of Solemn-specific modules; the only interaction the standard simulation has with Solemn is via the ExternalHandler interface, which Solemn inherits (not shown in the figure). This allows Solemn to intercept Tcc calls (see §4).
The ROM attached to the Bus in the figure is a special Solemn-specific nucleus. This is described in §3.1. Files are managed by the Files object; more detail on this is in §3.2.
The Solemn memory manager maintains all the information necessary for handling mapped memory. More detail on the memory manager is in §3.3, including details on the management of free space, virtual address mappings and page replacement.
The ELF loader is responsible for loading a particular file into simulated virtual memory. Solemn uses up to two instances of an ElfLoader to load an executable, and optionally the dynamic linker. This process is described in more detail in §4.4.
Type and structure conversions are described in more detail in §4.1.
When an exception occurs during system call emulation, Solemn uses a re-entry buffer to maintain state. This process is described in §4.2. 
The nucleus
The Solemn nucleus is a special ROM attached to the bus. It is located at virtual address = physical address = RSTVaddr, an UltraSPARC-specific address for the location of the RED-state trap table.
The user trap table consists of trap entries for exceptions that occur when the trap level is zero (i.e., user code is executing). Most of the exception handlers are filled with code that simply halts the simulation, since they should never occur. Some handlers are required; for example:
• Clean-window: clears all locals and outs.
• Spill (and fill): do the usual stores to (loads from) the stack. These are specialised depending on whether the user-level program is 32 or 64-bit.
• IMMU and DMMU miss: this extracts the faulting address from the MMU and calls the Solemn page-miss trap with %g1 = VA (§4.3.3). If it returns, %g1 is now the physical address. A new entry is added to the TLB, and the faulting instruction is retried.
The nucleus trap table entries are called when an exception occurs during trap processing (either user-level or nucleus-level), unless the trap level gets too high, at which point the processor enters RED state and uses a trap base address of RSTVaddr.
The only nucleus traps that can occur in Solemn are DMMU miss and protection faults. These can occur during window spill or fill trap processing, since the part of the stack used may not have been accessed at user level yet (or recently, if a TLB entry was replaced). These handlers simply branch to the corresponding user-level handlers.
The files manager
The Solemn files manager is a wrapper around the host standard IO mechanism. It includes its own set of file descriptors which is distinct from the host file descriptors. Internally each file descriptor is mapped to a host file descriptor.
It also provides an interface to all the standard IO calls, such as open, close, read, write, etc, where these calls operate on a Files file descriptor.
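A minimal sketch of this descriptor wrapping follows, assuming a table that hands out its own descriptor numbers and forwards each call to the host. The class layout is illustrative only; in particular, this sketch numbers descriptors with a simple counter, whereas a real UNIX kernel returns the lowest unused descriptor.

```python
import os

class Files:
    """Hypothetical descriptor table kept separate from the host's."""

    def __init__(self):
        self._host_fd = {}     # simulated descriptor -> host descriptor
        self._next_fd = 0      # next simulated descriptor to hand out

    def open(self, path, flags, mode=0o600):
        host_fd = os.open(path, flags, mode)
        sim_fd = self._next_fd
        self._next_fd += 1
        self._host_fd[sim_fd] = host_fd
        return sim_fd

    def read(self, sim_fd, count):
        return os.read(self._host_fd[sim_fd], count)

    def write(self, sim_fd, data):
        return os.write(self._host_fd[sim_fd], data)

    def close(self, sim_fd):
        # Forget the simulated descriptor and release the host one.
        os.close(self._host_fd.pop(sim_fd))
```

Keeping the two descriptor spaces separate means the simulated program never sees descriptors that the simulator itself has open on the host.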
The memory manager
The Solemn memory manager is responsible for dealing with all memory-related traps and system calls, such as page misses (trap) and mmap (system call). It contains structures that maintain the unused virtual address space and all current virtual address mappings. It also contains other information needed, such as the location and size of the stack and heap, and where unfixed mmaps should start searching for space.
The free space manager.
The free space manager is essentially a set of virtual address (VA) intervals. It must maintain this set such that it contains every virtual address range that is neither mapped nor made unavailable by some ABI requirement (e.g., from zero to the base address of the executable is usually not available for mapping unless specifically asked for) or platform requirement (the 64-bit VA hole in the UltraSPARC I and II).
The free space manager also supports searching for free space of a given size and alignment. This is for use with an unfixed mmap. This search can start from any given point in the address space, and works down, returning the top-most interval available.
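One way to picture such an interval set is the sketch below. It is an assumption-laden illustration rather than Solemn's data structure: the FreeSpace class, its method names and the half-open [start, end) representation are all hypothetical.

```python
class FreeSpace:
    """Hypothetical free-space set: sorted, disjoint [start, end) intervals."""

    def __init__(self, intervals=()):
        self.intervals = []
        for start, end in intervals:
            self.add(start, end)

    def add(self, start, end):
        """Return a range to the free set, merging with its neighbours."""
        merged = []
        for s, e in self.intervals:
            if e < start or s > end:       # disjoint and not touching
                merged.append((s, e))
            else:                          # overlapping or touching: absorb
                start, end = min(s, start), max(e, end)
        merged.append((start, end))
        self.intervals = sorted(merged)

    def remove(self, start, end):
        """Take a range out of the free set, splitting intervals as needed."""
        kept = []
        for s, e in self.intervals:
            if e <= start or s >= end:     # untouched by the removal
                kept.append((s, e))
                continue
            if s < start:
                kept.append((s, start))    # part below the removed range
            if e > end:
                kept.append((end, e))      # part above the removed range
        self.intervals = kept

    def find_down(self, size, align, search_from):
        """Highest aligned base below search_from with size free bytes."""
        for s, e in reversed(self.intervals):
            top = min(e, search_from)
            base = (top - size) // align * align   # round down to alignment
            if base >= s:
                return base
        return None
```

Searching from the top down keeps unfixed mappings away from the heap, which grows upwards from the executable.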
The virtual address mappings manager.
The virtual address mappings manager provides an interface to the current virtual address mappings, as well as the current physical address mapping (if any) each page has. It also manages physical page replacement.
For the purposes of Solemn the RAM and virtual address space is divided up into fixed-size pages (8KB) which can be individually accessed via a page-index (the address divided by 8KB = 8192).
The virtual address mappings are implemented by an associative array from virtual address interval to a structure that contains such information as the memory protections that this virtual address mapping is allowed, a pointer to the host memory for this virtual address mapping (allocated via host mmap), and page indices into the simulated RAM for each subpage if currently mapped to RAM.
As an example, this is used for a page miss to look up whether a given virtual address has a physical address currently assigned. If it does then that address is returned, otherwise a new physical address is assigned (using the page-replacement policy).
There is also a reverse lookup mechanism which is a mapping from RAM physical pages back to the virtual mapping structure. This is implemented as a vector (indexed by physical page index) to a structure containing: a pointer to the simulated RAM that this page refers to; a reference to the virtual address mapping that has a sub-page mapped to this physical page; and a dirty bit indicating whether the page has been written to or not. This is used for a page fault, to look up whether a given physical address has a virtual mapping and if that virtual mapping is writable.
Finally, a page-replacement object manages the allocation of physical pages. Currently there is only one page-replacement policy: least-recently-used.
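The forward lookup, reverse lookup and least-recently-used replacement can be sketched together as follows. This is a simplified model under stated assumptions: the PageTable class and its fields are hypothetical, protections and dirty bits are omitted, and an evicted page is only recorded rather than written back.

```python
from collections import OrderedDict

PAGE_SIZE = 8192  # 8KB pages, as in the text

def page_index(address):
    """Page index of an address: the address divided by the page size."""
    return address // PAGE_SIZE

class PageTable:
    """Hypothetical forward and reverse page lookup with LRU replacement."""

    def __init__(self, num_physical_pages):
        self.free = list(range(num_physical_pages))
        self.virt_to_phys = OrderedDict()  # virtual page -> physical page, LRU first
        self.phys_to_virt = {}             # reverse lookup: physical -> virtual page
        self.evicted = []                  # virtual pages swapped out, for inspection

    def lookup(self, vaddr):
        """Return the physical address for vaddr, assigning a page on a miss."""
        vpage = page_index(vaddr)
        if vpage in self.virt_to_phys:
            self.virt_to_phys.move_to_end(vpage)       # mark as recently used
        else:
            if self.free:
                ppage = self.free.pop(0)
            else:                                      # evict the least recently used
                victim, ppage = self.virt_to_phys.popitem(last=False)
                self.evicted.append(victim)
            self.virt_to_phys[vpage] = ppage
            self.phys_to_virt[ppage] = vpage
        return self.virt_to_phys[vpage] * PAGE_SIZE + vaddr % PAGE_SIZE
```

With two physical pages, a third distinct virtual page forces out whichever of the first two was looked up least recently.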
Traps and system calls
Solemn intercepts the evaluation of Tcc calls, and decides what action to take depending on the trap number, and, if the trap is a system call, the system call number.
Solemn recognises and handles a few non system-call traps that are routinely called. It also handles Solaris 32 and 64-bit system calls (both are handled using the same interface).
Some trap numbers are used by the Solemn-specific nucleus to communicate with Solemn. These are from 0x60 onwards and some of these are described in §4.3.3 and §4.4.
Within Solemn, many system calls are simply passed on to the host system, possibly translating some values before calling the host system call, and converting results back to the simulation.
This section describes the structure and value conversions to deal with emulating system calls, and then covers in some detail how some system calls must be specially handled.
Solaris types and values
Much of Solemn is, by design, devoted to emulating Solaris system calls. Perhaps the biggest problem with this is the set of types, structures and values expected by the kernel. In a real operating system, these are defined in system header files, but with Solemn we need to do it differently, since we would like Solemn to be able to be compiled and run on a non-Solaris (and possibly even non-SPARC) system. In addition, we want to transparently simulate 32 and 64-bit executables; many of the types and structures change size (e.g., size_t) and arrangement (e.g., struct stat) depending on architecture.
None of the types, structures and values expected by Solemn when compiled come from system header files. They are all defined in a set of interface classes (classes that have no state, just types): the Sol, Sol32 and Sol64 structures depicted in Figure 2 .
Sol defines the types that are common to both Solaris 32 and 64-bit architectures, mostly the *64_t types, but also other enumerated flags such as the open flags (e.g., O_RDONLY) or mmap flags (e.g., MAP_FIXED).
Sol32 and Sol64 inherit these types and enumerations, and also define the architecture-specific types, such as size_t and struct stat.
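The idea of stateless interface classes can be illustrated with a short sketch. The class names follow the text, but the attributes and helper functions are hypothetical; the only facts relied on are that SPARC is big-endian and that size_t is 4 bytes in the 32-bit ABI and 8 bytes in the 64-bit ABI.

```python
import struct

class Sol:
    """Definitions shared by both ABIs (attribute names are illustrative)."""
    BYTE_ORDER = ">"      # SPARC is big-endian
    INT64 = "q"           # fixed-width types are identical on both ABIs

class Sol32(Sol):
    SIZE_T = "I"          # 4-byte unsigned
    POINTER = "I"

class Sol64(Sol):
    SIZE_T = "Q"          # 8-byte unsigned
    POINTER = "Q"

def pack_size_t(abi, value):
    """Encode a host integer as a target size_t for the given ABI."""
    return struct.pack(abi.BYTE_ORDER + abi.SIZE_T, value)

def unpack_size_t(abi, data):
    """Decode a target size_t into a host integer."""
    return struct.unpack(abi.BYTE_ORDER + abi.SIZE_T, data)[0]
```

Because the layout is spelled out explicitly rather than taken from host headers, the same conversion code gives the same bytes on any host platform.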
Re-enterable system calls
Many system calls (e.g., open, ioctl, read, write) take buffers that point to data in the simulator, either to be read from (open, ioctl, write) or written to (ioctl, read). Solemn manages this reading from or writing to buffers by going through the calling CPU's MMU. It goes via the MMU since it cannot go directly to the RAM: not only is it possible that the data is in another CPU's cache (so cache-coherency is required) but it is also possible that the data is not in RAM at all.
Going via the MMU has its own complications: the MMU may not have a translation for the required page in its TLB (translation lookaside buffer). Hence the system call handler must detect the exception, stop system call emulation and return control back to the simulator generating the required exception. The exception will be processed, hopefully installing a TLB entry for the faulting address, and the faulting instruction (which is the system call) will be retried. The system call handler will then be called again.
The buffers in read and write system calls can be very large: these can be much bigger than the set of all pages that can be in the D-TLB at once. The system calls themselves have a side effect and so are not restartable. These facts mean that these system call handlers need to be re-enterable: i.e., able to return to some state achieved part-way through emulation of the system call.
For example, with the read system call, the entire requested read can be performed at once (and only once) on the host, into a host-allocated buffer. The re-enterable part occurs if an exception occurs during copy back into the simulator. If an exception occurs, the portion of the buffer that has not yet been copied back is stored within Solemn and the exception propagated back to the simulator. Upon return to the system call, Solemn recognises that it is re-entering the read system call and simply branches to the copy-back procedure. This can continually cause exceptions, but the buffer for the next re-entry gets shorter each time (ignoring page protection faults).
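The re-entry mechanism for read can be sketched as below. This is a toy model, not Solemn's code: Memory, TlbMiss and ReadHandler are hypothetical names, the TLB is modelled as a set of page numbers that never evicts, and the exception is handled by the test harness rather than by a nucleus.

```python
PAGE_SIZE = 8192

class TlbMiss(Exception):
    """Raised when a simulated address has no TLB translation."""

class Memory:
    """Hypothetical simulated memory reached through a TLB."""

    def __init__(self):
        self.tlb = set()      # virtual page numbers with a translation
        self.data = {}        # virtual address -> byte value

    def store(self, vaddr, chunk):
        """Store bytes lying within one page, or raise TlbMiss."""
        if vaddr // PAGE_SIZE not in self.tlb:
            raise TlbMiss(vaddr)
        for i, byte in enumerate(chunk):
            self.data[vaddr + i] = byte

class ReadHandler:
    """Re-enterable read: the host read happens once, copy-back may resume."""

    def __init__(self):
        self.pending = None   # re-entry buffer: (next address, bytes left, total)

    def read(self, mem, host_read, vaddr, count):
        if self.pending is None:              # first entry: perform the side effect
            data = host_read(count)
            self.pending = (vaddr, data, len(data))
        dest, data, total = self.pending      # on re-entry, resume the copy-back
        while data:
            room = PAGE_SIZE - dest % PAGE_SIZE   # bytes left in this page
            try:
                mem.store(dest, data[:room])
            except TlbMiss:
                self.pending = (dest, data, total)   # remember what is left
                raise
            dest += min(room, len(data))
            data = data[room:]
        self.pending = None
        return total
```

Each miss leaves a shorter remainder in the re-entry buffer, so the call completes after at most one miss per page touched.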
Memory-related system calls
Memory related system calls such as brk, mmap and munmap and memory related system traps (e.g., page miss) are passed down to the memory-manager. Some of these are described here.
The mmap system call.
The mmap system call asks for a virtual address space that maps to (part of) a file.
The pages are currently implemented in Solemn at a fixed 8KB; variable page sizes are planned for a future Solemn revision.
An mmap system call is emulated by Solemn in a number of stages (we assume no errors here for simplicity). The virtual address range is chosen. A host mmap is performed on the required file of the required size; the resulting pointer is used as the backing storage for that simulated virtual address range. If the virtual address range contained previous mappings (this can only occur if the mmap request was FIXED) then the previous mappings are unmapped. The virtual address range is then removed from the free space manager. The virtual address range, protections, and backing store are added to the virtual address mappings. Finally, the virtual address is returned to the caller as a successful mmap.
Note that this does not allocate physical pages for the virtual address range. This is done on demand as page misses occur. This is called demand-paging and is quite a standard mechanism.
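The stages of mmap emulation can be sketched as follows. This is a deliberately small model under several assumptions: AddressSpace and its methods are hypothetical, a bytearray stands in for the pointer returned by the host mmap, free intervals are assumed to be page-aligned, and an overlapped earlier mapping is dropped whole rather than trimmed.

```python
PAGE_SIZE = 8192

class AddressSpace:
    """Hypothetical bookkeeping for simulated mmap."""

    def __init__(self, free_start, free_end):
        self.free = [(free_start, free_end)]   # free [start, end) intervals
        self.mappings = {}                     # (start, end) -> (prot, backing)

    def _take(self, start, end):
        """Remove [start, end) from the free intervals."""
        kept = []
        for s, e in self.free:
            if e <= start or s >= end:
                kept.append((s, e))
                continue
            if s < start:
                kept.append((s, start))
            if e > end:
                kept.append((end, e))
        self.free = kept

    def _choose(self, length):
        """Pick the top-most free range large enough (unfixed request)."""
        for s, e in reversed(self.free):
            if e - s >= length:
                return e - length
        raise MemoryError("no free virtual address range")

    def mmap(self, length, prot, backing, fixed_addr=None):
        length = -(-length // PAGE_SIZE) * PAGE_SIZE     # round up to whole pages
        start = fixed_addr if fixed_addr is not None else self._choose(length)
        end = start + length
        # A fixed request may overlap earlier mappings; these are unmapped first.
        for old in [m for m in self.mappings if m[0] < end and m[1] > start]:
            del self.mappings[old]
        self._take(start, end)                            # no longer free
        self.mappings[(start, end)] = (prot, backing)     # record the new mapping
        return start          # no physical pages are assigned at this point
```

Nothing here touches simulated RAM, which is what makes the scheme demand-paged: physical pages are attached only when a miss occurs.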
The munmap system call.
The munmap system call asks for all mappings within a virtual address range to be unmapped. Each mapping within that range goes through a number of stages:
1. Any pages mapped to physical RAM are swapped out:
(a) Firstly, for all CPUs, any TLB entries for that physical page, and cache lines referring to any parts of that physical page, are flushed and invalidated. This is done via a direct call from Solemn to a special hook in the system bus that implements this. In a real system this would probably involve interrupts (for processor-to-processor communication) and displacement loads (for invalidates).
(b) Then, if the page is dirty the page is written back to the backing store; in the simulation this is simply the location that was mmapped on the host machine.
2. The host backing store is unmapped using a host munmap.
3. The virtual address range is removed from the virtual address mappings.
Finally, the entire virtual address range is added to the free space manager.
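The stages above can be sketched as a single function. The model is intentionally reduced and every name in it is hypothetical: Cpu holds only a TLB (caches are not modelled), a bytearray stands in for the host backing store, only mappings lying wholly inside the requested range are handled, and the free list is a plain list without merging.

```python
PAGE_SIZE = 8192

class Cpu:
    """Hypothetical CPU holding only a TLB."""

    def __init__(self):
        self.tlb = {}          # virtual page -> physical page

    def flush_physical_page(self, ppage):
        """Drop every TLB entry that refers to the given physical page."""
        self.tlb = {v: p for v, p in self.tlb.items() if p != ppage}

class Mapping:
    """Hypothetical virtual address mapping with its resident pages."""

    def __init__(self, start, end, backing):
        self.start, self.end, self.backing = start, end, backing
        self.resident = {}     # virtual page -> (physical page, dirty)

def munmap(start, end, mappings, free, ram, cpus):
    """Unmap every mapping that lies wholly within [start, end)."""
    for m in [m for m in mappings if m.start >= start and m.end <= end]:
        for vpage, (ppage, dirty) in m.resident.items():
            for cpu in cpus:                      # 1(a): flush on every CPU
                cpu.flush_physical_page(ppage)
            if dirty:                             # 1(b): write back if dirty
                offset = vpage * PAGE_SIZE - m.start
                m.backing[offset:offset + PAGE_SIZE] = ram[ppage]
        m.resident.clear()
        m.backing = None                          # 2: stands in for host munmap
        mappings.remove(m)                        # 3: drop the mapping
    free.append((start, end))                     # finally: range is free again
```

Only dirty pages cost a copy; clean pages are simply forgotten, since the backing store already holds their contents.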
MMU-miss system trap.
A page miss means that the requested load or store instruction (or instruction fetch) refers to an address which is not mapped within the simulated CPU's TLB (translation lookaside buffer). This causes an MMU-miss exception. The Solemn nucleus has short exception handlers for these exceptions (§3.1.2) that mainly call this special Solemn-specific system trap to do most of the work. The MMU-miss system trap is given the following parameters: the faulting virtual address, and whether it is a data or instruction MMU miss.
There are three possible cases: the virtual address is invalid; or the virtual address is valid and has not been mapped to a physical address; or the virtual address is valid and has already been mapped to a physical address.
To simplify many system calls and to avoid using signals, invalid addresses cause the simulation to halt. Invalid addresses are rarely used in "valid" programs so this should have little practical effect. This also includes invalid addresses passed to system calls: instead of returning EFAULT, the simulation will halt. The invalid address case includes protection errors: an instruction miss to a non-executable page or a data miss to a non-read/write page.
If the virtual address is valid and the page is already mapped to a physical address, then the system simply returns with the physical address as the return value. The nucleus inserts the new translation table entry into the TLB and retries the faulting instruction.
If the virtual address is valid and the page is not mapped to a physical address, then the virtual page must be assigned a physical page.
Firstly, a physical page is chosen. This is via a least-recently-used page replacement policy (where "used" is defined as when a page miss or protection fault occurs).
In general, this physical page may have already been assigned a virtual page. So this existing page must be swapped out: this involves the same steps as the page swapping in the munmap system call emulation (§4.3.2).
The faulting virtual page is copied from the backing store to the physical page (the RAM), and the system call returns with the physical address.
Note: with a data miss, the new TLB entry that the nucleus inserts is set to be read-only. Write-access checks are done during protection traps.
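The three cases can be sketched as one function. The names are hypothetical, and two simplifications are assumed: page replacement and the copy from the backing store are hidden behind an assign_page callback, and a data miss is treated as a protection error only when the page is neither readable nor writable, which is one reading of "non-read/write".

```python
PAGE_SIZE = 8192

class SimulationHalt(Exception):
    """Stands in for halting the simulation on an invalid access."""

def mmu_miss(vaddr, is_instruction, mappings, resident, assign_page):
    """Return the physical address for a faulting virtual address.

    mappings: list of (start, end, prot), prot being a string such as "rx".
    resident: dict from virtual page to physical page.
    assign_page: callable that chooses a physical page for a virtual page
                 (replacement and copy-in from the backing store).
    """
    for start, end, prot in mappings:
        if start <= vaddr < end:
            break
    else:                                          # case 1: no mapping at all
        raise SimulationHalt("unmapped address 0x%x" % vaddr)
    needed = "x" if is_instruction else "rw"
    if not any(p in prot for p in needed):         # case 1: protection error
        raise SimulationHalt("protection error at 0x%x" % vaddr)
    vpage = vaddr // PAGE_SIZE
    if vpage not in resident:                      # case 2: valid, not in RAM yet
        resident[vpage] = assign_page(vpage)
    # case 3 (and the end of case 2): return the physical address
    return resident[vpage] * PAGE_SIZE + vaddr % PAGE_SIZE
```

The caller, playing the role of the nucleus, would then insert the translation into the TLB and retry the faulting instruction.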
Initialisation and program loading
The initialisation of Solemn occurs when the nucleus calls the solemn-init system trap during the boot phase (§3.1.1).
This sets up the file and virtual memory subsystems, then loads the program into virtual memory (see §4.4.1) and loads the dynamic linker as well if required (see §4.4.2).
Finally, the initial process stack is created as required by Solaris.
Program loading and dynamic linking.
Program loading involves opening the executable file, and parsing the ELF (executable and linking format) structure contained within the file. The ELF header is checked to ensure that the file is a valid SPARC executable, and the entry point (the virtual address where the executable starts execution) is extracted.
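The header check described above can be sketched with the standard ELF header layout. The function name and error messages are hypothetical; the field order and the constants (ET_EXEC = 2, EM_SPARC = 2, EM_SPARC32PLUS = 18, EM_SPARCV9 = 43) come from the ELF specification. The sketch accepts only executables, so a shared object such as the dynamic linker would need the type check relaxed.

```python
import struct

ET_EXEC = 2
SPARC_MACHINES = {2: "EM_SPARC", 18: "EM_SPARC32PLUS", 43: "EM_SPARCV9"}

def read_elf_entry(header):
    """Validate an ELF header and return (is_64bit, entry point)."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class, ei_data = header[4], header[5]
    if ei_class not in (1, 2):                 # 1 = 32-bit, 2 = 64-bit
        raise ValueError("unknown ELF class")
    if ei_data != 2:                           # 2 = big-endian
        raise ValueError("SPARC executables are big-endian")
    is_64bit = ei_class == 2
    # After the 16 identification bytes: e_type, e_machine, e_version,
    # then e_entry, which is 4 or 8 bytes depending on the class.
    fmt = ">HHIQ" if is_64bit else ">HHII"
    e_type, e_machine, _version, e_entry = struct.unpack_from(fmt, header, 16)
    if e_type != ET_EXEC:
        raise ValueError("not an executable")
    if e_machine not in SPARC_MACHINES:
        raise ValueError("not a SPARC executable")
    return is_64bit, e_entry
```

The class byte also tells the loader which of the 32 or 64-bit structure layouts to use for the rest of the file.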
An ELF executable also contains a program table with loadable program table entries.
For a dynamically linked executable, the executable is loaded as per normal, but then the interpreter (the dynamic linker) is also loaded, which is a similar process to loading the executable. The dynamic linker also requires that an auxiliary vector table be created within the process stack.
The development of Solemn
The development of Solemn is a follow on from the UserSim [1] module in Sparc Sulima. UserSim used the C library from RSIM [5] , and so the executable to be simulated needed to be statically linked with that C library. As a result, there was a limit to the sorts of programs that could be simulated within UserSim. In particular, programs that used mmap could not be simulated.
We investigated the possibility of extending UserSim and the C library to support mmap and dynamically linked executables. This idea was discarded in favour of using an existing full C library, so we did not have to reinvent the wheel. Solaris was chosen as the platform since we were already familiar with it, and our main program of interest, Gaussian, was optimised for it.
The main issue with developing Solemn has been deficiencies in the documentation of Solaris. Some examples of this are:
• The initial process stack [7, pp 3P-25 to 3P-27] and dynamic linking [9, pp 248-255] are reasonably well documented, although the required values for various auxiliary vector entries were only determined by reading the Solaris kernel source.
• While the mapping from system call number to name is well documented, the fact that some system calls are multiplexed, and the system call error method, are not. Judicious reading of Solaris kernel and C library source was and continues to be required.
Debugging such a beast is, not surprisingly, quite difficult. This is particularly evident during dynamic linking, when a huge number of instructions are simulated. The principal techniques used were:
• Targeted debugging information, including debugging levels and masks.
• Small test programs which test different components of Solemn, in particular different system calls.
Current status
Currently, Solemn can emulate a single-threaded Solaris executable, 32 or 64-bit, and statically or dynamically linked. Solemn recognises and handles about 25 system calls, although this is increasing as we test programs that use them.
The effect dynamic linking has on process startup is made quite evident when simply counting the number of instructions simulated. Table 1 shows the number of instructions evaluated when simulating a trivial C program. The overall simulation time of all of these is less than a second.
Conclusions and future work
Solemn extends Sparc Sulima to allow it to simulate unmodified, dynamically linked, 32 or 64-bit Solaris executables. With this support, the behaviour of many more interesting programs can be examined than was previously possible.
Since Solemn provides true virtual memory, including swapping, it is possible to examine the effect of running programs in limited RAM. This is something we believe is not possible in previous user-level simulators.
We are currently working on extending the system calls supported by Solemn, in particular thread support. This will allow us to simulate many commercial and scientific workloads.
Sparc Sulima (including Solemn) is open source, with source code available under the GNU GPL at http://cap.anu.edu.au/cap/projects/sulima/.
Acknowledgements

