Based on functional reasons, we describe some concepts which must be implemented by a µ-kernel ... and Wulf et al. [1974]. Traditionally, the word 'kernel' is used to denote the part of the operating system that is mandatory and common to all other software. The basic idea of the µ-kernel approach is to minimize this part, i.e. to implement outside the kernel whatever possible.
*GMD SET-RS, 53754 Sankt Augustin, Germany
SIGOPS '95 12/95 CO, USA © 1995 ACM 0-89791-715-4/95/0012 ...$3.50
The software technological advantages of this approach are obvious:
• A clear µ-kernel interface enforces a more modular system structure.
• Servers can use the mechanisms provided by the µ-kernel like any other user program. Server malfunction is as isolated as any other user program's malfunction.
• The system is more flexible and tailorable. Different strategies and APIs, implemented by different servers, can coexist in the system.

Although much effort has been invested in µ-kernel construction, the approach is not (yet) generally accepted. This is due to the fact that most existing µ-kernels do not perform sufficiently well. Lack of efficiency also heavily restricts flexibility, since important mechanisms and principles cannot be used in practice due to poor performance. In some cases, the µ-kernel interface has been weakened and special servers have been re-integrated into the kernel to regain efficiency.
It is widely believed that the mentioned inefficiency (and thus inflexibility) is inherent to the µ-kernel approach. Folklore holds that increased user-kernel mode and address-space switches are responsible.
At first glance, published performance measurements seem to support this view. In fact, the cited performance studies measured the performance of a particular µ-kernel based system without analyzing the reasons which limit efficiency. We can only guess whether it is caused by the µ-kernel approach, by the concepts implemented by this particular µ-kernel or by the implementation of the µ-kernel. The question remains open. It might be possible that we are still not applying the appropriate construction techniques.
For the above reasons, we feel that a conceptual analysis is needed which derives µ-kernel concepts from pure functionality requirements (section 2) and which discusses achievable performance (section 4) and flexibility (section 3). Further sections discuss portability (section 5) and the chances of some new developments (section 6).
Some µ-Kernel Concepts
In this section, we reason about the minimal concepts or "primitives" that a µ-kernel should implement. The determining criterion used is functionality, not performance. More precisely, a concept is tolerated inside the µ-kernel only if moving it outside the kernel, i.e. permitting competing implementations, would prevent the implementation of the system's required functionality.
We assume that the target system has to support interactive and/or not completely trustworthy applications, i.e. it has to deal with protection. We further assume that the hardware implements page-based virtual memory.
One inevitable requirement for such a system is that a programmer must be able to implement an arbitrary subsystem S in such a way that it cannot be disturbed or corrupted by other subsystems S'. This is the principle of independence: S can give guarantees independent of S'. The second requirement is that other subsystems must be able to rely on these guarantees. This is the principle of integrity: there must be a way for S1 to address S2 and to establish a communication channel which can neither be corrupted nor eavesdropped by S'.
Provided hardware and kernel are trustworthy, further security services, like those described by Gasser et al. [1989], can be implemented by servers. Their integrity can be ensured by system administration or by user-level boot servers. For illustration: a key server should deliver public-secret RSA key pairs on demand. It should guarantee that each pair has the desired RSA property and that each pair is delivered only once and only to the demander. The key server can only be realized if there are mechanisms which (a) protect its code and data, (b) ensure that nobody else reads or modifies the key and (c) enable the demander to check whether the key comes from the key server. Finding the key server can be done by means of a name server and checked by public key based authentication.
Grant.
The owner of an address space can grant any of its pages to another space, provided the recipient agrees. The granted page is removed from the granter's address space and included into the grantee's address space. The important restriction is that instead of physical page frames, the granter can only grant pages which are already accessible to itself.
Map.
The owner of an address space can map any of its pages into another address space, provided the recipient agrees. Afterwards, the page can be accessed in both address spaces. In contrast to granting, the page is not removed from the mapper's address space. As in the granting case, the mapper can only map pages which it can already access itself.
Flush.
The owner of an address space can flush any of its pages. The flushed page remains accessible in the flusher's address space, but is removed from all other address spaces which had received the page directly or indirectly from the flusher. Although explicit consent of the affected address-space owners is not required, the operation is safe, since it is restricted to the flusher's own pages. The users of these pages already agreed to accept a potential flushing when they received the pages by mapping or granting.
Appendix A contains a more precise definition of address spaces and the above three operations.
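The three operations can be modelled in a few lines. The following Python sketch is only an illustration under simplifying assumptions: the class `Kernel`, its method names and the tuple encoding of entries are hypothetical, the recipient's agreement is not modelled, and derived mappings are found by a global scan rather than by an efficient data structure.

```python
NIL = None  # the nil page: cannot be accessed


class Kernel:
    """Toy model (hypothetical names). Each address space maps a virtual
    page to ("phys", frame) or to a link ("link", space, v)."""

    def __init__(self):
        self.spaces = {}  # space name -> {virtual page -> entry}

    def create(self, name, initial=None):
        self.spaces[name] = dict(initial or {})

    def resolve(self, space, v):
        """Follow links down to a physical frame; None if not accessible."""
        entry = self.spaces[space].get(v, NIL)
        while entry is not NIL and entry[0] == "link":
            _, space, v = entry
            entry = self.spaces[space].get(v, NIL)
        return entry[1] if entry is not NIL else None

    def _assign(self, space, v, entry):
        # Internal step: flush everything derived from (space, v),
        # then store the new entry (or remove the page for NIL).
        self.flush(space, v)
        if entry is NIL:
            self.spaces[space].pop(v, None)
        else:
            self.spaces[space][v] = entry

    def flush(self, space, v):
        # Remove, recursively, every mapping that links to (space, v).
        # The page itself stays in the flusher's space.
        for other, table in self.spaces.items():
            for v2, entry in list(table.items()):
                if entry == ("link", space, v):
                    self._assign(other, v2, NIL)

    def map(self, src, v, dst, v2):
        # Only pages already accessible to the mapper can be mapped.
        if self.resolve(src, v) is None:
            raise KeyError("page not accessible to mapper")
        self._assign(dst, v2, ("link", src, v))

    def grant(self, src, v, dst, v2):
        # Only pages already accessible to the granter can be granted.
        if self.resolve(src, v) is None:
            raise KeyError("page not accessible to granter")
        entry = self.spaces[src][v]
        self._assign(dst, v2, entry)  # transfer the existing entry
        self._assign(src, v, NIL)     # remove it from the granter
```

For example, if a pager maps a page to A and A maps it on to B, a flush by the pager removes the page from both A and B while the pager keeps it.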
Reasoning
The described address-space concept leaves memory management and paging outside the µ-kernel; only the grant, map and flush operations are retained inside the kernel. Mapping and flushing are required to implement memory managers and pagers on top of the µ-kernel. In general, granting is used when page mappings should be passed through a controlling subsystem without burdening the controller's address space by all pages mapped through it.
The model can easily be extended to access rights on pages. Mapping and granting copy the source page's access rights or a subset of them, i.e., they can restrict the access but not widen it. Special flushing operations may remove only specified access rights.
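The "restrict but never widen" rule amounts to a set intersection. The sketch below is a minimal illustration; the function names and the encoding of rights as the strings "r", "w" and "x" are assumptions made for this example, not part of the described kernel interface.

```python
def derive_rights(source_rights, requested):
    """Rights of a mapped or granted page: never more than the source has."""
    return frozenset(source_rights) & frozenset(requested)


def flush_rights(current_rights, revoked):
    """A partial flush that removes only the specified rights."""
    return frozenset(current_rights) - frozenset(revoked)
```

Requesting write and execute access from a read-write page therefore yields write access only.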
I/O
An address space is the natural abstraction for incorporating device ports. This is obvious for memory mapped I/O, but I/O ports can also be included. The granularity of control depends on the given processor. The 386 and its successors permit control per port (one very small page per port) but no mapping of port addresses (it enforces mappings with v = v'); the PowerPC uses pure memory mapped I/O, i.e., device ports can be controlled and mapped with 4K granularity.

If S1 wants to send a message to S2, it needs to specify the destination S2 (or some channel leading to S2). Therefore, the µ-kernel must know which uid relates to S2. On the other hand, the receiver S2 wants to be sure that the message comes from S1.
Therefore the identifier must be unique, both in space and time.
In theory, cryptography could also be used. In practice, however, enciphering messages for local communication is far too expensive and the kernel must be trusted anyway.
Nor can S2 rely on purely user-supplied capabilities, since S1 or some other instance could duplicate them and pass them to untrusted subsystems without control of S2.
Flexibility
To illustrate the flexibility of the basic concepts, we sketch some applications which typically belong to the basic operating system but can easily be implemented ...

... An address-space switch thus requires a TLB flush. The real costs are determined by the TLB load operations which are required to re-establish the current working set later. If the working set consists of n pages, the TLB is fully-associative, has s entries and a TLB miss costs m cycles, at most min(n, s) × m cycles are required in total.
Apparently, larger untagged TLBs lead to a performance problem.
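The bound min(n, s) × m can be evaluated directly. The sketch below plugs in the Pentium figures discussed in this section (TLBs with 32 and 64 entries, 9 cycles per miss, one interception every 100 µs). The 90 MHz clock rate is an assumption introduced here to turn cycles into a time fraction; the function names are likewise only illustrative.

```python
def tlb_reload_cycles(n, s, m):
    """Upper bound on cycles needed to reload a working set of n pages
    into a TLB with s entries when each miss costs m cycles."""
    return min(n, s) * m


def switch_overhead(reload_cycles, interval_us, clock_mhz):
    """Fraction of time spent reloading if a flush occurs every interval_us."""
    cycles_per_interval = interval_us * clock_mhz
    return reload_cycles / cycles_per_interval


MISS_COST = 9         # cycles per TLB miss
TLB_SIZES = (32, 64)  # entries of the two TLBs

# Worst case: the working set fills both TLBs completely.
total = sum(tlb_reload_cycles(n=s, s=s, m=MISS_COST) for s in TLB_SIZES)

# Assumed 90 MHz clock (not stated in the text).
overhead = switch_overhead(total, interval_us=100, clock_mhz=90)
```

Under these assumptions `total` is 864 cycles and `overhead` is slightly below 0.1, which is in line with the order of magnitude quoted in the text.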
For example, completely reloading the Pentium's data and code TLBs requires at least (32 + 64) × 9 = 864 cycles. Therefore, intercepting a program every 100 µs could imply an overhead of up to 9%. Although using the complete TLB is unrealistic, ...

In figure 3, we present the results of Chen's figure 2-1 in a slightly reordered way. We have colored black those MCPI contributions that are due to system i-cache or d-cache misses. ... suggest a potential problem due to OS structure.
Chen and Bershad measured cache conflicts by comparing the direct mapped to a simulated 2-way cache. They found that system self-interference is more important than user/system interference, but the data also show that the ratio of conflict to capacity misses in Mach is lower than in Ultrix. Table 4 shows the conflict (black) and capacity (white) system cache misses both in an absolute scale (left) and as ratio (right). From this we can deduce that the increased cache misses are caused by higher cache consumption of the system (Mach + emulation library + Unix server), not by conflicts which are inherent to the system's structure.
The next task is to find the component which is responsible for the higher cache consumption.
We 
Conclusion.
The hypothesis that µ-kernel architectures inherently lead to memory system degradation is not substantiated.
On the contrary, the quoted measurements support the hypothesis that properly constructed µ-kernels will automatically avoid the memory system degradation measured for Mach.
Non-Portability
Older µ-kernels were built machine-independently on top of a small hardware-dependent layer. This approach has strong advantages from the software technological point of view: programmers did not need to know very much about processors and the resulting µ-kernels could easily be ported to new machines. Unfortunately, this approach prevented these µ-kernels from achieving the necessary performance and thus flexibility.
In retrospect, we should not be surprised, since building a µ-kernel on top of abstract hardware has serious implications:
• Such a µ-kernel cannot take advantage of specific hardware.
• It cannot take precautions to circumvent or avoid performance problems of specific hardware.
• The additional layer per se costs performance.
µ-kernels form the lowest layer of operating systems beyond the hardware. Therefore, we should accept that they are as hardware dependent as optimizing code generators. We have learned that not only the coding but even the algorithms used inside a µ-kernel and its internal concepts are extremely processor dependent.

... As a result, on Pentium, the segment register method always pays (see figure 2).
As a consequence, we have to implement an additional user-address-space multiplexer, we have to modify address-space switch routines, handling of user supplied addresses, thread control blocks, task control blocks, the IPC implementation and the address-space structure as seen by the kernel. In total, the mentioned changes affect algorithms in about half of all µ-kernel modules.
IPC implementation. Due to reduced associativity, the Pentium caches tend to exhibit increased conflict misses.
One simple way to improve cache behaviour during IPC is by restructuring the thread control block data such that it profits from the doubled cache line size. This can be adopted for the 486 kernel as well, since it has no effect on the 486 and can be implemented transparently to the user.
In the 486 kernel, thread control blocks (including kernel stacks) were page aligned. IPC always accesses 2 control blocks and kernel stacks simultaneously. The cache hardware maps the corresponding data of both control blocks to identical cache addresses. Due to its 4-way associativity, this problem could be ignored on the 486. However, Pentium's data cache is only 2-way set-associative. A nice optimization is to align thread control blocks no longer on 4K but on 1K boundaries. (1K is the lower bound due to internal reasons.) Then there is a 75% chance that two randomly selected control blocks do not compete in the cache. Surprisingly, this affects the internal bit-structure of unique thread identifiers supplied by the µ-kernel (see [Liedtke 1993] for details). Therefore, the new kernel cannot simply replace the old one, since (persistent) user programs already hold uids which would become invalid.
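The 75% figure follows from counting the distinct offsets a control block can have within one cache way. The sketch below assumes an 8 KB, 2-way set-associative data cache, i.e. 4 KB per way; the cache size is an assumption added for illustration, and the function name is hypothetical.

```python
from itertools import product


def no_conflict_probability(alignment, cache_bytes, ways):
    """Probability that two randomly chosen control blocks start at
    different offsets within a cache way, so that corresponding fields
    of both blocks fall into different cache sets."""
    way_bytes = cache_bytes // ways
    # Distinct start offsets (modulo the way size) permitted by the alignment.
    offsets = sorted({a % way_bytes
                      for a in range(0, max(way_bytes, alignment), alignment)})
    pairs = list(product(offsets, repeat=2))
    differing = sum(1 for a, b in pairs if a != b)
    return differing / len(pairs)


# Assumed cache geometry: 8 KB, 2-way set-associative.
aligned_4k = no_conflict_probability(4096, cache_bytes=8192, ways=2)
aligned_1k = no_conflict_probability(1024, cache_bytes=8192, ways=2)
```

With 4K alignment every control block starts at the same offset within a way, so corresponding data always compete for the same sets. With 1K alignment there are four possible offsets, and two random blocks differ in three out of four cases.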
Incompatible Processors
Processors of competing families differ in instruction set, register architecture, exception handling, cache/TLB architecture, protection and memory model. Especially the latter ones radically influence µ-kernel structure.
There are systems with multi-level page tables, hashed page tables, (no) reference bits, (no) page protection, strange page protection¹¹, single/multiple page sizes, $2^{32}$-, $2^{43}$-, $2^{52}$- and $2^{64}$-byte address spaces, flat and segmented address spaces, various segment models, tagged/untagged TLBs, virtually/physically tagged caches.
¹¹ E.g. the 386 ignores write protection in kernel mode; the PowerPC supports read only in kernel mode, but this implies that the page is seen in user mode as well.
The differences are orders of magnitude higher than be- ...

... that a hypothetical L3-equivalent "Exo-IPC" would cost about 65 cycles on the R2000. Finally, we must take into consideration that the cycles of both processors are not equivalent as far as most-frequently-executed instructions are concerned.
Based on SpecInts, roughly 1.4 486-cycles appear to do as much work as one R2000-cycle; comparing the five instructions most relevant in this context (2-op-alu, 3-op-alu, load, branch taken and not taken) gives 1.6 for well-optimized code. Thus we estimate that the Exo-IPC would cost up to approx. 100 486-cycles, definitely less than L3's 250 cycles. This substantial difference in timing indicates an isolated difference between both processor architectures that strongly influences IPC (and perhaps other µ-kernel mechanisms), but not average programs.
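The conversion behind this estimate is a single multiplication per ratio. The following sketch merely replays the arithmetic with the figures quoted above; the function and variable names are illustrative.

```python
def to_486_cycles(r2000_cycles, ratio):
    """Convert R2000 cycles into roughly equivalent 486 cycles."""
    return r2000_cycles * ratio


EXO_IPC_R2000 = 65  # estimated cycles on the R2000
L3_IPC_486 = 250    # L3 cycles on the 486, as quoted in the text

low = to_486_cycles(EXO_IPC_R2000, 1.4)   # SpecInt-based ratio
high = to_486_cycles(EXO_IPC_R2000, 1.6)  # ratio from the five relevant instructions
```

Both bounds, roughly 91 and 104 cycles, stay well below the 250 cycles of L3, which matches the "approx. 100" stated above.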
In fact, the 486 processor imposes a high penalty on entering/exiting the kernel and requires a TLB flush per IPC due to its untagged TLB. This costs at least ...

From the above example, we learn two lessons: ...

Spin. Spin [Bershad et al. 1994; Bershad et al. 1995] ...

... In accordance with our processor-dependency thesis, the exokernel is tailored to the R2000 and gets excellent performance values for its primitives.
In contrast to our approach, it is based on the philosophy that a kernel should not provide abstractions but only a minimal set of primitives.

... A page potentially mapped at $v$ in $\sigma$ is flushed, and the new value $x$ is copied into $\sigma_v$: $\sigma_v \leftarrow x$. This operation is internal to the µ-kernel. We use it only for describing the three exported operations.
A subsystem S with address space $\sigma$ can grant any of its pages $v$ to a subsystem S' with address space $\sigma'$ provided S' agrees:
$\sigma'_{v'} \leftarrow \sigma_v, \quad \sigma_v \leftarrow \phi.$
Note that S determines which of its pages should be granted, whereas S' determines at which virtual address $v'$ the granted page should be mapped in $\sigma'$. The granted page is transferred to $\sigma'$ and removed from $\sigma$.
A subsystem S with address space $\sigma$ can map any of its pages $v$ to a subsystem S' with address space $\sigma'$ provided S' agrees: $\sigma'_{v'} \leftarrow (\sigma, v)$.
In contrast to grant, the mapped page remains in the mapper's space $\sigma$, and a link $(\sigma, v)$ to the page in the mapper's address space is stored in the receiving address space $\sigma'$, instead of transferring the existing link from $\sigma_v$ to $\sigma'_{v'}$. This operation permits constructing address spaces recursively, i.e. new spaces based on existing ones.
Flushing, the reverse operation, can be executed without explicit agreement of the mappees, since they agreed implicitly when accepting the prior map operation. S can flush any of its pages: $\forall\, \sigma'_{v'} = (\sigma, v):\ \sigma'_{v'} \leftarrow \phi$.
Note that $\leftarrow$ and flush are defined recursively. Flushing also recursively affects all mappings which are indirectly derived from $\sigma_v$.
No cycles can be established by these three operations, since $\leftarrow$ flushes the destination prior to copying. The recommended implementation of $\sigma$ is to use one mapping tree per physical page frame which describes all actual mappings of the frame. Each node contains $(P, v)$, where $v$ is the corresponding virtual page in the address space which is implemented by the page table $P$. Assume that a grant, map or flush operation deals with a page $v$ in address space $\sigma$ to which the page table $P$ is associated.
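A mapping tree of this kind can be sketched as follows. This is a minimal illustration rather than the kernel's actual data structure: the names `Node`, `map_page` and `flush_page` are hypothetical, page tables are represented by plain strings, and grant is omitted.

```python
class Node:
    """One mapping of a physical frame: virtual page v in the address
    space implemented by page table P."""

    def __init__(self, page_table, v, parent=None):
        self.page_table = page_table
        self.v = v
        self.children = []  # mappings derived directly from this one
        if parent is not None:
            parent.children.append(self)

    def find(self, page_table, v):
        """Return the node holding (page_table, v) in this subtree, or None."""
        if (self.page_table, self.v) == (page_table, v):
            return self
        for child in self.children:
            hit = child.find(page_table, v)
            if hit is not None:
                return hit
        return None


def map_page(node, page_table, v):
    """Mapping adds a child below the mapper's node."""
    return Node(page_table, v, parent=node)


def flush_page(node):
    """Flushing drops the whole subtree below the flusher's node and
    returns the removed (page_table, v) pairs; the node itself stays."""
    removed = []
    stack = list(node.children)
    while stack:
        current = stack.pop()
        removed.append((current.page_table, current.v))
        stack.extend(current.children)
    node.children = []
    return removed


# One tree per physical frame, here keyed by a hypothetical frame number.
trees = {7: Node("P_pager", 0)}
```

An operation on page v of a given space first picks the tree of the underlying frame, then calls `find` to locate the node for (P, v), and finally adds a child or drops the subtree there.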
In a first step, the operation selects the corresponding tree by $P_v$, the physical page. In the next step, it selects the node of the tree that contains
