NUMA-Aware User-Level Memory Management for
Microkernel-Based Operating Systems
Philipp Kupferschmied, Jan Stoess, Frank Bellosa
System Architecture Group
University of Karlsruhe
In the recent past there has been growing
interest from researchers and practitioners in
operating system (OS) design with a minimal kernel
base. While minimal (or micro) kernel based sys-
tems are a long-standing idea from a standpoint
of OS research, such systems are now success-
fully employed in real-world scenarios, typically ei-
ther as true microkernel systems or as hypervisors
[5, 7]. At the same time, tremendous advances
in multi-processing and multi-core technology have
led to computer platforms where typical memory
demands are so high that memory bandwidth
is becoming a crucial bottleneck. The answer to
that problem is a non-uniform memory architecture
(NUMA), where processors or subsets of processors
are connected to their own portion of main memory
via a dedicated bus, thus alleviating overall load
on each of the buses. NUMA architectures typi-
cally imply that the local portion of memory can
be accessed fast, whereas remote accesses are sig-
nificantly slower (as an example, an AMD Opteron
NUMA platform in our lab shows a remote-to-local
read memory access ratio of about 1.4). From a
software standpoint, data locality and placement
are thus of paramount importance in such systems.
Ideally, most of a processor’s working set should
remain in its local memory. Common techniques
to work toward that goal are data replication and
migration.
In this work, we explore the requirements for
a microkernel-based, NUMA-aware operating system.
On the one hand, designing a system for NUMA
architectures requires improving the locality of
all data items where possible. On the other hand,
the goal of maximum flexibility for a microkernel
requires keeping changes to the microkernel
itself to a minimum. We thus strive for an
OS architecture which allows for flexible memory
management at user-level, but also improves the
locality of kernel data structures without requiring
policies within the kernel.
Our approach stands in contrast to virtual machine
systems such as Disco [2], since it enables explicit
application-controlled NUMA memory management
rather than encapsulating applications in
virtual machines and hiding memory-management
issues from them. It also differs from the Tornado
[4] and K42 [1] approaches in that it restricts the
kernel to providing simple and low-level primitives
such as per-node address spaces, rather than using
a sophisticated clustered object system, which al-
lows multiple-component objects to appear like a
single object.
First published in: Poster/WiP session of the EuroSys'09, Nuremberg, Germany, April 2009
EVA-STAR (Elektronisches Volltextarchiv – Scientific Articles Repository)
http://digbib.ubka.uni-karlsruhe.de/volltexte/1000026891
Our core kernel design principle is that the kernel
should provide only low-level and localized
memory-management abstractions, leaving most of
the NUMA address space construction to user-level.
In particular, the kernel offers support for local
address space construction only: an address space
always belongs to the NUMA node on which it is
created, and the kernel allocates all address space
management information on that node. This makes
the base address space construction primitives
more efficient and more predictable. However, an
address space can contain memory mappings both to
local and to remote memory. Applications can thus
construct cross-node NUMA address spaces by hand,
that is, by creating logically coherent per-node
address spaces with identical virtual-to-physical
mappings, either to node-local or to remote memory.
Synchronization of address spaces is left up to the
applications; the kernel offers a cross-node
communication primitive to allow signaling and
synchronization on address space
updates. Node-local regions (virtual addresses that
are mapped to different physical addresses on each
node, a useful property e.g. for transparent code
replication) can easily be constructed by not syn-
chronizing all mappings between related address
spaces. When a thread migrates from one node
to another, it is also migrated into another address
space. The bookkeeping of which address spaces
belong to the same application is performed at user-
level.
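The construction described above can be illustrated with a minimal Python sketch. It is a toy model and not L4 code: the names AddressSpace, Application, map_global, map_node_local and space_after_migration are hypothetical, and the sketch synchronizes global mappings eagerly for brevity, which is only one of the policies an application might choose.

```python
from dataclasses import dataclass, field


@dataclass
class AddressSpace:
    """Hypothetical stand-in for one kernel address space."""
    home_node: int                                 # node the space was created on
    mappings: dict = field(default_factory=dict)   # virtual page -> (node, frame)


class Application:
    """User-level bookkeeping of the address spaces of one application."""

    def __init__(self, nodes):
        # One address space per NUMA node, created on that node.
        self.spaces = {n: AddressSpace(home_node=n) for n in nodes}

    def map_global(self, vpage, node, frame):
        # Identical virtual-to-physical mapping in every per-node space;
        # the frame is local to one node and remote to the others.
        for space in self.spaces.values():
            space.mappings[vpage] = (node, frame)

    def map_node_local(self, vpage, frame_on_node):
        # Deliberately unsynchronized: each node maps the same virtual page
        # to its own local frame (e.g. a replica of read-only code).
        for n, frame in frame_on_node.items():
            self.spaces[n].mappings[vpage] = (n, frame)

    def space_after_migration(self, to_node):
        # A thread that migrates to another node continues in that node's space.
        return self.spaces[to_node]


app = Application(nodes=[0, 1])
app.map_global(vpage=0x10, node=0, frame=7)
app.map_node_local(vpage=0x20, frame_on_node={0: 3, 1: 9})
```

In this model, virtual page 0x10 resolves to the same frame on both nodes, whereas page 0x20 resolves to a different, node-local frame in each per-node address space.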
For evaluation, we have built a NUMA-aware OS
prototype based on the L4 microkernel [6]. L4 of-
fers address spaces as a first class abstraction. They
are populated by user-level pagers, which can map
pages of memory into an address space when a
pagefault occurs, and, later on, unmap them asyn-
chronously if needed. The kernel represents ad-
dress spaces by means of page tables; the map-
ping dependencies are represented by means of a
mapping database (MDB) that keeps track of how
physical frames are mapped to address spaces. We
enhanced L4’s in-kernel representation of address
spaces by assigning a home node to each address
space, set to the NUMA node on which the address
space is created. The kernel allocates memory
for page tables and corresponding MDB entries
on that home node. When a user-level application
is active on multiple nodes in parallel, we use
one L4 address space on each node instead of using
a single, node-spanning address space. By setting
the home node appropriately, this solution ensures
that the corresponding kernel data is allocated on
the NUMA node on which the address space is
active. We propose to use per-node pagefault
handlers, each of which is responsible for handling
pagefaults that occur on its node. For all mappings
that shall be global, that is, equally visible in
all per-node address spaces, synchronization is per-
formed between these pagers, and entirely at user-
level: From the viewpoint of the kernel, the per-
node address spaces are completely unrelated. This
eliminates the need for an additional synchroniza-
tion primitive within the kernel, the choice of which
would depend on hardware characteristics such as
the remote-to-local latency ratio [3]. The user-
level pagers can implement arbitrary synchroniza-
tion mechanisms, based on shared memory, IPC,
or combinations of both, and tailored towards the
hardware and application characteristics.
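The effect of the home node on the placement of kernel data can be sketched with a small Python model. This is an illustration under stated assumptions, not the prototype's implementation: ToyKernel and metadata_layout are hypothetical names, and the cost of one page-table entry plus one MDB entry per mapping is an assumed simplification.

```python
from collections import Counter


class ToyKernel:
    """Toy model that counts kernel metadata entries per NUMA node."""

    def __init__(self):
        self.home_of = {}                   # address space id -> home node
        self.metadata_on_node = Counter()   # node -> number of metadata entries

    def create_space(self, space_id, creating_node):
        # The home node is the node on which the space is created.
        self.home_of[space_id] = creating_node

    def map_page(self, space_id):
        # Assumed cost: one page-table entry plus one MDB entry per mapping,
        # both allocated on the home node of the address space.
        self.metadata_on_node[self.home_of[space_id]] += 2


def metadata_layout(per_node_spaces, nodes=(0, 1), pages_per_node=4):
    kernel = ToyKernel()
    if per_node_spaces:
        space_for = {n: f"space@{n}" for n in nodes}
    else:
        space_for = {n: "spanning" for n in nodes}
    for n in nodes:
        if space_for[n] not in kernel.home_of:
            kernel.create_space(space_for[n], creating_node=n)
    # Threads on every node establish the same number of mappings.
    for n in nodes:
        for _ in range(pages_per_node):
            kernel.map_page(space_for[n])
    return dict(kernel.metadata_on_node)
```

With per-node address spaces the model places the metadata evenly on both nodes, while a single node-spanning space created on node 0 concentrates all of it there, so that threads on node 1 would cause remote accesses to kernel data.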
Synchronization between per-node address
spaces is performed lazily, by means of pagefaults.
Consequently, n pagefaults will occur until a global
mapping is established on n nodes. However, if
parallel threads work mainly on local data, the
actual number of pagefaults is comparable to using
a single address space only.
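The pagefault counts stated above can be reproduced with a short Python sketch of lazy synchronization. It is a simplified model under assumptions: LazyPagers is a hypothetical name, the shared dictionary stands in for whatever shared-memory or IPC-based mechanism the pagers actually use, and frame numbers are allocated by a made-up counter.

```python
class LazyPagers:
    """Toy model of per-node pagers that synchronize mappings lazily."""

    def __init__(self, nodes):
        self.global_map = {}                    # shared table: vpage -> frame
        self.space = {n: {} for n in nodes}     # per-node address spaces
        self.faults = 0

    def access(self, node, vpage):
        local = self.space[node]
        if vpage not in local:
            # Pagefault on this node: its pager looks up (or creates) the
            # global mapping and installs it in the local address space.
            self.faults += 1
            if vpage not in self.global_map:
                self.global_map[vpage] = len(self.global_map)  # hypothetical frame allocation
            local[vpage] = self.global_map[vpage]
        return local[vpage]


# One globally shared page touched from all four nodes.
shared = LazyPagers(nodes=range(4))
frames = {shared.access(n, vpage=0x10) for n in range(4)}

# Each node works only on its own page, touching it three times.
local = LazyPagers(nodes=range(4))
for n in range(4):
    for _ in range(3):
        local.access(n, vpage=0x100 + n)
```

In the first scenario the model records one pagefault per node for the shared page. In the second it records one pagefault per page, which is the same number a single address space would incur for four distinct pages.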
A pager can only unmap mappings it has estab-
lished. If a pager on one node wants to globally
unmap a page, it must notify all other pagers to
do so as well, making the unmap operation more
expensive. Further investigations must show whether this
overhead is acceptable in realistic scenarios, and whether
there are more efficient solutions.
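The cost of a global unmap can be sketched in Python as follows. The sketch is hypothetical: Pager, map_page, unmap_local and unmap_global are illustrative names, and the direct method call on each peer stands in for an IPC or shared-memory notification whose real cost the model does not capture.

```python
class Pager:
    """Toy per-node pager that can only unmap its own mappings."""

    def __init__(self, node):
        self.node = node
        self.mapped = set()   # virtual pages this pager has mapped on its node
        self.peers = []       # pagers of the other nodes

    def map_page(self, vpage):
        self.mapped.add(vpage)

    def unmap_local(self, vpage):
        self.mapped.discard(vpage)

    def unmap_global(self, vpage):
        # Unmap locally, then ask every other pager to do the same.
        self.unmap_local(vpage)
        notifications = 0
        for peer in self.peers:
            peer.unmap_local(vpage)   # stands in for an IPC or shared-memory request
            notifications += 1
        return notifications


pagers = [Pager(n) for n in range(4)]
for p in pagers:
    p.peers = [q for q in pagers if q is not p]
    p.map_page(0x10)

sent = pagers[0].unmap_global(0x10)
```

On n nodes the initiating pager sends n - 1 notifications in this model, which is the additional cost of a global unmap compared to a single address space.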
References
[1] J. Appavoo et al. Enabling scalable performance
for general purpose workloads on shared mem-
ory multiprocessors. Technical report, IBM Re-
search, 2003.
[2] E. Bugnion et al. Disco: Running commod-
ity operating systems on scalable multiproces-
sors. ACM Transactions on Computer Systems,
15(4), 1997.
[3] E. M. Chaves et al. Kernel-kernel communica-
tion in a shared-memory multiprocessor. Con-
currency: Practice and Experience, 5(3), 1993.
[4] B. Gamsa et al. Tornado: Maximizing locality
and concurrency in a shared memory multipro-
cessor operating system. In Proceedings of the
third Symposium on Operating Systems Design
and Implementation. 1999.
[5] G. Heiser. Hypervisors for consumer electron-
ics. In Proceedings of the 6th IEEE Consumer
Communications and Networking Conference.
2009.
[6] J. Liedtke. On microkernel construction. In
Proceedings of the 15th ACM Symposium on
Operating Systems Principles. 1995.
[7] VMware. Virtualization overview.
http://www.vmware.com/pdf/virtualization.pdf.

