A Portable Kernel Abstraction for Low-Overhead Ephemeral Mapping Management by Chanda, A. et al.
A Portable Kernel Abstraction For Low-Overhead Ephemeral
Mapping Management
Khaled Elmeleegy, Anupam Chanda, and Alan L. Cox
Department of Computer Science
Rice University, Houston, Texas 77005, USA
{kdiaa,anupamc,alc}@cs.rice.edu
Willy Zwaenepoel
School of Computer and Communication Sciences
EPFL, Lausanne, Switzerland
willy.zwaenepoel@epfl.ch
Abstract
Modern operating systems create ephemeral
virtual-to-physical mappings for a variety of pur-
poses, ranging from the implementation of inter-
process communication to the implementation of
process tracing and debugging. With succeed-
ing generations of processors the cost of creat-
ing ephemeral mappings is increasing, particu-
larly when an ephemeral mapping is shared by
multiple processors.
To reduce the cost of ephemeral mapping management within an operating system kernel, we introduce the sf_buf ephemeral mapping interface. We demonstrate how in several kernel subsystems, including pipes, memory disks, sockets, execve(), ptrace(), and the vnode pager, the current implementation can be replaced by calls to the sf_buf interface.
We describe the implementation of the sf_buf interface on the 32-bit i386 architecture and the 64-bit amd64 architecture. This implementation reduces the cost of ephemeral mapping management by reusing existing virtual-to-physical address mappings wherever possible. We evaluate the sf_buf interface for the pipe, memory disk, and networking subsystems. Our results show that these subsystems perform significantly better when using the sf_buf interface. On a multiprocessor platform, interprocessor interrupts are greatly reduced in number or eliminated altogether.
1 Introduction
Modern operating systems create ephemeral
virtual-to-physical mappings for a variety of pur-
poses, ranging from the implementation of in-
terprocess communication to the implementation
of process tracing and debugging. To create an
ephemeral mapping two actions are required: the
allocation of a temporary kernel virtual address
and the modification of the virtual-to-physical ad-
dress mapping. To date, these actions have been
performed through separate interfaces. This pa-
per demonstrates the benefits of combining these
actions under a single interface.
This work is motivated by the increasing cost
of ephemeral mapping creation, particularly, the
increasing cost of modifications to the virtual-to-
physical mapping. To see this trend, consider the
latency in processor cycles for the invlpg in-
struction across several generations of the IA32
architecture. This instruction invalidates the
Translation Look-aside Buffer (TLB) entry for
the given virtual address. In general, the op-
erating system must issue this instruction when
it changes a virtual-to-physical mapping. When
this instruction was introduced in the 486, it took
12 cycles to execute. In the Pentium, its latency
increased to 25 cycles. In the Pentium III, its
latency increased to ∼100 cycles. Finally, in
the Pentium 4, its latency has reached ∼500 to
∼1000 cycles. So, despite a factor of three de-
crease in the cycle time between a high-end Pen-
tium III and a high-end Pentium 4, the cost of a
mapping change measured in wall clock time has
actually increased.
Furthermore, on a multiprocessor, the cost of
ephemeral mapping creation can be significantly
higher if the mapping is shared by two or more
processors. Unlike data cache coherence, TLB
coherence is generally implemented in software
by the operating system [2, 12]: The processor
initiating a mapping change issues an interpro-
cessor interrupt (IPI) to each of the processors
that share the mapping; the interrupt handler that
is executed by each of these processors includes
an instruction, such as invlpg, that invalidates
that processor’s TLB entry for the mapping’s vir-
tual address. Consequently, a mapping change is
quite costly for all processors involved.
In the past, TLB coherence was only an issue
for multiprocessors. Today, however, some im-
plementations of Simultaneous Multi-Threading
(SMT), such as the Pentium 4’s, require the op-
erating system to implement TLB coherence in a
single-processor system.
To reduce the cost and complexity of ephemeral mapping management within an operating system kernel, we introduce the sf_buf ephemeral mapping interface. As with Mach’s pmap interface [11], our objective is to provide a machine-independent interface enabling variant, machine-specific implementations. Unlike pmap, our sf_buf interface supports allocation of temporary kernel virtual addresses. We describe how various subsystems in the operating system kernel benefit from the sf_buf interface.
We present the implementation of the sf_buf interface on two representative architectures: i386, a 32-bit architecture, and amd64, a 64-bit architecture. This implementation is efficient: it performs creation and destruction of ephemeral mappings in O(1) expected time on i386 and O(1) time on amd64. The sf_buf interface enables the automatic reuse of ephemeral mappings so that the high cost of mapping changes can be amortized over several uses. In addition, this implementation of the sf_buf interface incorporates several techniques for avoiding TLB coherence operations, eliminating the need for costly IPIs.
We have evaluated the performance of the pipe, memory disk, and networking subsystems using the sf_buf interface. Our results show that these subsystems benefit significantly from its use. For the bw_pipe program from the lmbench benchmark [10], the sf_buf interface improves performance by up to 168% on one of our test platforms. In all of our experiments the number of TLB invalidations is greatly reduced or eliminated.
The rest of the paper is organized as follows. The next two sections motivate this work from two different perspectives: first, Section 2 describes the many uses of ephemeral mappings in an operating system kernel; second, Section 3 presents the execution costs for the machine-level operations used to implement ephemeral mappings. We define the sf_buf interface and its implementation on two representative architectures in Section 4. Section 5 summarizes the reduction in lines of code in an operating system kernel from using the sf_buf interface. Section 6 presents an experimental evaluation of the sf_buf interface. We present related work in Section 7 and conclude in Section 8.
2 Ephemeral Mapping Usage
We use FreeBSD 5.3 as an example to demon-
strate the use of ephemeral mappings. FreeBSD
5.3 uses ephemeral mappings in a wide vari-
ety of places, including the implementation of
pipes, memory disks, sendfile(), sockets,
execve(), ptrace(), and the vnode pager.
2.1 Pipes
Conventional implementations of Unix pipes per-
form two copy operations to transfer the data
from the writer to the reader. The writer copies
the data from the source buffer in its user address
space to a buffer in the kernel address space, and
the reader later copies this data from the kernel
buffer to the destination buffer in its user address
space.
In the case of large data transfers that fill
the pipe and block the writer, FreeBSD uses
ephemeral mappings to eliminate the copy oper-
ation by the writer, reducing the number of copy
operations from two to one. The writer first de-
termines the set of physical pages underlying the
source buffer, then wires each of these physical
pages disabling their replacement or page-out,
and finally passes the set to the receiver through
the object implementing the pipe. Later, the
reader obtains the set of physical pages from the
pipe object. For each physical page, it creates an
ephemeral mapping that is private to the current
CPU and is not used by other CPUs. Henceforth,
we refer to this kind of mapping as a CPU-private
ephemeral mapping. The reader then copies the
data from the kernel virtual address provided by
the ephemeral mapping to the destination buffer
in its user address space, destroys the ephemeral
mapping, and unwires the physical page reen-
abling its replacement or page-out.
2.2 Memory Disks
Memory disks have a pool of physical pages. To
read from or write to a memory disk a CPU-
private ephemeral mapping for the desired pages
of the memory disk is created. Then the data is
copied between the ephemerally mapped pages
and the read/write buffer provided by the user.
After the read or write operation completes, the
ephemeral mapping is freed.
2.3 sendfile(2) and Sockets
The zero-copy sendfile(2) system call and
zero-copy socket send use ephemeral mappings
in a similar way. For zero-copy send the
kernel wires the physical pages correspond-
ing to the user buffer in memory and then
creates ephemeral mappings for them. For
sendfile() it does the same for the pages of
the file. The ephemeral mappings persist until the
corresponding mbuf chain is freed, e.g., when
TCP acknowledgments are received. The ker-
nel then frees the ephemeral mappings and un-
wires the corresponding physical pages. These
ephemeral mappings are not CPU-private be-
cause they need to be shared among all the CPUs
— any CPU may use the mappings to retransmit
the pages.
Zero-copy socket receive uses ephemeral map-
pings to implement a form of page remapping
from the kernel to the user address space [6, 4, 8].
Specifically, the kernel allocates a physical page,
creates an ephemeral mapping to it, and injects
the physical page and its ephemeral mapping into
the network stack at the device driver. After the
network interface has stored data into the physi-
cal page, the physical page and its mapping are
passed upward through the network stack. Ulti-
mately, when an application asks to receive this
data, the kernel determines if the application’s
buffer is appropriately aligned and sized so that
the kernel can avoid a copy by replacing the ap-
plication’s current physical page with its own.
If so, the application’s current physical page is
freed, the kernel’s physical page replaces it in the
application’s address space, and the ephemeral
mapping is destroyed. Otherwise, the ephemeral
mapping is used by the kernel to copy the data
from its physical page to the application’s.
2.4 execve(2)
The execve(2) system call transforms the call-
ing process into a new process. The new pro-
cess is constructed from the given file. This
file is either an executable or data for an inter-
preter, such as a shell. If the file is an executable,
FreeBSD’s implementation of execve(2) uses
the ephemeral mapping interface to access the
image header describing the executable.
2.5 ptrace(2)
The ptrace(2) system call enables one pro-
cess to trace or debug another process. It in-
cludes the capability to read or write the memory
of the traced process. To read from or write to the
traced process’s memory, the kernel creates CPU-
private ephemeral mappings for the desired phys-
ical pages of the traced process. The kernel then
copies the data between the ephemerally mapped
pages and the buffer provided by the tracing pro-
cess. The kernel then frees the ephemeral map-
pings.
2.6 Vnode Pager
The vnode pager creates ephemeral mappings to carry out I/O. These ephemeral mappings are not CPU-private. They are used for paging to and from file systems with small block sizes.
3 Cost of Ephemeral Mappings
We focus on the hardware trends that motivate the
need for the sf_buf interface. In particular, we
measure the costs of local and remote TLB inval-
idations in modern processors. The act of inval-
idating an entry from a processor’s own TLB is
called a local TLB invalidation. A remote TLB
invalidation, also referred to as TLB shoot-down,
is the act of a processor initiating invalidation of
an entry from another processor’s TLB. When an
entry is invalidated from all TLBs in a multipro-
cessor environment, it is called a global TLB in-
validation.
We examine two microbenchmarks: one to
measure the cost of a local TLB invalidation and
another to measure the cost of a remote TLB in-
validation. We modify the kernel to add a custom
system call that implements these microbench-
marks. For local invalidation, the system call
invalidates a page mapping from the local TLB
100,000 times. For remote invalidation, IPIs are
sent to invalidate the TLB entry of the remote
CPUs. The remote invalidation is also repeated
100,000 times in the experiment. We perform this
experiment on the Pentium Xeon processor and
the Opteron processor. The Xeon is an i386 pro-
cessor while the Opteron is an amd64 processor.
The Xeon processor implements SMT and has
two virtual processors. The Opteron machine has
two physical processors. The Xeon operates at
2.4 GHz while the Opteron operates at 1.6 GHz.
For the Xeon the cost of a local TLB invali-
dation is around 500 CPU cycles when the page
table entry (PTE) resides in the data cache, and
about 1,000 cycles when it does not. On a Xeon
machine with a single physical processor but two
virtual processors, the CPU initiating a remote
invalidation has to wait for about 4,000 CPU cy-
cles until the remote TLB invalidation completes.
On a Xeon machine with two physical processors
and four virtual processors, that time increases to
about 13,500 CPU cycles.
For the Opteron a local TLB invalidation costs
around 95 CPU cycles when the PTE exists in the
data cache, and 320 cycles when it does not. Re-
mote TLB invalidations on an Opteron machine
with two physical processors cost about 2,030
CPU cycles.
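To compare the two processors directly, the cycle counts above can be converted to wall clock time. The following worked conversion is a sketch that assumes each processor ran at its stated clock rate (2.4 GHz for the Xeon, 1.6 GHz for the Opteron) for the whole measurement; the rounded times are derived here and are not reported measurements.

```latex
t = \frac{\text{cycles}}{f}

\begin{align*}
\text{Xeon, local, PTE cached:}      \quad & 500 / (2.4\,\text{GHz})    \approx 208\,\text{ns} \\
\text{Xeon, local, PTE not cached:}  \quad & 1000 / (2.4\,\text{GHz})   \approx 417\,\text{ns} \\
\text{Xeon, remote, 2 virtual CPUs:} \quad & 4000 / (2.4\,\text{GHz})   \approx 1.67\,\mu\text{s} \\
\text{Xeon, remote, 4 virtual CPUs:} \quad & 13500 / (2.4\,\text{GHz})  \approx 5.63\,\mu\text{s} \\
\text{Opteron, local, PTE cached:}     \quad & 95 / (1.6\,\text{GHz})   \approx 59\,\text{ns} \\
\text{Opteron, local, PTE not cached:} \quad & 320 / (1.6\,\text{GHz})  = 200\,\text{ns} \\
\text{Opteron, remote, 2 CPUs:}        \quad & 2030 / (1.6\,\text{GHz}) \approx 1.27\,\mu\text{s}
\end{align*}
```

Under this assumption a remote invalidation costs on the order of microseconds on both machines, roughly an order of magnitude more than a local invalidation.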
4 Ephemeral Mapping Management
We first present the ephemeral mapping interface.
Then, we describe two distinct implementations
on representative architectures, i386 and amd64,
emphasizing how each implementation is opti-
mized for its underlying architecture. This sec-
tion concludes with a brief characterization of the
implementations on the three other architectures
supported by FreeBSD 5.3.
4.1 Interface
The ephemeral mapping management interface consists of four functions that either return an ephemeral mapping object or require one as a parameter. These functions are sf_buf_alloc(), sf_buf_free(), sf_buf_kva(), and sf_buf_page(). Table 1 shows the full signature for each of these functions. The ephemeral mapping object is entirely opaque; none of its fields are public. For historical reasons, the ephemeral mapping object is called an sf_buf.
sf_buf_alloc() returns an sf_buf for the given physical page. A physical page is represented by an object called a vm_page. An implementation of sf_buf_alloc() may, at its discretion, return the same sf_buf to multiple callers if they are mapping the same physical page. In general, the advantages of shared sf_bufs are (1) that fewer virtual-to-physical mapping changes occur and (2) that less kernel virtual address space is used. The disadvantage is the added complexity of reference counting. The flags argument to sf_buf_alloc() is either 0 or one or more of the following values combined with bitwise OR:
• “private”, denoting that the mapping is for the private use of the calling thread;
• “no wait”, denoting that sf_buf_alloc() must not sleep if it is unable to allocate an sf_buf at the present time; instead, it may return NULL; by default, sf_buf_alloc() sleeps until an sf_buf becomes available for allocation;
• “interruptible”, denoting that the sleep by sf_buf_alloc() should be interruptible by a signal; if sf_buf_alloc()’s sleep is interrupted, it may return NULL.
If the “no wait” option is given, then the “interruptible” option has no effect. The “private” option enables some implementations, such as the one for i386, to reduce the cost of virtual-to-physical mapping changes. For example, the implementation may avoid remote TLB invalidation. Several uses of this option are described in Section 2 and evaluated in Section 6.
sf_buf_free() releases a reference to an sf_buf; the sf_buf is freed when its last reference is released.
sf_buf_kva() returns the kernel virtual address of the given sf_buf.
sf_buf_page() returns the physical page that is mapped by the given sf_buf.
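The calling pattern implied by these four functions can be sketched in user space. The example below is a minimal illustration, not the FreeBSD code: the function signatures follow Table 1, but their bodies are stand-in stubs, the flag name SFB_PRIVATE and the helper copy_from_pages() are hypothetical, and memcpy() takes the place of the kernel's copy to user space. It mirrors the reader side of the pipe transfer described in Section 2.1: map one page, copy, unmap.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PG_SIZE     4096
#define SFB_PRIVATE 0x1  /* hypothetical name for the "private" option */

typedef uintptr_t vm_offset_t;
struct vm_page { unsigned char data[PG_SIZE]; }; /* stub: a page holds its own bytes */
struct sf_buf  { struct vm_page *m; };           /* stub: opaque in the real interface */

/* User-space stand-ins with the Table 1 signatures. */
struct sf_buf *sf_buf_alloc(struct vm_page *page, int flags)
{
    struct sf_buf *sf = malloc(sizeof(*sf));

    (void)flags;                 /* the stub ignores the mapping options */
    if (sf != NULL)
        sf->m = page;
    return sf;
}

void sf_buf_free(struct sf_buf *mapping) { free(mapping); }

vm_offset_t sf_buf_kva(struct sf_buf *mapping)
{
    return (vm_offset_t)mapping->m->data;
}

struct vm_page *sf_buf_page(struct sf_buf *mapping) { return mapping->m; }

/*
 * Reader side of a transfer: for each page, create a private ephemeral
 * mapping, copy through its kernel virtual address, destroy the mapping.
 * Returns 0 on success, -1 if a mapping could not be obtained.
 */
int copy_from_pages(struct vm_page **pages, size_t npages, unsigned char *dst)
{
    for (size_t i = 0; i < npages; i++) {
        struct sf_buf *sf = sf_buf_alloc(pages[i], SFB_PRIVATE);

        if (sf == NULL)
            return -1;
        memcpy(dst + i * PG_SIZE, (const void *)sf_buf_kva(sf), PG_SIZE);
        sf_buf_free(sf);
    }
    return 0;
}
```

Because the caller only ever sees the opaque sf_buf and its kernel virtual address, the same loop would work unchanged over either of the implementations described in the following sections.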
4.2 i386 Implementation
Conventionally, the i386’s 32-bit virtual address
space is split into user and kernel spaces to avoid
the overhead of a context switch on entry to and
exit from the kernel. Commonly, the split is
3GB for the user space and 1GB for the kernel
space. In the past, when physical memories were
much smaller than the kernel space, a fraction of
the kernel space would be dedicated to a perma-
nent one-to-one, virtual-to-physical mapping for
the machine’s entire physical memory. Today,
however, i386 machines frequently have physical
memories in excess of their kernel space, making
such a direct mapping an impossibility.
To accommodate machines with physical
memories in excess of their kernel space, the i386
implementation allocates a configurable amount
of the kernel space and uses it to implement a
virtual-to-physical mapping cache that is indexed
by the physical page. In other words, an access to
this cache provides a physical page and receives a kernel virtual address for accessing the provided physical page. An access is termed a cache hit if the physical page has an existing virtual-to-physical mapping in the cache. An access is termed a cache miss if the physical page does not have a mapping in the cache and one must be created.
struct sf_buf *sf_buf_alloc(struct vm_page *page, int flags)
void sf_buf_free(struct sf_buf *mapping)
vm_offset_t sf_buf_kva(struct sf_buf *mapping)
struct vm_page *sf_buf_page(struct sf_buf *mapping)
Table 1: Ephemeral Mapping Interface
The implementation of the mapping cache consists of two structures containing sf_bufs: (1) a hash table of valid sf_bufs that is indexed by physical page and (2) an inactive list of unused sf_bufs that is maintained in least-recently-used order. An sf_buf can appear in both structures simultaneously. In other words, an unused sf_buf may still represent a valid mapping.
Figure 1 defines the i386 implementation of the sf_buf. It consists of six fields: an immutable virtual address, a pointer to a physical page, a reference count, a pointer used to implement a hash chain, a pointer used to implement the inactive list, and a CPU mask used for optimizing CPU-private mappings. An sf_buf represents a valid mapping if and only if the pointer to a physical page is valid, i.e., it is not NULL. An sf_buf is on the inactive list if and only if the reference count is zero.
The hash table and inactive list of sf_bufs are initialized during kernel initialization. The hash table is initially empty. The inactive list is filled as follows: a range of kernel virtual addresses is allocated by the ephemeral mapping module; for each virtual page in this range, an sf_buf is created, its virtual address is initialized, and it is inserted into the inactive list.
The first action by the i386 implementation of sf_buf_alloc() is to search the hash table for an sf_buf mapping the given physical page. If one is found, then the next two actions are determined by that sf_buf’s cpumask. First, if the executing processor does not appear in the cpumask, a local TLB invalidation is performed and the executing processor is added to the cpumask. Second, if the given flags do not include “private” and the cpumask does not include all processors, a remote TLB invalidation is issued to those processors missing from the cpumask and those processors are added to the cpumask. The final three actions by sf_buf_alloc() are (1) to remove the sf_buf from the inactive list if its reference count is zero, (2) to increment its reference count, and (3) to return the sf_buf.
If, however, an sf_buf mapping the given page is not found in the hash table by sf_buf_alloc(), the least recently used sf_buf is removed from the inactive list. If the inactive list is empty and the given flags include “no wait”, sf_buf_alloc() returns NULL. If the inactive list is empty and the given flags do not include “no wait”, sf_buf_alloc() sleeps until an inactive sf_buf becomes available. If sf_buf_alloc()’s sleep is interrupted because the given flags include “interruptible”, sf_buf_alloc() returns NULL.
Once sf_buf_alloc() acquires an inactive sf_buf, it performs the following five actions. First, if the inactive sf_buf represents a valid mapping, specifically, if it has a valid physical page pointer, then it must be removed from the hash table. Second, the sf_buf’s physical page pointer is assigned the given physical page, the sf_buf’s reference count is set to one, and the sf_buf is inserted into the hash table. Third, the page table entry for the sf_buf’s virtual address is changed to map the given physical page. Fourth, TLB invalidations are issued and the cpumask is set. Both of these operations depend on the state of the old page table entry’s accessed bit and the mapping options given. If the old page table entry’s accessed bit was clear, then the mapping cannot possibly be cached by any TLB. In this case, no TLB invalidations are issued and the cpumask is set to include all processors. If, however, the old page table entry’s accessed bit was set, then the mapping options determine the action taken. If the given flags include “private”, then a local TLB invalidation is performed and the cpumask is set to contain the executing processor. Otherwise, a global TLB invalidation is performed and the cpumask is set to include all processors. Finally, the sf_buf is returned.
The implementation of sf_buf_free() decrements the sf_buf’s reference count, inserting the sf_buf into the inactive list if the reference count becomes zero. When an sf_buf is inserted into the inactive list, a sleeping sf_buf_alloc() is awakened.
The implementations of sf_buf_kva() and sf_buf_page() return the corresponding field from the sf_buf.
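The hit and miss paths described in this section can be traced with a small user-space simulation. This is a sketch under several stated assumptions, not the FreeBSD implementation: all names (sim_buf, sim_alloc, sim_free, SFB_PRIVATE) are hypothetical; a linear scan stands in for the hash table and a timestamp stands in for the LRU-ordered inactive list; the caller is assumed to touch every mapping it allocates, so the simulated accessed bit is set on each miss; a global invalidation is counted as one local plus one remote invalidation issued; and the simulation never sleeps, so it always behaves as if “no wait” were given.

```c
#include <assert.h>
#include <stddef.h>

#define NBUF        2                 /* tiny cache so eviction is easy to trigger */
#define NCPU        2
#define ALL_CPUS    ((1u << NCPU) - 1)
#define SFB_PRIVATE 0x1               /* hypothetical name for the "private" option */

struct sim_buf {
    long          page;      /* mapped physical page number, -1 if none */
    int           ref_count; /* 0 means the buffer is on the inactive list */
    unsigned      cpumask;   /* CPUs on which the mapping is known to be valid */
    int           accessed;  /* stands in for the PTE accessed bit */
    unsigned long stamp;     /* order of entry onto the inactive list */
};

struct sim_buf bufs[NBUF];
unsigned long  tick;
int            local_inv, remote_inv; /* invalidations issued so far */

void sim_init(void)
{
    for (int i = 0; i < NBUF; i++)
        bufs[i] = (struct sim_buf){ .page = -1 };
    tick = 0;
    local_inv = remote_inv = 0;
}

struct sim_buf *sim_alloc(long page, int flags, int cpu)
{
    unsigned self = 1u << cpu;
    struct sim_buf *sf = NULL, *lru = NULL;

    for (int i = 0; i < NBUF; i++) {
        if (bufs[i].page == page)
            sf = &bufs[i];                         /* cache hit */
        else if (bufs[i].ref_count == 0 &&
                 (lru == NULL || bufs[i].stamp < lru->stamp))
            lru = &bufs[i];                        /* least recently used inactive */
    }
    if (sf != NULL) {
        if (!(sf->cpumask & self)) {               /* first use on this CPU */
            local_inv++;
            sf->cpumask |= self;
        }
        if (!(flags & SFB_PRIVATE) && sf->cpumask != ALL_CPUS) {
            remote_inv++;                          /* reach the missing CPUs */
            sf->cpumask = ALL_CPUS;
        }
        sf->ref_count++;
        return sf;
    }
    if (lru == NULL)
        return NULL;      /* the kernel would sleep here unless "no wait" is set */

    int was_accessed = lru->accessed;
    lru->page = page;                              /* cache miss: remap the buffer */
    lru->ref_count = 1;
    lru->accessed = 1;    /* assumption: the caller touches the new mapping */
    if (!was_accessed) {
        lru->cpumask = ALL_CPUS;                   /* no TLB can hold the old entry */
    } else if (flags & SFB_PRIVATE) {
        local_inv++;
        lru->cpumask = self;
    } else {
        local_inv++;                               /* global = local + remote */
        remote_inv++;
        lru->cpumask = ALL_CPUS;
    }
    return lru;
}

void sim_free(struct sim_buf *sf)
{
    assert(sf->ref_count > 0);
    if (--sf->ref_count == 0)
        sf->stamp = ++tick;   /* back on the inactive list; the mapping stays valid */
}
```

Allocating and freeing the same page repeatedly with this sketch issues no invalidations at all after the first allocation, which is the behavior the pipe experiment in Section 6.3 relies on.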
4.3 amd64 Implementation
The amd64 implementation of the ephemeral
mapping interface is trivial because of this archi-
tecture’s 64-bit virtual address space.
During kernel initialization, a permanent, one-
to-one, virtual-to-physical mapping is created
within the kernel’s virtual address space for the
machine’s entire physical memory using 2MB su-
perpages. Also, by design, the inverse of this
mapping is trivially computed, using a single
arithmetic operation. This mapping and its in-
verse are used to implement the ephemeral map-
ping interface. Because every physical page has
a permanent kernel virtual address, there is no
recurring virtual address allocation overhead as-
sociated with this implementation. Because this
mapping is trivially invertible, mapping a physi-
cal page back to its kernel virtual address is easy.
Because this mapping is permanent there is never
a TLB invalidation.
In this implementation, the sf_buf is simply an alias for the vm_page; in other words, an sf_buf pointer references a vm_page. Consequently, the implementations of sf_buf_alloc() and sf_buf_page() are nothing more than cast operations evaluated at compile-time: sf_buf_alloc() casts the given vm_page pointer to the returned sf_buf pointer; conversely, sf_buf_page() casts the given sf_buf pointer to the returned vm_page pointer. Furthermore, none of the mapping options given by the flags passed to sf_buf_alloc() requires any action by this implementation: it never performs a remote TLB invalidation, so distinct handling for “private” mappings serves no purpose; it never blocks, so “interruptible” and “no wait” mappings require no action. The implementation of sf_buf_free() is the empty function. The only function to have a non-trivial implementation is sf_buf_kva(): it casts an sf_buf pointer to a vm_page pointer, dereferences that pointer to obtain the vm_page’s physical address, and applies the inverse direct mapping to that physical address to obtain a kernel virtual address.
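The amd64 scheme is small enough to sketch in full. The following is an illustration under assumptions rather than the FreeBSD source: the direct-map base address DMAP_BASE, the phys_addr field name, and the helper dmap_to_phys() are hypothetical, and the direct map is modeled as a simple addition of a constant offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t vm_offset_t;
typedef uint64_t vm_paddr_t;

/* Hypothetical base of the permanent one-to-one mapping of physical memory. */
#define DMAP_BASE 0xfffff80000000000ULL

struct vm_page { vm_paddr_t phys_addr; };  /* only the field this sketch needs */
struct sf_buf;   /* never defined: an sf_buf pointer is a vm_page pointer in disguise */

struct sf_buf *sf_buf_alloc(struct vm_page *page, int flags)
{
    (void)flags;                     /* no option requires any action here */
    return (struct sf_buf *)page;    /* a cast, nothing more */
}

void sf_buf_free(struct sf_buf *mapping)
{
    (void)mapping;                   /* the mapping is permanent: nothing to undo */
}

struct vm_page *sf_buf_page(struct sf_buf *mapping)
{
    return (struct vm_page *)mapping;
}

/* The only function that does work: physical address to direct-map address. */
vm_offset_t sf_buf_kva(struct sf_buf *mapping)
{
    return DMAP_BASE + sf_buf_page(mapping)->phys_addr;
}

/* The opposite direction is also a single arithmetic operation. */
vm_paddr_t dmap_to_phys(vm_offset_t va)
{
    return va - DMAP_BASE;
}
```

No state is allocated, no page table entry is changed, and no TLB entry is invalidated on any path, which is why the mapping options have no effect in this implementation.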
4.4 Implementations For Other Architectures
The implementations for alpha and ia64 are iden-
tical to that of amd64. Although the sparc64 ar-
chitecture has a 64-bit virtual address space, its
virtually-indexed and virtually-tagged cache for
instructions and data complicates the implemen-
tation. If two or more virtual-to-physical map-
pings for the same physical page exist, then to
maintain cache coherence either the virtual ad-
dresses must have the same color, meaning they
conflict with each other in the cache, or else
caching must be disabled for all mappings to the
physical page [5]. To make the best of this, the
sparc64 implementation is, roughly speaking, a
hybrid of the i386 and amd64 implementations:
The permanent, one-to-one, virtual-to-physical
mapping is used when its color is compatible with
the color of the user-level address space map-
pings for the physical page. Otherwise, the per-
manent, one-to-one, virtual-to-physical mapping
cannot be used, so a virtual address of a compati-
ble color is allocated from a free list and managed
through a dictionary as in the i386 implementa-
tion.
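The color test at the heart of the sparc64 decision can be illustrated as follows. This is a sketch with assumed parameters, not the sparc64 code: the page size (8 KB), the number of colors (2), and the function names va_color() and can_use_direct_map() are all hypothetical choices made for the example. In a virtually indexed cache, the color of a virtual address is taken here to be the low-order bits of its virtual page number.

```c
#include <assert.h>
#include <stdint.h>

#define PG_SHIFT 13u  /* assumed 8 KB pages */
#define NCOLORS  2u   /* assumed: cache way size divided by page size */

/* Color of a virtual address: low-order bits of its virtual page number. */
unsigned va_color(uint64_t va)
{
    return (unsigned)(va >> PG_SHIFT) & (NCOLORS - 1);
}

/*
 * The permanent direct-map address may be used only if it has the same
 * color as the existing user-level mapping of the same physical page.
 */
int can_use_direct_map(uint64_t direct_va, uint64_t user_va)
{
    return va_color(direct_va) == va_color(user_va);
}
```

When the test fails, the text above describes falling back to an i386-style allocation of a virtual address with a compatible color.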
5 Using the sf_buf Interface
Of the places where the FreeBSD kernel utilizes ephemeral mappings, only three were non-trivially affected by the conversion from the original implementation to the sf_buf-based implementation: the conversion of pipes eliminated 42 lines of code; the conversion of zero-copy receive eliminated 306 lines of code; and the conversion of the vnode pager eliminated 18 lines of code. Most of the eliminated code was for the allocation of temporary virtual addresses. For example, to minimize the overhead of allocating temporary virtual addresses, each pipe maintained its own private cache of virtual addresses that were obtained from the kernel’s general-purpose allocator.
struct sf_buf {
LIST_ENTRY(sf_buf) list_entry; /* hash list */
TAILQ_ENTRY(sf_buf) free_entry; /* inactive list */
struct vm_page *m; /* currently mapped page */
vm_offset_t kva; /* virtual address of mapping */
int ref_count; /* usage of this mapping */
cpumask_t cpumask; /* cpus on which mapping is valid */
};
Figure 1: The i386 Ephemeral Mapping Object (sf_buf)
6 Performance Evaluation
This section presents the experimental platforms and the evaluation of the sf_buf interface on the pipe, memory disk, and network subsystems.
6.1 Experimental Platforms
The experimental setup consisted of five plat-
forms. The first platform is a Pentium Xeon
2.4 GHz machine, with hyper-threading enabled,
having 2 GB of memory. We refer to this plat-
form as Xeon-HTT. Due to hyper-threading the
Xeon-HTT has two virtual CPUs on a single
physical processor. The next three platforms are
identical to Xeon-HTT but have different proces-
sor configurations. The second platform runs a
uniprocessor kernel resulting in having a single
virtual and physical processor. Henceforth, we
refer to this platform as Xeon-UP. The third plat-
form has two physical CPUs, each with hyper-
threading disabled; we refer to this platform as
Xeon-MP. The fourth platform has two physical
CPUs with hyper-threading enabled, resulting in
having four virtual CPUs. We refer to this plat-
form as Xeon-MP-HTT. Unlike Xeon-UP, mul-
tiprocessor kernels run on the other Xeon plat-
forms. The Xeon has an i386 architecture. Our
fifth platform is a dual processor Opteron model
242 (1.6 GHz) with 3 GB of memory. Hence-
forth, we refer to this platform as Opteron-MP.
The Opteron has an amd64 architecture. All the
platforms run FreeBSD 5.3.
6.2 Executive Summary of Results
This section presents an executive summary of the experimental results. For the rest of this paper we refer to the kernel using the sf_buf interface as the sf_buf kernel and the kernel using the original techniques of ephemeral mapping management as the original kernel. Each experiment is performed once using the sf_buf kernel and once using the original kernel on each of the platforms. For all experiments on all platforms, the sf_buf kernel provides noticeable performance improvements.
For the Opteron-MP, the performance improvement is due to two factors: (1) complete elimination of virtual address allocation cost and (2) complete elimination of local and remote TLB invalidations. Under the original kernel, the machine-independent code always allocates a virtual address for creating an ephemeral mapping. The corresponding machine-independent code, under the sf_buf kernel, does not allocate a virtual address but makes a call to the machine-dependent code. The cost of virtual address allocation is avoided in the amd64 machine-dependent implementation of the sf_buf interface, which returns the permanent one-to-one physical-to-virtual address mappings. Second, since the ephemeral mappings returned by the sf_buf interface are permanent, all local and remote TLB invalidations for ephemeral mappings are avoided under the sf_buf kernel. The above explanation holds true for all experiments on the Opteron-MP and, hence, we do not repeat it in the rest of the paper.
For the various Xeon platforms, the performance improvements under the sf_buf kernel are due to (1) reduction of physical-to-virtual address allocation cost and (2) reduction of local and remote TLB invalidations. On the i386 architecture, the sf_buf interface maintains a cache of physical-to-virtual address mappings. While creating an ephemeral mapping under the sf_buf kernel, a cache hit results in reuse of a physical-to-virtual mapping. The associated cost is lower than the cost of allocating a new virtual address, which is done under the original kernel. Further, a cache hit avoids local and remote TLB invalidations that would have been required under the original kernel. Second, if an ephemeral mapping is declared CPU-private, it requires no remote TLB invalidations on a cache miss under the sf_buf kernel. For each of our experiments in the following sections we articulate the reasons for performance improvement under the sf_buf kernel. Unless stated otherwise, the sf_buf kernel on a Xeon machine uses a cache of 64K entries of physical-to-virtual address mappings, where each entry corresponds to a single physical page. This cache can map a maximum footprint of 256 MB. For some of the experiments we vary this cache size to study its effects.
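The 256 MB footprint follows directly from the cache size, as the worked arithmetic below shows. It assumes the 4 KB base page size of the i386 architecture, which the text does not state explicitly; the page counts for the two memory disk sizes used later in Section 6.4 are derived the same way.

```latex
\begin{align*}
\text{footprint} &= 64\text{K entries} \times 4\,\text{KB/page}
                  = 2^{16} \times 2^{12}\,\text{B} = 2^{28}\,\text{B} = 256\,\text{MB} \\
128\,\text{MB disk} &: \; 2^{27} / 2^{12} = 2^{15} = 32\text{K pages} \;\le\; 64\text{K entries} \\
512\,\text{MB disk} &: \; 2^{29} / 2^{12} = 2^{17} = 128\text{K pages} \;>\; 64\text{K entries}
\end{align*}
```

Under this assumption the smaller disk fits entirely in the mapping cache, while the larger disk needs twice as many pages as the cache has entries.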
The Xeon-UP platform outperforms all other Xeon platforms when the benchmark is single-threaded. Only the web server is multi-threaded, so only it can exploit simultaneous multi-threading (Xeon-HTT), multiple processors (Xeon-MP), or the combination of both (Xeon-MP-HTT). Moreover, Xeon-UP runs a uniprocessor kernel, which is not subject to the synchronization overhead incurred by multiprocessor kernels running on the other Xeon platforms.
6.3 Pipes
This experiment used the lmbench bw_pipe program [10] under the sf_buf kernel and the original kernel. This benchmark creates a Unix pipe between two processes, transfers 50 MB through the pipe in 64 KB chunks, and measures the bandwidth obtained. Figure 2 shows the result for this experiment on our test platforms. The sf_buf kernel achieved 67%, 129%, 168%, 113%, and 22% higher bandwidth than the original kernel for the Xeon-UP, Xeon-HTT, Xeon-MP, Xeon-MP-HTT, and Opteron-MP, respectively. For the Opteron-MP, the performance improvement is due to the reasons explained in Section 6.2. For the Xeon platforms, the small set of physical pages used by the benchmark is mapped repeatedly, resulting in a near 100% cache-hit rate and complete elimination of local and remote TLB invalidations, as shown in Figure 3. For all the experiments in this paper we count the number of remote TLB invalidations issued and not the number of remote TLB invalidations that actually happen on the remote processors.
Figure 2: Pipe bandwidth in MB/s (chart comparing the sf_buf and original kernels on Xeon-UP, Xeon-HTT, Xeon-MP, Xeon-MP-HTT, and Opteron-MP; y-axis: bandwidth in MB/s, 0 to 3000)
Figure 3: Local and remote TLB invalidations issued for the pipe experiment (chart comparing the sf_buf and original kernels, with local and remote counts for each of the five platforms; y-axis: TLB invalidations issued, logarithmic scale from 1 to 100000)
6.4 Memory Disks
This section presents two experiments, one using Disk Dump (dd) and another using the PostMark benchmark [9], to characterize the effect of the sf_buf interface on memory disks.
6.4.1 Disk Dump (dd)
This experiment uses dd to transfer a memory disk
to the null device using a block size of 64 KB and
observes the transfer bandwidth. We perform this
experiment for two sizes of memory disk: 128 MB
and 512 MB. The size of the sf_buf cache on the
Xeons is 64K entries, which can map a maximum of
256 MB, larger than the smaller memory disk but
smaller than the larger one. Under the sf_buf
kernel two configurations are used: one using the
private mapping option and the other eliminating
its use and thus creating default shared mappings.
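To see why it matters whether a disk fits in the cache, the following toy simulation replays sequential scans over a mapping cache. It assumes LRU replacement and scales the sizes down, with 64 entries standing in for the 64K-entry cache; the function name and the eviction counter are illustrative and are not part of the sf_buf implementation. Each eviction stands for a mapping that has to be replaced, which is the point at which a TLB invalidation may be needed.

```python
from collections import OrderedDict


def simulate_scans(cache_entries, footprint_pages, passes=2):
    """Replay sequential page scans against an LRU mapping cache.

    Returns (hits, misses, evictions).
    """
    cache = OrderedDict()  # page number -> placeholder, oldest entry first
    hits = misses = evictions = 0
    for _ in range(passes):
        for page in range(footprint_pages):
            if page in cache:
                hits += 1
                cache.move_to_end(page)  # mark as most recently used
            else:
                misses += 1
                if len(cache) >= cache_entries:
                    cache.popitem(last=False)  # evict the least recently used
                    evictions += 1
                cache[page] = True
    return hits, misses, evictions


# Footprint at half the cache size (like the 128 MB disk): no evictions.
small = simulate_scans(cache_entries=64, footprint_pages=32)
# Footprint at twice the cache size (like the 512 MB disk): every access misses.
large = simulate_scans(cache_entries=64, footprint_pages=128)
```

Under these assumptions the small footprint never evicts a mapping, while the large one misses on every access because a sequential scan always evicts a page before it is revisited.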
Figures 4 and 6 show the bandwidth obtained on
each of the platforms for the 128 MB disk and the
512 MB disk, respectively. For the Opteron-MP,
using the sf_buf interface increases the bandwidth
by about 37%. On the Xeons, the sf_buf interface
increases the bandwidth by up to 51%.
Using the private mapping option has no effect on
the Opteron-MP because all local and remote TLB
invalidations are avoided by the use of permanent,
one-to-one physical-to-virtual mappings. Since
there is no sf_buf cache on the Opteron-MP,
similar performance is obtained for both disk
sizes.
For the Xeons, the 128 MB disk can be mapped
entirely by the sf_buf cache, causing no local or
remote TLB invalidations even when the private
mapping option is not used. This is shown in
Figure 5. Hence, using the private mapping option
has a negligible effect for the 128 MB disk, as
shown in Figure 4. However, the 512 MB disk cannot
be mapped entirely by the sf_buf cache. The
sequential disk access of dd causes a cache-miss
rate of almost 100% under the sf_buf kernel. Using
the private mapping option reduces the cost of
these cache misses by eliminating remote TLB
invalidations and thus improves performance, as
shown in Figure 6. As shown in Figure 7, the use
of the private mapping option eliminates all
remote TLB invalidations on all Xeon platforms for
the 512 MB memory disk.
Figure 4: Disk dump bandwidth in MB/s for 128 MB memory disk
[Bar chart "Disk Dump: 128 MB Memory Disk": bandwidth (MB/s) on each platform; series: sf_buf: private mapping, sf_buf: default (shared) mapping, original.]
Figure 5: Local and remote TLB invalidations issued for the disk dump experiment on 128 MB memory disk
[Bar chart "Disk Dump: 128 MB Memory Disk": local and remote TLB invalidations issued (log scale) on each platform; series: sf_buf: private mapping, sf_buf: default (shared) mapping, original.]
Figure 6: Disk dump bandwidth in MB/s for 512 MB memory disk
[Bar chart "Disk Dump: 512 MB Memory Disk": bandwidth (MB/s) on each platform; series: sf_buf: private mapping, sf_buf: default (shared) mapping, original.]
Figure 7: Local and remote TLB invalidations issued for the disk dump experiment on 512 MB memory disk
[Bar chart "Disk Dump: 512 MB Memory Disk": local and remote TLB invalidations issued (log scale) on each platform; series: sf_buf: private mapping, sf_buf: default (shared) mapping, original.]
6.4.2 PostMark
PostMark is a file system benchmark simulating an
electronic mail server workload [9]. It creates a
pool of continuously changing files and measures
the transaction rate, where a transaction creates,
deletes, reads from, or appends to a file. We used
the benchmark’s default parameters, i.e., a block
size of 512 bytes and file sizes ranging from 500
bytes up to 9.77 KB.
We used a 512 MB memory disk for the Post-
Mark benchmark. We used the three prescribed
configurations of PostMark. The first configura-
tion has 1,000 initial files and performs 50,000
transactions. The second has 20,000 files and per-
forms 50,000 transactions. The third configura-
tion has 20,000 initial files and performs 100,000
transactions.
PostMark reports the number of transactions
performed per second (TPS), and it measures the
read and write bandwidth obtained from the sys-
tem. Figure 8 shows the TPS obtained on each
of our platforms for the largest configuration of
PostMark. Corresponding results for read and
write bandwidths are shown in Figure 9. The re-
sults for the two other configurations of PostMark
exhibit similar trends and, hence, are not shown
in the paper.
For the Opteron-MP, using the sf_buf interface
increased the TPS by about 11% to about 27%. Read
and write bandwidth increased by about 11% to
about 17%.
For the Xeon platforms, using the sf_buf interface
increased the TPS by about 4% to about 13%. Read
and write bandwidth went up by about 4% to 15%.
The maximum footprint of the PostMark benchmark is
about 150 MB under the three configurations used
and is completely mapped by the sf_buf cache on
the Xeons. We did not eliminate the use of the
private mapping option on the Xeons for the sf_buf
kernel as there were no remote TLB invalidations
under these workloads. The performance improvement
on the Xeons is thus due to the elimination of
local and remote TLB invalidations as shown in
Figure 10.
Figure 8: Transactions per second for PostMark with 20,000 files and 100,000 transactions
[Bar chart "PostMark: 20,000 files/100,000 transactions": transactions per second for the sf_buf and original kernels on Xeon-UP, Xeon-HTT, Xeon-MP, Xeon-MP-HTT, and Opteron-MP.]
Figure 9: Read/Write Throughput (in MB/s) for PostMark with 20,000 files and 100,000 transactions
[Bar chart "PostMark: 20,000 files/100,000 transactions": read and write throughput (MB/s) for the sf_buf and original kernels on each platform.]
Figure 10: Local and remote TLB invalidations issued for PostMark with 20,000 files and 100,000 transactions
[Bar chart "PostMark: 20,000 files/100,000 transactions": local and remote TLB invalidations issued (log scale) for the sf_buf and original kernels on each platform.]
6.5 Networking Subsystem
This section uses two sets of experiments, one
using netperf and another using a web server, to
examine the effects of the sf_buf interface on the
networking subsystem.
6.5.1 Netperf
This experiment examines the throughput achieved
between a netperf client and server on the same
machine. TCP socket send and receive buffer sizes
are set to 64 KB for this experiment. Sockets are
configured to use zero copy send. We perform two
sets of experiments on each platform, one using
the default Maximum Transmission Unit (MTU) size
of 1500 bytes and another using a large MTU size
of 16K bytes.
Figures 11 and 12 show the network throughput
obtained under the sf_buf kernel and the original
kernel on each of our platforms. The larger MTU
size yields higher throughput because less CPU
time is spent doing TCP segmentation. The
throughput improvements from the sf_buf interface
on all platforms range from about 4% to about 34%.
Using the larger MTU size makes the cost of
creating ephemeral mappings a bigger factor in
network throughput. Hence, the performance
improvement from the sf_buf interface is higher in
this scenario.
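The segmentation argument can be made concrete with a little arithmetic. The sketch below counts how many TCP segments a 64 KB socket buffer is split into at each MTU. It assumes 20-byte IPv4 and 20-byte TCP headers with no options, and it reads "16K bytes" as 16384; both are our assumptions, as is the function name.

```python
import math

IP_TCP_HEADER_BYTES = 40  # assumption: 20-byte IPv4 + 20-byte TCP headers, no options


def segments_per_buffer(buffer_bytes, mtu_bytes, header_bytes=IP_TCP_HEADER_BYTES):
    """Number of TCP segments needed to carry buffer_bytes at the given MTU."""
    mss = mtu_bytes - header_bytes  # payload bytes per segment
    return math.ceil(buffer_bytes / mss)


small_mtu_segments = segments_per_buffer(64 * 1024, 1500)       # 45 segments
large_mtu_segments = segments_per_buffer(64 * 1024, 16 * 1024)  # 5 segments
```

Under these assumptions a 64 KB buffer needs 45 segments at the 1500-byte MTU but only 5 at the large MTU, so per-segment work shrinks and the cost of mapping pages accounts for a larger share of the remaining CPU time.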
The reduction in local and remote TLB
invalidations explains the above performance
improvement, as shown in Figures 13 and 14. The
sf_buf interface greatly reduces TLB invalidations
on the Xeons and completely eliminates them on the
Opteron-MP.
Figure 11: Netperf throughput in Mbits/s for large MTU
[Bar chart "Netperf: Large MTU": bandwidth (Mbits/s) for the sf_buf and original kernels on Xeon-UP, Xeon-HTT, Xeon-MP, Xeon-MP-HTT, and Opteron-MP.]
Figure 12: Netperf throughput in Mbits/s for small MTU
[Bar chart "Netperf: Small MTU": bandwidth (Mbits/s) for the sf_buf and original kernels on each platform.]
Figure 13: Local and remote TLB invalidations issued for Netperf experiments with large MTU
[Bar chart "Netperf: Large MTU": local and remote TLB invalidations issued (log scale) for the sf_buf and original kernels on each platform.]
Figure 14: Local and remote TLB invalidations issued for Netperf experiments with small MTU
[Bar chart "Netperf: Small MTU": local and remote TLB invalidations issued (log scale) for the sf_buf and original kernels on each platform.]
6.5.2 Web Server
We used Apache 2.0.50 as the web server on each of
our platforms. We ran an emulation of 30
concurrent clients on a separate machine to
generate a workload on the server. The server and
client machines were connected via a Gigabit
Ethernet link. Apache was configured to use
sendfile(2). For this experiment we measure the
throughput obtained from the server and count the
number of local and remote TLB invalidations on
the server machine. The web server was subjected
to real workloads taken from web traces from NASA
and Rice University’s Computer Science Department
that have been used in published literature
[7, 15]. For the rest of this paper we refer to
these workloads as the NASA workload and the Rice
workload, respectively. These workloads have
footprints of 258.7 MB and 1.1 GB, respectively.
Figures 15 and 16 show the throughput for all the
platforms using both the sf_buf kernel and the
original kernel for the NASA and the Rice
workloads, respectively. For the Opteron-MP, the
sf_buf kernel improves performance by about 6% for
the NASA workload and about 14% for the Rice
workload. The reasons behind these performance
improvements are the same as described earlier in
Section 6.2.
For the Xeons, using the sf_buf kernel results in
a performance improvement of up to about 7%. This
performance improvement is a result of the
reduction in local and remote TLB invalidations,
as shown in Figures 17 and 18.
For the above experiments the Xeon platforms
employed an sf_buf cache of 64K entries. To study
the effect of the size of this cache on web server
throughput, we reduced it to 6K entries. A smaller
cache causes more misses, thus increasing the
number of TLB invalidations. The implementation of
the sf_buf interface on the i386 architecture
employs an optimization that avoids a TLB
invalidation if the access bit of the page table
entry (PTE) is clear. With TCP checksum offloading
enabled, the CPU does not touch the pages to be
sent; as a result, the corresponding PTEs have
their access bits clear, and on a cache miss the
TLB invalidation is avoided. With TCP checksum
offloading disabled, the CPU touches the pages,
which sets the access bits in the corresponding
PTEs and causes TLB invalidations on cache misses.
So for each cache size we ran two experiments, one
with TCP checksum offloading enabled and the other
with it disabled.
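The access-bit optimization can be sketched as follows. The flag values are the standard x86 page table entry bits; the list-of-integers page table and the function remap_slot are hypothetical stand-ins and do not reproduce the FreeBSD code. The idea is that a mapping whose access bit is still clear has not been used for a translation, so replacing it requires no TLB invalidation.

```python
PTE_PRESENT = 0x001   # x86 PTE bit 0: the mapping is valid
PTE_RW = 0x002        # x86 PTE bit 1: the page is writable
PTE_ACCESSED = 0x020  # x86 PTE bit 5: set by the CPU when the mapping is used
PAGE_SHIFT = 12       # 4 KB pages


def remap_slot(ptes, slot, new_frame):
    """Point a cache slot at new_frame.

    Returns True if the old translation may be cached in a TLB and must be
    invalidated, False if the invalidation can be skipped.
    """
    old = ptes[slot]
    must_invalidate = bool(old & PTE_PRESENT) and bool(old & PTE_ACCESSED)
    # Install the new mapping with the access bit clear.
    ptes[slot] = (new_frame << PAGE_SHIFT) | PTE_PRESENT | PTE_RW
    return must_invalidate
```

In this model, a page that only the network adapter read by DMA (checksum offloading enabled) leaves the access bit clear, so remap_slot returns False; a page the CPU read to compute a checksum has the bit set, so it returns True.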
Figure 19 shows the throughput for the NASA
workload on the Xeon-MP for the above experiment.
With the larger cache size, slightly higher
throughput is obtained because of the greater
reduction in local and remote TLB invalidations,
as shown in Figure 20. Also, enabling checksum
offloading reduces local and remote TLB
invalidations further because of the access bit
optimization. Reducing the cache size from 64K to
6K entries does not significantly reduce
throughput because the hit rate of the ephemeral
mapping cache drops only from nearly 100% to about
82%. This lower cache hit rate is still high
enough to avoid any noticeable performance
degradation.
Figure 15: Throughput (in Mbits/s) for the NASA workload
[Bar chart "NASA Workload": throughput (Mbits/s) for the sf_buf and original kernels on Xeon-UP, Xeon-HTT, Xeon-MP, Xeon-MP-HTT, and Opteron-MP.]
Figure 16: Throughput (in Mbits/s) for the Rice workload
[Bar chart "Rice Workload": throughput (Mbits/s) for the sf_buf and original kernels on each platform.]
7 Related Work
Chu describes a per-process mapping cache for
zero-copy TCP in Solaris 2.4 [6]. Since the cache
is not shared among all processes in the system,
its benefits are limited. For example, a
multi-process web server using FreeBSD’s
sendfile(2), like Apache 1.3, will not get the
maximum benefit from the cache if more than one
process transmits the same file. In this case the
file pages are the same for all processes, so a
common cache would serve best.
Figure 17: Local and remote TLB invalidations issued for the NASA workload
[Bar chart "NASA Workload": local and remote TLB invalidations issued (log scale) for the sf_buf and original kernels on Xeon-UP, Xeon-HTT, Xeon-MP, Xeon-MP-HTT, and Opteron-MP.]
Figure 18: Local and remote TLB invalidations issued for the Rice workload
[Bar chart "Rice Workload": local and remote TLB invalidations issued (log scale) for the sf_buf and original kernels on each platform.]
Figure 19: Throughput (in Mbits/s) for the NASA workload on Xeon-MP with the sf_buf cache having 64K or 6K entries and the original kernel and with TCP checksum offloading enabled or disabled
[Bar chart "NASA Workload": throughput (Mbits/s) for 64K cache entries, 6K cache entries, and no cache (original); series: enable checksum offloading, disable checksum offloading.]
Figure 20: Local and remote TLB invalidations issued for the NASA workload on Xeon-MP with the sf_buf cache having 64K or 6K entries and the original kernel and with TCP checksum offloading enabled or disabled
[Bar chart "NASA Workload": local and remote TLB invalidations issued (log scale) for 64K cache entries, 6K cache entries, and no cache (original); series: enable checksum offloading, disable checksum offloading.]
Ruan and Pai extend the mapping cache to be shared
among all processes [13]. This cache is
sendfile(2)-specific. Our work extends the
benefits of the shared mapping cache to additional
kernel subsystems, providing substantial
modularity and performance improvements. We
provide a uniform API for processor-specific
implementations. We also study the effect of the
cache on multiprocessor systems.
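The difference between a per-process and a shared mapping cache can be illustrated with a toy count of the mappings created when several server processes send the same file. This is only a sketch under simplifying assumptions: each cache is unbounded, every process sends every page once, and the function name is ours.

```python
def mappings_created(num_processes, file_pages, shared):
    """Count new mappings when each process sends the same file once."""
    # One cache for everyone, or one private cache per process.
    caches = [dict()] if shared else [dict() for _ in range(num_processes)]
    created = 0
    for pid in range(num_processes):
        cache = caches[0] if shared else caches[pid]
        for page in range(file_pages):
            if page not in cache:
                cache[page] = True  # stand-in for creating a kernel mapping
                created += 1
    return created


per_process = mappings_created(num_processes=4, file_pages=10, shared=False)
shared = mappings_created(num_processes=4, file_pages=10, shared=True)
```

With four processes and a ten-page file, the per-process caches create 40 mappings while the shared cache creates 10, since every process after the first finds the pages already mapped.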
Bonwick and Adams [3] describe Vmem, a generic
resource allocator, where kernel virtual addresses
are viewed as a type of resource. However, the
goals of our work are different from those of
Vmem. In the context of the kernel virtual address
resource, Vmem’s goal is to achieve fast
allocation and low fragmentation. However, it
makes no guarantee that the allocated kernel
virtual addresses are "safe", i.e., that they
require no TLB invalidations. In contrast, the
sf_buf interface returns kernel virtual addresses
that are completely safe in most cases, requiring
no TLB invalidations. Additionally, the cost of
using the sf_buf interface is small.
Bala et al. [1] design a software cache of TLB
entries to provide fast access to entries on a TLB
miss. This cache mitigates the costs of a TLB
miss. The goal of our work is entirely different:
to maintain a cache of ephemeral mappings. Because
our work reuses address mappings, it can augment
such a cache of TLB entries. This is because the
mappings corresponding to the physical pages with
entries in the sf_buf interface do not need to
change in the software TLB cache. In other words,
the effectiveness of such a cache (of TLB entries)
can be increased with the sf_buf interface.
Thekkath and Levy [14] explore techniques for
achieving low-latency communication and implement
a low-latency RPC system. They point out
re-mapping as one of the sources of the cost of
communication. On multiprocessors, this cost is
increased due to TLB coherency operations [2]. The
sf_buf interface obviates the need for re-mapping
and hence lowers the cost of communication.
8 Conclusions
Modern operating systems create ephemeral
virtual-to-physical mappings for a variety of
purposes, ranging from the implementation of
interprocess communication to the implementation
of process tracing and debugging. The hardware
costs of creating these ephemeral mappings are
generally increasing with succeeding generations
of processors. Moreover, if an ephemeral mapping
is to be shared among multiple processors, those
processors must act to maintain the consistency of
their TLBs. In this paper we have provided a
software solution to alleviate this problem.
In this paper we have devised a new abstraction to
be used in the operating system kernel, the
ephemeral mapping interface. This interface
allocates ephemeral kernel virtual addresses and
virtual-to-physical address mappings. The
interface is low cost, and greatly reduces the
number of costly interprocessor interrupts. We
call our ephemeral mapping interface the sf_buf
interface. We have described its implementation in
the FreeBSD-5.3 kernel on two representative
architectures, the i386 and the amd64, and
outlined its implementation for the three other
architectures supported by FreeBSD. Many kernel
subsystems (pipes, memory disks, sockets,
execve(), ptrace(), and the vnode pager) benefit
from using the sf_buf interface. The sf_buf
interface also centralizes redundant code from
each of these subsystems, reducing their overall
size.
We have evaluated the sf_buf interface for the
pipe, memory disk, and networking subsystems. For
the bw_pipe program of the lmbench benchmark [10],
the bandwidth improved by up to about 168% on one
of our platforms. For memory disks, a disk dump
program resulted in about 37% to 51% improvement
in bandwidth. For the PostMark benchmark [9] on a
memory disk we demonstrate up to a 27% increase in
transaction throughput. The sf_buf interface
increases netperf throughput by up to 34%. We also
demonstrate tangible benefits for a web server
workload with the sf_buf interface. In all these
cases, the ephemeral mapping interface greatly
reduced the number of TLB invalidations or
eliminated them completely.
Acknowledgments
We wish to acknowledge Matthew Dillon of
the DragonFly BSD Project, Tor Egge of the
FreeBSD Project, and David Andersen, our shep-
herd. Matthew reviewed our work and incorpo-
rated it into DragonFly BSD. He also performed
extensive benchmarking and developed the im-
plementation for CPU-private mappings that is
used on the i386. Tor reviewed parts of the i386
implementation for FreeBSD. Last but not least,
David was an enthusiastic and helpful participant
in improving the presentation of our work.
References
[1] K. Bala, M. F. Kaashoek, and W. E. Weihl. Soft-
ware Prefetching and Caching for Translation Looka-
side Buffers. In First Symposium on Operating Systems
Design and Implementation, pages 243–253, Monterey,
California, Nov. 1994.
[2] D. L. Black, R. F. Rashid, D. B. Golub, C. R. Hill,
and R. V. Baron. Translation lookaside buffer consis-
tency: A software approach. In Proceedings of the Third
International Conference on Architectural Support for
Programming Languages and Operating Systems, pages
113–122, Dec. 1989.
[3] J. Bonwick and J. Adams. Magazines and Vmem:
Extending the Slab Allocator to Many CPUs and Arbitrary
Resources. In USENIX Annual Technical Conference,
Boston, Massachusetts, June 2001.
[4] J. C. Brustoloni and P. Steenkiste. Effects of buffering
semantics on i/o performance. In Operating Systems
Design and Implementation, pages 277–291, 1996.
[5] R. Cheng. Virtual address cache in Unix. In Proceed-
ings of the 1987 Summer USENIX Conference, pages
217–224, 1987.
[6] H.-K. J. Chu. Zero-copy TCP in Solaris. In USENIX
1996 Annual Technical Conference, pages 253–264,
Jan. 1996.
[7] K. Elmeleegy, A. Chanda, A. L. Cox, and
W. Zwaenepoel. Lazy asynchronous i/o for event-
driven servers. In USENIX 2004 Annual Technical
Conference, June 2004.
[8] A. Gallatin, J. Chase, and K. Yocum. Trapeze/IP:
TCP/IP at near-gigabit speeds. In Proceedings of the
FREENIX Track: 1999 USENIX Annual Technical Con-
ference, June 1999.
[9] J. Katcher. Postmark: A new file system benchmark. At
http://www.netapp.com/tech_library/3022.html.
[10] L. McVoy and C. Staelin. Lmbench - tools for
performance analysis. At http://www.bitmover.com/lmbench/,
1996.
[11] R. Rashid, A. Tevanian, M. Young, D. Golub, R. Baron,
D. Black, W. Bolosky, and J. Chew. Machine-
independent virtual memory management for paged
uniprocessor and multiprocessor architectures. In Pro-
ceedings of the Second International Conference on Ar-
chitectural Support for Programming Languages and
Operating Systems, pages 31–39, 1987.
[12] B. S. Rosenburg. Low-synchronization translation
lookaside buffer consistency in large-scale shared-
memory multiprocessors. In Proceedings of the 12th
ACM Symposium on Operating System Principles,
pages 137–146, Litchfield Park, AZ, Dec. 1989.
[13] Y. Ruan and V. Pai. Making the Box transparent: Sys-
tem call performance as a first-class result. In USENIX
2004 Annual Technical Conference, pages 1–14, June
2004.
[14] C. A. Thekkath and H. M. Levy. Limits to Low-Latency
Communication on High-Speed Networks. ACM Trans-
actions on Computer Systems, 11(2):179–203, May
1993.
[15] H.-Y. Kim, V. S. Pai, and S. Rixner. Increasing
Web Server Throughput with Network Interface Data
Caching. In Architectural Support for Programming
Languages and Operating Systems, pages 239–250, San
Jose, California, Oct. 2002.
