Flexible hardware acceleration for instruction-grain program monitoring by Chen, Shimin et al.
Flexible Hardware Acceleration for Instruction-Grain Program Monitoring
Shimin Chen1, Michael Kozuch1, Theodoros Strigkos2, Babak Falsafi3, Phillip B. Gibbons1,
Todd C. Mowry1,2, Vijaya Ramachandran4, Olatunji Ruwase2, Michael Ryan1, Evangelos Vlachos2
1Intel Research Pittsburgh 2Carnegie Mellon University
3École Polytechnique Fédérale de Lausanne 4University of Texas at Austin
Abstract
Instruction-grain program monitoring tools, which check
and analyze executing programs at the granularity of indi-
vidual instructions, are invaluable for quickly detecting bugs
and security attacks and then limiting their damage (via con-
tainment and/or recovery). Unfortunately, their fine-grain
nature implies very high monitoring overheads for software-
only tools, which are typically based on dynamic binary in-
strumentation. Previous hardware proposals either focus on
mechanisms that target specific bugs or address only the
cost of binary instrumentation. In this paper, we propose
a flexible hardware solution for accelerating a wide range
of instruction-grain monitoring tools. By examining a num-
ber of diverse tools (for memory checking, security track-
ing, and data race detection), we identify three significant
common sources of overheads and then propose three novel
hardware techniques for addressing these overheads: Inheri-
tance Tracking, Idempotent Filters, and Metadata-TLBs. To-
gether, these constitute a general-purpose hardware acceler-
ation framework. Experimental results show our framework
reduces overheads by 2–3X over the previous state-of-the-art,
while supporting the needed flexibility.
1 Introduction
Systems designers have traditionally focused on maximiz-
ing performance, and more recently on minimizing power.
From a user’s perspective, however, both of these issues are
of secondary concern when the software is misbehaving. In
other words, if the software is broken, it is little consolation
that it is misbehaving quickly or power-efficiently. As sys-
tems have become faster over the years, the corresponding in-
creases in both software and hardware complexity have raised
concerns that applications and systems are becoming increas-
ingly error-prone. While writing bug-free code has always
been difficult, recent studies suggest that bug rates are get-
ting worse over time as software complexity increases [4], de-
spite the software industry’s pre-release testing efforts. In a
networked world, even obscure bugs—benign under normal
conditions—can leave a system vulnerable to security attacks
after the code has been released [31].
There is a long history of developing tools to help diagnose
and fix software problems. These tools can be invoked at var-
ious phases of the software development and execution cycle:
e.g., static tools attempt to identify problems before the pro-
gram executes [2, 9, 11], post-mortem tools attempt to recon-
struct what went wrong after the application crashes [15, 39,
40], and dynamic tools—which we call lifeguards—monitor
an application as it executes to diagnose and hopefully ei-
ther contain or fix problems [1, 10, 14, 20, 27]. These three
classes of tools are generally complementary. Moreover, the
granularity of software execution events that lifeguards care
about forms a spectrum, from system-call-level [12, 24] to
instruction-level [18, 20, 21, 25, 27]. Instruction-grain life-
guards, which perform invariant checking and analysis at the
granularity of individual instructions, have two unique ad-
vantages: (i) highly-detailed information regarding dynamic
events, such as memory references, address computations, and
information flow, is available at the instruction level, and (ii)
software errors may be captured earlier and more accurately.
The former enables a wide range of powerful lifeguards (e.g.,
detecting memory-access violations [18, 20], data races [27],
and security exploits [21, 25]), while the latter provides a
better starting point for damage containment, on-site diagno-
sis [32], and hopefully on-the-fly fixes and recovery [26].
Instruction-grain lifeguards are too slow. Unfortunately,
existing approaches to instruction-grain lifeguards are very
slow, because lifeguard functionality is invoked on (nearly)
every instruction. Software-only solutions (mainly based on
dynamic binary instrumentation, DBI [1, 14, 20]) typically
slow down the monitored program by 10–100 times [17, 20,
33], limiting their utility to program debugging back at the
development sites. To address this overhead, a number of
hardware optimizations have been proposed, each tailored to
a specific class of lifeguards: e.g., memory-access monitor-
ing [28, 35, 42], data-race detection [41], and information-
flow tracking with simple metadata [7, 8, 30, 34]. Unfortu-
nately, these mechanisms are useful only for the narrow class
of lifeguards that they support. Other studies [3, 5] have
proposed more general-purpose hardware solutions. For ex-
ample, our previous study proposed Log-Based Architectures
(LBA) [3], which capture a log from a monitored program
and ship it to another on-chip core that executes the moni-
toring functionality. An earlier study [5] proposed Dynamic
Instruction Stream Editing (DISE), which performs pattern-
matching-based dynamic rewriting of a processor’s instruc-
tion stream to insert calls to monitoring code. Both DISE and
our earlier work on LBA focus only on reducing the costs of
DBI, such as reducing the resource competition between mon-
itored programs and lifeguards. As a result, the instruction-
grain monitoring overhead, although significantly reduced, is
still large (e.g., 3–5X slowdowns [3]).
Fast and flexible hardware acceleration. In this paper, we
study hardware acceleration for speeding up a diverse range
of instruction-grain lifeguards. Figure 1 depicts the general
setting we consider, which reflects existing general-purpose
lifeguard platforms such as DISE and LBA. On the left, an
event-capture runtime observes the instructions executed by
the monitored program, and creates a corresponding stream
of event records. The dashed box on the left shows examples
of rare and frequent events of interest. On the right, a lifeguard
tracks the state of the application (e.g., which memory regions
have been allocated) by maintaining application metadata. To
consume the event stream, the lifeguard issues event handlers
that may update its metadata, use the metadata to check the
event against some invariant, or both.
[Figure 1 diagram. Left: the application runs under an event-capture runtime (e.g., DISE, LBA), which emits an event stream. Example events, labeled R (rare), C (metadata check), or U (metadata update): malloc/free (R), lock/unlock (R), syscalls, etc. (R), address computation (C), memory access (C), data movement (U), computation (U). Right: the lifeguard's handlers (rare events, update metadata, check metadata) and its metadata, together with the IT, IF, and M-TLB components.]
Figure 1. Our framework targets metadata updates (via Inheritance Tracking (IT)), metadata
checks (via Idempotent Filters (IF)) and metadata mapping (via Metadata-TLB (M-TLB)).
Lifeguard                          IT     IF     M-TLB
ADDRCHECK                          -      yes    yes
MEMCHECK                           yes    yes    yes
TAINTCHECK                         yes    -      yes
TAINTCHECK w/ detailed tracking    yes    -      yes
LOCKSET                            -      yes    yes
Figure 2. Applying our hardware acceleration framework to the lifeguards in this study.
See Table 1 for lifeguard descriptions.
This paper presents a hardware acceleration framework
for instruction-grain lifeguards. By analyzing a number of
lifeguards representing diverse monitoring requirements, we
identify three main sources of overheads for instruction-grain
lifeguards. Our framework provides novel techniques for ad-
dressing each of these sources:
(i) Propagation tracking. One key source of lifeguard over-
head is propagation tracking (or dynamic information flow
tracking [30]). Given an executed application instruction I,
the metadata of I’s destination location is computed as a com-
bination of the metadata of all I’s source locations (e.g., if one
of I’s sources contains “suspect” data, then I’s destination
now contains “suspect” data). Previous work on propagation
tracking [7, 8, 21, 30] has considered only such generic prop-
agation tracking, which is challenging to optimize because of
the strong dependence among application instructions. Our
key insights are that generic propagation tracking is more
general than needed for important invariants, that a more re-
stricted version suffices, and moreover, that such a version can
be supported with significantly lower overhead using a novel
technique called Inheritance Tracking (IT).
(ii) (Redundant) checking. For many lifeguards, checking
is performed on every memory reference and/or on every
address computation. To reduce this second key source of
overhead, our idea is to take advantage of the fact that life-
guards do not need to check “redundant” events, where the
corresponding metadata are not changed. Hence, we design
hardware Idempotent Filters (IF) to identify (and discard) re-
dundant events without the overhead of accessing the meta-
data. Previously, redundant event filtering has been applied to
address-checking lifeguards using other mechanisms [28], but
we show how our technique can be applied more generally.
(iii) Metadata mapping. The third key source of overhead is
the mapping from an application address to the corresponding
metadata location(s) that is performed in almost every life-
guard event handler. We propose Metadata-TLB (M-TLB), a
novel mechanism that speeds up such metadata address trans-
lation. The M-TLB resides in user space. It can be configured
at runtime for a variety of metadata sizes and organizations.
In this way, our framework achieves the performance of
lifeguard-specific hardware approaches, while supporting a
wide range of lifeguards, as illustrated in Figure 2.
Contributions. This paper makes the following main con-
tributions. First, we analyze a number of diverse life-
guards to understand the requirements and commonalities of
instruction-grain lifeguards, identifying three main sources of
lifeguard overheads (Section 2). We use LBA as a case study
lifeguard platform (Section 3). Second, we propose a hard-
ware acceleration framework highlighted by three novel tech-
niques for addressing these overheads: Inheritance Tracking
(IT) for accelerating propagation-style metadata updates (Sec-
tion 4), Idempotent Filters (IF) for identifying and discard-
ing redundant checking events that will not alter metadata
state (Section 5), and Metadata-TLBs (M-TLB) for acceler-
ating the translation from application address to metadata ad-
dress (Section 6). Finally, we implemented our techniques
within LBA and evaluated them through simulation studies
(Section 7). Our experiments with CPU-intensive benchmark
programs (the most challenging case, because of their high in-
struction rate) show a 2–3X reduction in LBA’s overheads for
all of the studied lifeguards, down to overall slowdowns of
2–51% for all but one of the lifeguards.
2 Requirements for Efficiently Supporting
Instruction-Grain Lifeguards
Our goal is to provide hardware accelerators for a wide
range of instruction-grain lifeguards, addressing the major
performance bottlenecks in such lifeguards while being suf-
ficiently flexible to support their diverse requirements. This
section analyzes a range of lifeguards in order to understand
their commonalities and differences. We identify three signif-
icant, common sources of overheads for lifeguards, as well as
five important axes on which lifeguards differ even in these
common sources. Finally, we discuss related work in the con-
text of these commonalities and differences.
2.1 Understanding Instruction-Grain Lifeguards
We focus on four diverse instruction-grain lifeguards that
detect memory violations, security exploits, and data races, as
detailed in Table 1. In the table, we describe for each life-
guard its purpose, the basic idea, the metadata maintained,
and when/how metadata updates and checks are performed.
To simplify the descriptions, we ignore checks that consume
only a negligible fraction of the lifeguard’s time.
From Figure 1 and Table 1, we see that execution events
in the monitored program correspond to specific lifeguard
Table 1. Example instruction-grain lifeguards.
ADDRCHECK [17]
Purpose: Check whether every memory access is to an allocated region of memory.
Idea: By intercepting memory allocation routines such as malloc and free, ADDRCHECK maintains metadata for each byte of
the monitored program’s address space indicating whether or not that byte is currently accessible. The metadata are checked on
every memory operation for (mainly) invalid heap accesses and invalid stack accesses.
Metadata: One “accessible” bit per address byte of the monitored program.
Metadata updates: malloc and free events change the accessible state of the affected heap blocks.
Metadata checks: For every memory access, check the accessible bit for the memory address.
Auxiliary structures: A list of records for observed malloc’s and a list of records for observed free’s, which are used to detect
double free’s, invalid free’s, and memory leaks.
MEMCHECK [18, 19]
Purpose: Extend ADDRCHECK to detect the use of uninitialized values.
Idea: A memory load of an uninitialized value is not an error in itself (e.g., copying a partially initialized structure). Rather, an error
is raised only if uninitialized values are dereferenced as pointers, used in conditional tests, or passed into system calls. To achieve
this, MEMCHECK tracks the propagation of uninitialized values in the monitored program.
Metadata: Same as ADDRCHECK + one “initialized” bit per address byte + “initialized” state per register.
Metadata updates: Accessible bits are updated as in ADDRCHECK. Initialized bits are cleared after free’s and set for constant
value writes and system call returns. Propagation tracking: For every executed instruction, the destination becomes uninitialized
if at least one of the sources is uninitialized.
Metadata checks: Accessible bits are checked as in ADDRCHECK. Initialized bits are checked for base/index registers of memory
accesses, conditional test inputs, and system call inputs.
Auxiliary structures: Same as ADDRCHECK.
TAINTCHECK [21]
Purpose: Detect overwrite-related security exploits (e.g., due to buffer overruns and format string vulnerabilities).
Idea: All unverified program input data, such as data from the network, are marked as suspect, or tainted. Subsequently, the
propagation of tainted data through the program is carefully tracked. If a tainted value is loaded from memory, the destination
register is marked as tainted. A computation destination is marked as tainted if a source is tainted. An error is raised if tainted
data are used in critical ways, such as in jump target addresses, format strings of printf-like calls, or system call arguments.
Metadata: One “tainted” bit per address byte of the monitored program + “tainted” state per register.
Metadata updates: Set “tainted” bits to untainted after malloc’s. Memory buffers used in read or recv system calls obtaining
data from untrusted sources are marked as tainted. Tainted status is propagated for every executed instruction.
Metadata checks: Check tainted status of register or memory locations of indirect jump target addresses, format strings of printf-
like calls, and system call arguments.
TaintCheck with detailed tracking: We also study a TAINTCHECK variant that records a history of the taint propagation, using an
8-byte metadata structure (4-byte “from” address, 4-byte instruction pointer) per 4-byte application word (see Section 7.1).
LOCKSET [27]
Purpose: Detect data races by checking whether accesses to shared memory locations are protected by consistent sets of locks.
Idea: For each thread t, LOCKSET maintains the current set S_t of locks held by the thread. For each shared memory location m, it
maintains a candidate set S_m of locks. m is known to be a shared location if a second thread accesses it; at this moment, S_m is
initialized with the current lock set of the second thread. Afterwards, whenever a thread t references m, S_m is set to S_m ∩ S_t. If
S_m ever becomes empty, then no consistent common lock set protects accesses to m, and an error is raised.
Metadata: A lockset is implemented as a sorted list of lock addresses. A 32-bit record is maintained for every 4-byte word in the
monitored program. It consists of a compressed 30-bit pointer to the actual candidate lockset and a 2-bit state for the location
(virgin, exclusive, shared read-only, shared read-write). If the state is exclusive, the 30-bit pointer is reused for recording its
owner thread ID. For every thread, an uncompressed pointer to the thread’s current lockset is maintained.
Metadata updates: malloc’s set metadata to be virgin. A virgin location becomes exclusive after the first memory access. An
exclusive location becomes shared after a memory access from a thread other than the owner. For shared locations, lockset
intersection is performed for every memory access.
Metadata checks: For every shared memory access, check if the resulting lockset after intersection is empty.
Auxiliary structures: All the known locksets, each being a sorted list of lock addresses.
handler functionality, and that the interesting events are typ-
ically either important library/system calls or individual in-
structions. The former are often expensive to handle, but rare,
while the latter are inexpensive individually, but so frequent
that lifeguard performance is very sensitive to their cost.
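The LOCKSET row of Table 1 can be illustrated with a minimal Python sketch of its state machine and lockset refinement. This is a simplification under stated assumptions, not the lifeguard's actual implementation: locksets are frozensets rather than sorted lists behind compressed pointers, the two shared states are merged into a single SHARED state, the emptiness check is also applied right after the candidate set is initialized, and all class and method names are hypothetical.

```python
# Minimal LOCKSET sketch following Table 1. Assumptions: frozensets stand in
# for the sorted lock lists, and shared read-only / shared read-write are
# merged into one SHARED state. All names here are hypothetical.
VIRGIN, EXCLUSIVE, SHARED = "virgin", "exclusive", "shared"


class LockSet:
    def __init__(self):
        self.held = {}       # thread id -> set of locks currently held (S_t)
        self.state = {}      # location -> VIRGIN / EXCLUSIVE / SHARED
        self.owner = {}      # location -> owner thread (EXCLUSIVE state only)
        self.candidate = {}  # location -> candidate lockset (S_m)
        self.errors = []     # locations whose candidate set became empty

    def lock(self, t, l):
        self.held.setdefault(t, set()).add(l)

    def unlock(self, t, l):
        self.held.setdefault(t, set()).discard(l)

    def malloc(self, locations):
        for m in locations:
            self.state[m] = VIRGIN

    def _check(self, m):
        if not self.candidate[m]:
            self.errors.append(m)

    def access(self, t, m):
        s_t = frozenset(self.held.get(t, ()))
        st = self.state.get(m, VIRGIN)
        if st == VIRGIN:
            # First access: the location becomes exclusive to thread t.
            self.state[m], self.owner[m] = EXCLUSIVE, t
        elif st == EXCLUSIVE:
            if self.owner[m] != t:
                # A second thread touches m: it is now shared, and S_m starts
                # as that thread's current lockset.
                self.state[m] = SHARED
                self.candidate[m] = s_t
                self._check(m)  # assumption: check at initialization too
        else:
            # Shared location: refine S_m = S_m ∩ S_t and check for emptiness.
            self.candidate[m] = self.candidate[m] & s_t
            self._check(m)
```

In this sketch, two threads that protect the same location with different locks drive the candidate set to empty on the third access, which is the point at which the race would be reported.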
Table 1 shows that the role of many instruction-grain event
handlers centers around accessing lifeguard metadata (also
called shadow values [20]), which represent the state infor-
mation maintained by lifeguards regarding the monitored ap-
plication’s address space (e.g., which addresses have been
tainted), often including application registers. Consequently,
these handlers perform the following three major activities:
(1) Metadata updates: While some metadata (e.g., AD-
DRCHECK’s accessible bits and LOCKSET’s 32-bit records)
are updated only at infrequent events such as library calls, oth-
ers are updated at nearly every monitored instruction. The lat-
ter metadata arise in propagation tracking lifeguards such as
MEMCHECK and TAINTCHECK, which propagate metadata
status from sources to destinations on every instruction. Be-
cause each such update takes multiple instructions to perform,
propagation tracking is a key source of lifeguard overhead.
(2) Metadata checks: Checks are often performed in
instruction-grain events (e.g., for every application memory
access in ADDRCHECK and MEMCHECK and every shared
memory access in LOCKSET). Intuitively, in a well-behaved
program, metadata converge into stable states quickly. This
insight can be exploited in lifeguard handlers by checking the
frequent case—stable state—in a fast path while branching
into a slow path for more detailed checks. However, even the
most optimized checking operation must still perform a metadata access,
a comparison, and a branch. Thus, (redundant) metadata checks
are a second key source of lifeguard overhead.
(3) Metadata mapping: An address of the monitored appli-
cation is mapped into a metadata location (e.g., an application
address is mapped to an accessible bit in ADDRCHECK). This
operation involves a sequence of mask and shift instructions,
which often takes a significant portion of handler instructions
(as high as half of the instructions, as will be illustrated in Sec-
tion 6). Because this translation is required for every metadata
check and update in all our lifeguards, metadata mapping is a
third key source of lifeguard overhead.
As revealed by our experimental study in Section 7, these
three sources of overheads constitute a significant fraction of
the lifeguards’ execution times.
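The three activities above can be sketched for an ADDRCHECK-style lifeguard. The following Python sketch is illustrative only: it assumes a flat bitmap over a toy 64 KB address space (a real lifeguard would more likely use a multi-level table), and the function names are hypothetical. It shows the shift-and-mask mapping, the rare-event updates, and the per-access check.

```python
# ADDRCHECK-style sketch: one "accessible" bit per application byte.
# Assumption: a flat bitmap over a small toy address space; names are
# hypothetical and real lifeguards typically use multi-level tables.
ADDR_SPACE = 1 << 16                     # toy 64 KB address space
accessible = bytearray(ADDR_SPACE >> 3)  # 8 application bytes per metadata byte


def map_addr(addr):
    """Metadata mapping: application address -> (metadata byte index, bit mask)."""
    return addr >> 3, 1 << (addr & 7)


def set_range(start, size, value):
    """Metadata update, run on rare events such as malloc and free."""
    for addr in range(start, start + size):
        idx, mask = map_addr(addr)
        if value:
            accessible[idx] |= mask
        else:
            accessible[idx] &= ~mask & 0xFF


def on_malloc(start, size):
    set_range(start, size, True)


def on_free(start, size):
    set_range(start, size, False)


def on_mem_access(addr):
    """Metadata check, run on every memory access; True means the access is valid."""
    idx, mask = map_addr(addr)
    return bool(accessible[idx] & mask)
```

Even in this small handler, `map_addr` accounts for a large share of the work done by `on_mem_access`, which is the cost that metadata mapping refers to.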
Finally, as shown in Table 1, there are several important
differences in the way that metadata are used by the four
lifeguards. These differences fall along five axes: (i) unit
for metadata (memory or register, per-byte or per-word); (ii)
metadata bits per unit (1–64 bits); (iii) metadata semantics
(e.g., LOCKSET’s pointer plus state encoding); (iv) whether
the metadata are propagation-tracking; and (v) use of auxil-
iary structures. These differences demand flexibility in the
underlying support platform.
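As an illustration of axis (iii), the LOCKSET record from Table 1 (a 30-bit field plus a 2-bit state in one 32-bit word) might be packed as in the Python sketch below. The bit layout is an assumption made for illustration: the state is placed in the low two bits, and a pointer is compressed by dropping the two low bits of a 4-byte-aligned 32-bit address. The function names are hypothetical.

```python
# Sketch of a LOCKSET-style 32-bit metadata record: a 30-bit field plus a
# 2-bit state. Assumptions (not specified in the text): the state occupies
# the low two bits, and pointers are compressed by dropping the two low bits
# of a 4-byte-aligned 32-bit address.
VIRGIN, EXCLUSIVE, SHARED_RO, SHARED_RW = range(4)


def pack(field30, state):
    assert 0 <= field30 < (1 << 30) and 0 <= state < 4
    return (field30 << 2) | state


def unpack(record):
    """Return (30-bit field, 2-bit state)."""
    return record >> 2, record & 0x3


def pack_lockset_ptr(ptr, state):
    assert ptr % 4 == 0 and 0 <= ptr < (1 << 32)
    return pack(ptr >> 2, state)


def lockset_ptr(record):
    """Recover the uncompressed pointer to the candidate lockset."""
    return (record >> 2) << 2


def pack_owner(thread_id):
    # In the exclusive state, the 30-bit field holds the owner thread ID.
    return pack(thread_id, EXCLUSIVE)
```

Hardware that hard-wires a metadata format cannot interpret a record like this one, which is why flexibility along this axis matters.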
2.2 Related Work on Hardware Proposals
for Instruction-Grain Lifeguards
Lifeguard-specific techniques. One class of prior hardware
proposals achieves low overhead, but targets only one or a
subset of lifeguards. HARD [41] implements hardware-based
LOCKSET by extending every cache line with a hardware
Bloom filter representing the candidate lockset and augmenting
the snoopy-based cache coherence protocol for lockset
communication and updates. MemTracker [35] focuses only
on memory access monitoring (propagation tracking, for in-
stance, is not supported). The scheme adds an extra processor
pipeline stage that performs lifeguard metadata updates and
checks based on a state transition table with events such as
load, store, alloc, and free. It cannot handle flexible
metadata semantics such as those in LOCKSET. Similarly, several
recent hardware proposals [7, 8, 30, 34] focus on improving
the performance of propagation tracking—also known as dy-
namic information flow tracking (DIFT)—by introducing mi-
croarchitectural changes to enable the processor to manage
metadata directly. While several of these proposals provide
some policy flexibility (addressing axis (ii), above), they still
restrict the metadata format and semantics (axes (i) and (iii))
to reduce the hardware complexity; hence, lifeguard general-
ity is sacrificed. For example, they do not support non-DIFT
lifeguards such as LOCKSET and cannot support propagation
tracking with more flexible metadata such as TAINTCHECK
with detailed tracking.
[Figure 3 diagram: a many-core processor in which the application runs on Core 1 and the lifeguard on Core 2. Log producer components (log record capture, compress & store) write the log, which is transported via the last-level on-chip cache; log consumer components (fetch & decompress, log record dispatch) deliver it to the lifeguard. The components added in this paper (IT & IF, M-TLB) are on the consumer side.]
Figure 3. LBA on a many-core processor, including the components
(IT, IF, M-TLB) proposed in this paper.
Techniques targeting binary instrumentation cost. The
second class of hardware proposals strives to support a wide
range of lifeguards. However, they focus only on reducing the
costs incurred by dynamic binary instrumentation [1, 14, 20],
the state-of-the-art software-only lifeguard platform. In par-
ticular, they seek to reduce (i) code translation costs for in-
serting event handlers into application code; (ii) competition
for resources such as cycles, registers and caches; and/or (iii)
the cost of re-creating hidden instruction states, such as mem-
ory addresses in IA32. DISE [5] removes cost (i) by extending
the instruction fetch and decode unit with a macro-expansion
capability for pattern-matching triggering events and insert-
ing lifeguard event handlers on the fly. It also reduces cost
(ii) somewhat by providing special registers for lifeguards.
iWatcher [42] enables event handlers to be associated with ad-
dress ranges, triggers event handlers based on cache line tags,
and executes handlers and the monitored program in separate
hardware threads, thus reducing all three costs. However,
the scheme does not support propagation tracking or large
tags. Several recent studies (Heapmon [28], INDRA [29],
and LBA [3]) propose to take advantage of many-core proces-
sors and run lifeguards on otherwise idle cores. While Heap-
mon and INDRA studied only specific lifeguards, LBA is a
general-purpose replacement for the dynamic binary instru-
mentation approach, reducing all three costs.
In contrast to this prior work, we provide hardware accel-
erators that address the three major performance bottlenecks
in instruction-grain lifeguards, while being sufficiently flexi-
ble to support their diverse requirements.
3 Case Study: Log-Based Architectures
As our work was done in the context of the Log-Based
Architectures (LBA) project, we will use LBA as our running
case study. Note, however, that our framework is not tied to
LBA and can also be used to accelerate any lifeguard platform
having the generic structure of Figure 1 (e.g., DISE [5]).
Figure 3 depicts a many-core processor enhanced with
LBA producer components (darker/green rectangles with out-
going arrows) and LBA consumer components (lighter/yellow
rectangles with incoming arrows) for every core. Given users’
relative preferences of performance, power, and correctness,
lifeguard monitoring can be dynamically enabled. In Figure 3,
the dashed/blue rectangles on the consumer side are the new
components proposed in this paper.
Instruction           Event delivered    TaintCheck operation            IT filter state / event after filtering
(1) mov A, %eax       mem_to_reg         rtaint(%eax) = mtaint(A)        IT(%eax) = addr(A)
(2) add B, %eax       dest_reg_op_mem    rtaint(%eax) |= mtaint(B)       IT(%eax) = clear
(3) shr 8, %eax       -                  -                               -
(4) mov C, %ecx       mem_to_reg         rtaint(%ecx) = mtaint(C)        IT(%ecx) = addr(C)
(5) and 0xff, %ecx    -                  -                               -
(6) sub %ecx, %eax    dest_reg_op_reg    rtaint(%eax) |= rtaint(%ecx)    IT(%eax) = clear
(7) mov %eax, D       reg_to_mem         mtaint(D) = rtaint(%eax)        event: imm_to_mem (D)
(8) mov E, %eax       mem_to_reg         rtaint(%eax) = mtaint(E)        IT(%eax) = addr(E)
(9) mov %eax, F       reg_to_mem         mtaint(F) = rtaint(%eax)        event: mem_to_mem (E,F)
A-F: memory locations; rtaint/mtaint: taint values for register/memory; IT: filter state; xx_to_xx: delivered event
[Application data-flow graph over instructions (1)-(9) and memory locations A-F.]
Figure 4. Propagation tracking for an example sequence of application instructions.
The zoom-in picture in Figure 3 shows that a lifeguard running
on core 2 is monitoring an unmodified application running
on core 1. As an application instruction retires, LBA
captures a record, compresses it, and transports it through a
buffer in on-chip cache. An instruction record (conceptually)
consists of the program counter, instruction type, input/output
operand identifiers, and any data addresses (the compressed
log records are less than a byte [3]). In addition, LBA supports
software-inserted annotation records representing high-level
events (e.g., malloc library calls), which can be captured via
wrapper libraries. The consumer components support event-
driven execution. A lifeguard is organized as a set of event
handlers registered with LBA in an event type configuration
table (ETCT); each instruction record corresponds to one or
more events. Every handler ends with a special control transfer
instruction, nlba (next LBA event). nlba has no target
address; instead, it determines the event type from log
records, looks up the ETCT, and changes program control to
the entrance of the registered handler. Certain event values
(such as data addresses) are automatically placed in registers
for ready handler access.
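The event-driven execution described above can be modeled in software. The Python sketch below is only an analogy under stated assumptions: log records are modeled as (event type, operands) tuples, the loop body stands in for nlba, unregistered event types are simply skipped, and the handler and function names are hypothetical.

```python
# Software model of LBA-style event dispatch. Assumptions: log records are
# (event_type, operands) tuples, and the loop body plays the role of nlba by
# fetching the next record and transferring control to the handler that is
# registered in the event type configuration table (ETCT).
from collections import deque

etct = {}   # event type configuration table: event type -> handler
trace = []  # what the toy handlers observed


def register(event_type):
    def wrap(handler):
        etct[event_type] = handler
        return handler
    return wrap


@register("mem_to_reg")
def on_mem_to_reg(addr, reg):
    trace.append(("load", addr, reg))


@register("malloc")
def on_malloc(start, size):  # annotation record from a wrapper library
    trace.append(("malloc", start, size))


def run_lifeguard(log):
    while log:  # a real consumer would stall here when the log is empty
        event_type, operands = log.popleft()
        handler = etct.get(event_type)  # ETCT lookup
        if handler is not None:  # assumption: unregistered types are skipped
            handler(*operands)  # event values arrive as handler arguments
```

A lifeguard that does not care about an event class would, in this model, leave that event type out of the table.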
LBA reduces the producer-consumer synchronization
overhead by using a large (64KB–1MB) log buffer; however,
if the buffer becomes full (empty), the application (lifeguard)
must stall. Because of the decoupled nature of execution and
checking, bug detection at the lifeguard lags bug occurrence at
the application. LBA relies on OS-level support for fault
containment. The monitored application is stalled at each syscall
until the lifeguard finishes checking all the remaining records
in the log buffer; this prevents any damage from propagating
into the OS kernel and affecting other applications.
4 Optimizing Propagation Tracking
Lifeguards such as TAINTCHECK and MEMCHECK (and
generally all DIFT lifeguards [30]) track the propagation of
data characteristics through the application’s address space.
While lifeguards that do not track data propagation may ig-
nore certain classes of events (e.g., LOCKSET may ignore
events that do not reference memory, such as register-register
ALU operations), propagation-style lifeguards must track the
flow of data through every operation that handles data, and
hence must be triggered for nearly every event.
Consequently, propagation-tracking lifeguards often suffer
high execution overhead. For example, Figure 4 shows a se-
quence of application instructions, the corresponding events,
and the TAINTCHECK operations that would be triggered to
handle them. Except for the two “self” operations, (3) and
(5), an event is delivered for every application instruction.
(See Figure 5 for the full list of propagation-tracking events.)
Note that each TAINTCHECK operation may comprise mul-
tiple instructions. For example, the operation associated with
event (1) must determine the metadata address associated with
application address A, fetch A’s metadata (taint value), deter-
mine the metadata address associated with application register
%eax, and finally store the taint value. Clearly, reducing the
number of events that must be delivered to the lifeguard could
result in significant performance savings.
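The per-event TAINTCHECK operations of Figure 4 can be written out as a small Python sketch. It assumes one boolean of taint per register or memory location, kept in dictionaries. The handler names mirror the event names in the figure, but the code itself is illustrative rather than the lifeguard's implementation.

```python
# Generic (unfiltered) taint propagation for the events of Figure 4.
# Assumptions: taint is one boolean per register or memory location, stored
# in dictionaries; handler names mirror the event names in the figure.
rtaint = {}  # register -> tainted?
mtaint = {}  # memory location -> tainted?


def mem_to_reg(src, dst):
    rtaint[dst] = mtaint.get(src, False)


def reg_to_mem(src, dst):
    mtaint[dst] = rtaint.get(src, False)


def dest_reg_op_mem(src, dst):
    rtaint[dst] = rtaint.get(dst, False) or mtaint.get(src, False)


def dest_reg_op_reg(src, dst):
    rtaint[dst] = rtaint.get(dst, False) or rtaint.get(src, False)


# Events delivered for instructions (1)-(9); (3) and (5) deliver none.
EVENTS = [
    (mem_to_reg, "A", "%eax"),          # (1) mov A, %eax
    (dest_reg_op_mem, "B", "%eax"),     # (2) add B, %eax
    (mem_to_reg, "C", "%ecx"),          # (4) mov C, %ecx
    (dest_reg_op_reg, "%ecx", "%eax"),  # (6) sub %ecx, %eax
    (reg_to_mem, "%eax", "D"),          # (7) mov %eax, D
    (mem_to_reg, "E", "%eax"),          # (8) mov E, %eax
    (reg_to_mem, "%eax", "F"),          # (9) mov %eax, F
]


def run(events):
    """Invoke one handler per delivered event; return the number of events."""
    for handler, src, dst in events:
        handler(src, dst)
    return len(events)
```

Seven handler invocations are needed for nine instructions, and each one performs at least one metadata lookup.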
4.1 Hardware-assisted Propagation Tracking
To improve the performance of propagation-tracking life-
guards, several studies [7, 8, 30, 34] have proposed hardware
designs that automatically track metadata values. Unfortu-
nately, these designs are quite narrow in scope—supporting
only a single lifeguard or, at best, only a particular metadata
size and organization (so that the hardware engine can access
the lifeguard’s metadata without software support). As a re-
sult, even simple modifications to lifeguards that perform well
in their unmodified form, such as adding detailed tracking to
TAINTCHECK, may reduce the lifeguard’s performance from
an acceptable level to a prohibitively low one.
As an alternative to propagating metadata values in hard-
ware, we propose an optimization that tracks the data inher-
itance instead. The difference can be illustrated by returning
to instruction (1) in Figure 4. A value propagation mechanism
would require that hardware retrieve the metadata correspond-
ing to memory address A, and associate it with register %eax.
Inheritance tracking, in contrast, suggests that the hardware
associate %eax with the address A, rather than its metadata.
By separating the tracking of inheritance from the propagation
of metadata values, we support a wider range of lifeguards
because the hardware need not comprehend the metadata or-
ganization. Additionally, the lazy evaluation of the metadata
often enables further optimizations, as we shall see later.
4.2 Unary Inheritance Tracking
Inheritance Tracking is particularly useful for eliminating
events associated with the flow of data through registers. Con-
sequently, an initial sketch for propagation performance ac-
celeration would essentially be a small shadow register file
that associates each architectural register with the addresses
from which it inherits. The challenge with this design is that,
with generic propagation, a particular register could have mul-
tiple ancestors. For example, after instruction (6) in Figure 4,
%eax’s inheritance list contains A, B, and C.
Fortunately, we observe that in many situations, it is suf-
ficient to track unary propagation instead of generic propaga-
tion. Here, unary propagation includes single-source/single-
destination (“copy”) operations (register-to-memory and
memory-to-register), as well as binary computations that use
an immediate value as a source operand. We assume that non-
unary operations (those that combine more than one meta-
data source) propagate a “clean” result to the destination (e.g.,
untainted for TAINTCHECK or initialized for MEMCHECK).
While at first glance this assumption may appear too liberal,
we argue it is valid if (a) the lifeguard reports an error if a
source of a non-unary operation is unclean, or (b) the seman-
tics underlying the metadata values imply that, for all practical
purposes, the result of a non-unary operation is a clean value.
Perhaps surprisingly, both MEMCHECK and TAINTCHECK
are candidates for unary inheritance tracking as MEMCHECK
satisfies property (a) and TAINTCHECK satisfies property (b).
MemCheck. MEMCHECK monitors memory references to
detect the use of unallocated locations or uninitialized values.
To avoid false positives associated with the harmless copying
of uninitialized memory values (e.g., in a padded struct),
MEMCHECK does not report an error on the first load of an
uninitialized value, but rather tracks the propagation of such
values until one is used for a pointer dereference, conditional
test input, or system call input [18, 19]. While this lazy evalu-
ation of uninitialized value use is sufficient to eventually catch
use of uninitialized memory, we argue that an eager eval-
uation that flags the first use of an uninitialized value in a
non-unary computation is equally valid (e.g., flagging when
uninitialized values are added). Our modified MEMCHECK
checks the source operands of non-unary operations (identify-
ing any uninitialized values as errors) and treats the destina-
tion operands as initialized, in order to avoid a cascading of
error reports all based on the same uninitialized value.
TaintCheck. TAINTCHECK tracks the propagation of mem-
ory taint values to detect memory overwrite-based security
exploits [21], and for all practical purposes, unary propaga-
tion appears to suffice for detecting such exploits. This claim
is supported through four observations. First, the security
literature [6, 37] reports that overwrite attacks (e.g., buffer
overflow) rely almost exclusively on direct copying. Second,
third-party analysts [31] often identify overwrite-based secu-
rity vulnerabilities in proprietary software by causing a soft-
ware crash through the introduction of a long input composed
of a known pattern (e.g., repeating 0x55). A vulnerability is
identified if the pattern is observed in expected locations in
the core dump. This technique relies on a direct (unary) prop-
agation of the input, and any identified vulnerability will be
detected under unary-only propagation. Third, to empirically
evaluate these claims, we analyzed the first six months of CVE
security alert entries in 2007 [31]. For the entries involving
open source software, we studied the source code patches and
found that every memory overwrite vulnerability was due to
unary propagation. Finally, while there is always a concern
that attackers can specifically work around unary-only
propagation, note that TAINTCHECK identifies attacks (including
format string and function pointer attacks) before any
attacking code executes. Thus, the attack is constrained to
exploit the original application code, not any injected code.
(This task could be made even more challenging by providing
application developers with static analysis tools that identify
when the structure of their code provides an opportunity for a
non-unary attack.) For these reasons, we believe that assuming
taintedness does not propagate through non-unary operations
represents a good performance/coverage trade-off.
[Figure 5 (diagram). Labeled components: IT table, Conflict
Detection, State Transition and Action Table, and a MUX that
selects between the original event and a transformed event
(e.g., mem_to_reg on a conflict) for delivery. Each IT table
entry (reg1 ... regN) holds: IT state, addr, size, and, for
conflict detection, addr&~3 with 4 bits and (addr&~3)+4 with
4 bits. IT state: 00 - clear, 01 - addr, 10 - in lifeguard.
The State Transition and Action Table is indexed by the
original event and the starting IT(%rs) state (clean, addr,
in lifeguard).]
Original events and example application operations:
  imm_to_reg        mov $imm, %rd
  imm_to_mem        mov $imm, mem(daddr)
  reg_self          op $imm, %rd
  mem_self          op $imm, mem(daddr)
  reg_to_reg        mov %rs, %rd
  reg_to_mem        mov %rs, mem(daddr)
  mem_to_reg        mov mem(saddr), %rd
  mem_to_mem        copy saddr to daddr
  dest_reg_op_reg   op %rs, %rd
  dest_reg_op_mem   op mem(saddr), %rd
  dest_mem_op_reg   op %rs, mem(daddr)
  other             other instructions (flush all relevant
                    registers' IT to "in lifeguard" by
                    delivering imm_to_reg/mem_to_reg, then
                    deliver other)
Figure 5. A unary inheritance tracking design.
By limiting the inheritance tracking to unary propagation,
each register in the inheritance table described above can be
associated with at most one source—making such a struc-
ture feasible. Moreover, because non-unary operations act
as “sinks” for propagation (the result is always clean), many
propagation events can be eliminated. In Figure 4, for ex-
ample, unary propagation removes the need for accessing the
metadata for A, B, and C.
4.3 A Unary Inheritance Tracking Design
Figure 5 presents a hardware design for unary inheritance
tracking. The events are chosen to support the IA32 archi-
tecture. Note, however, that one can remove several mem-
ory related events to support RISC architectures. We use a
table, IT, to hold inheritance information for each general-
purpose register. Each entry either specifies the memory ad-
dress from which the register inherits or indicates that the reg-
ister is “clean”. (A third state, “in lifeguard”, is discussed be-
low.) For each incoming original propagation event, we look
up the state transition and action table using the event type and
the state of the source register. The action is either to update
an entry in the IT table, to deliver an event to the lifeguard,
or to simply discard the event. For example, if the event is a
reg_to_reg type (mov %rs, %rd) and IT(%rs) holds an address
addr, then IT(%rd) is updated to addr.
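The lookup described above can be sketched in software as follows. This is a minimal model under stated assumptions, not the hardware design: the type and function names (it_entry, it_step, and the DELIVER_* codes) are hypothetical, only four event types are modeled, the reg_to_mem and in-lifeguard transitions reflect one reading of Figure 5, and conflict detection is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout; a software model of the IT lookup only. */
enum it_state { IT_CLEAN, IT_ADDR, IT_IN_LIFEGUARD };
enum event_type { IMM_TO_REG, MEM_TO_REG, REG_TO_REG, REG_TO_MEM };
enum delivered { DELIVER_NONE, DELIVER_IMM_TO_MEM, DELIVER_MEM_TO_MEM,
                 DELIVER_REG_TO_MEM, DELIVER_REG_TO_REG };

struct it_entry { enum it_state state; uint32_t addr; };
struct event { enum event_type type; int rs, rd; uint32_t saddr, daddr; };

#define NUM_REGS 8
static struct it_entry IT[NUM_REGS];   /* zero-initialized: every register clean */

/* Apply one propagation event; report which event (if any) reaches the lifeguard. */
static enum delivered it_step(const struct event *e)
{
    assert(e->rs >= 0 && e->rs < NUM_REGS && e->rd >= 0 && e->rd < NUM_REGS);
    switch (e->type) {
    case IMM_TO_REG:                    /* IT(%rd) := clean */
        IT[e->rd].state = IT_CLEAN;
        return DELIVER_NONE;
    case MEM_TO_REG:                    /* IT(%rd) := saddr */
        IT[e->rd].state = IT_ADDR;
        IT[e->rd].addr = e->saddr;
        return DELIVER_NONE;
    case REG_TO_REG:
        if (IT[e->rs].state == IT_IN_LIFEGUARD) {
            /* Assumption: software owns the source metadata, so it must copy it. */
            IT[e->rd].state = IT_IN_LIFEGUARD;
            return DELIVER_REG_TO_REG;
        }
        IT[e->rd] = IT[e->rs];          /* IT(%rd) := IT(%rs) */
        return DELIVER_NONE;
    case REG_TO_MEM:                    /* assumed rewriting by source state */
        switch (IT[e->rs].state) {
        case IT_CLEAN: return DELIVER_IMM_TO_MEM;   /* store of a clean value */
        case IT_ADDR:  return DELIVER_MEM_TO_MEM;   /* copy IT(%rs).addr -> daddr */
        default:       return DELIVER_REG_TO_MEM;
        }
    }
    return DELIVER_NONE;
}
```

In this sketch a load followed by a register copy and a store reaches the lifeguard as a single mem_to_mem event, which is the kind of event reduction the text attributes to IT.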
Three issues complicate the picture somewhat. First, because
the inherited metadata values are lazily evaluated, a
write-after-read conflict may occur if a “store to A” event
(e.g., imm_to_mem) arrives when a register r currently inherits
from A. A problem arises if the store event were delivered,
updating A’s metadata, and later a reg_to_mem event were to
propagate r’s metadata to another location B’s metadata. B’s
metadata should inherit the previous value of A’s metadata,
which had been overwritten. We solve this potential problem
by detecting the conflict and delivering a mem_to_reg event
to the lifeguard just prior to the “store to A” event, so that
the lifeguard can appropriately maintain A’s previous metadata
as the current state of r. IT(r) is then set to “in lifeguard”,
indicating that r’s metadata is now maintained in software
(until the next overwriting of register r).
Second, architectures such as IA32 support unaligned
memory accesses and accesses with multiple sizes. To be
conservative, we would like the conflict detection mechanism
to match any recorded memory accesses that have overlap-
ping ranges with the current store. As shown in Figure 5, the
four rightmost columns in the IT table store a pair of 4-byte
aligned addresses with bitmaps indicating used bytes. An in-
coming store can first match its 4-byte aligned addresses and
then check the bitmaps to determine if there is a conflict.
Third, there may be instructions not represented by the ex-
plicit event types in Figure 5 (e.g., xchg in IA32). Since these
instructions are typically less frequent, we deliver an other
event where the lifeguard software can analyze the instruction
record in a slow path to determine the appropriate action. Be-
fore delivering an other event, our unary tracking hardware
flushes the IT state of relevant registers to ensure that the life-
guard has all the relevant up-to-date metadata states.
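The overlap test used for conflict detection (the second issue above) can be illustrated with a small sketch. It assumes accesses of at most 4 bytes and stores the second aligned address implicitly as base + 4; the names footprint, make_footprint, and conflicts are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Footprint of one access: two consecutive 4-byte-aligned words, each with a
 * 4-bit map of the bytes touched. Assumes 1 <= size <= 4 (hypothetical limit). */
struct footprint { uint32_t base; uint8_t mask_lo, mask_hi; };

static struct footprint make_footprint(uint32_t addr, unsigned size)
{
    struct footprint f;
    assert(size >= 1 && size <= 4);
    unsigned off = addr & 3u;                      /* offset inside the first word */
    unsigned bits = ((1u << size) - 1u) << off;    /* at most 7 bits wide */
    f.base = addr & ~3u;
    f.mask_lo = (uint8_t)(bits & 0xFu);            /* bytes in word at base */
    f.mask_hi = (uint8_t)((bits >> 4) & 0xFu);     /* bytes in word at base + 4 */
    return f;
}

/* Does an incoming store overlap the recorded access? First match the
 * aligned addresses, then compare the byte maps. */
static bool conflicts(struct footprint rec, uint32_t st_addr, unsigned st_size)
{
    struct footprint st = make_footprint(st_addr, st_size);
    if (st.base == rec.base)
        return (st.mask_lo & rec.mask_lo) != 0 || (st.mask_hi & rec.mask_hi) != 0;
    if (st.base == rec.base + 4)
        return (st.mask_lo & rec.mask_hi) != 0;
    if (st.base + 4 == rec.base)
        return (st.mask_hi & rec.mask_lo) != 0;
    return false;
}
```

For example, a recorded 4-byte load at 0x1002 covers bytes 0x1002 through 0x1005, so a 1-byte store to 0x1004 conflicts while a 2-byte store to 0x1006 does not.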
Moreover, we have optimized the state transition for binary
operations with known “clean” %rs. In such cases, we choose
to “do nothing”, leaving the destination’s metadata unmodi-
fied, which follows generic propagation.
In summary, because many of the events are processed
solely by the IT mechanism and not by the lifeguard, the life-
guard overhead can be significantly reduced. For example,
in Figure 4, IT reduces the number of delivered events from
seven to two, as shown on the right of the figure.
5 Idempotent Filters for Checking Events
Checking metadata states upon observing certain applica-
tion events is a fundamental operation of any lifeguard. While
some lifeguards perform only a modest number of checks
(e.g., TAINTCHECK), others perform checks very frequently
(e.g., ADDRCHECK, MEMCHECK, and LOCKSET logically
check every memory operation). However, many checks are
idempotent (redundant). For example, once ADDRCHECK
checks that a memory location is allocated, subsequent loads
and stores to the same address need not be checked—until the
next free() event. In this section, we present an Idempotent
Filter design for reducing lifeguard checking overhead.
The idea is to introduce a lifeguard-configurable IF cache
of recently observed checking events. If an incoming event
hits in the cache, it is discarded (filtered). Upon a miss in
the IF cache, an event E is delivered to the lifeguard. If E
is configured to be cacheable, it is inserted into the IF cache
with the LRU replacement policy.
Since different lifeguards have different checking require-
ments, the IF hardware extends the event type configuration
table (ETCT), which specifies event handler addresses, to in-
clude the following fields that control the IF behavior. First,
a cacheable bit specifies whether the lifeguard classifies the
event as checking-only (non-updating). If set, events of that
type can be filtered by the IF cache. Second, a “check catego-
rization” (CC) field enables lifeguards to specify whether two
event types result in the same checks (such as load and store
events in ADDRCHECK). Third, there is a cacheable bit for ev-
ery field of the instruction record. A line in the IF cache con-
sists of the CC value and the set of selected record field values.
The line is indexed by a hash code computed from the entire
line. If the CC value and the selected fields of an incoming
event match an existing cache entry, the IF hardware consid-
ers it a “hit” and assumes that the two events are idempotent.
For example, ADDRCHECK would use the same CC value for
load and store event types, and specify that the memory ad-
dress and the size fields are cacheable. MEMCHECK employs
IF similarly for accessibility checking. In contrast, LOCKSET
must treat load and store operations as separate with respect
to filtering by using different categorization values for them
(surprisingly, IF does apply to LOCKSET despite only being
able to observe a single thread at a time; see footnote 1).
Moreover, the ETCT specifies invalidation policies for the
IF cache. Note that checks are only idempotent as long
as the underlying metadata remain unmodified. If the rele-
vant metadata changes, cached checks must be invalidated.
We further augment the ETCT with two bits: one indicating
whether an event of this type invalidates the entire IF cache
(e.g., malloc/free calls or system calls), and one indicating
whether the event invalidates records that match the specified
CC value and selected fields of the event.
Perhaps most interestingly, we find that relatively small
cache sizes (e.g., 32-entry) and associativity (e.g., 4-way) are
remarkably useful for idempotent filtering. Extensive results
are presented in Section 7.
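A software model of the filter may help make the hit, miss, and invalidation behavior concrete. The sketch below assumes a fully-associative, 32-entry cache keyed on the CC value plus the address and size fields (the ADDRCHECK-style configuration described above); the structure and function names are hypothetical, and only whole-cache invalidation is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IF_ENTRIES 32   /* fully-associative, as assumed in the simulation study */

struct if_line { bool valid; uint8_t cc; uint32_t addr; uint8_t size; uint64_t last_use; };
struct if_cache { struct if_line line[IF_ENTRIES]; uint64_t clock; };

/* Whole-cache invalidation, e.g. on malloc/free or a system call. */
static void if_invalidate_all(struct if_cache *c) { memset(c, 0, sizeof *c); }

/* Returns true if the event is filtered (a hit). On a miss the event is
 * cached with LRU replacement and false is returned, meaning the event
 * must be delivered to the lifeguard. */
static bool if_filter(struct if_cache *c, uint8_t cc, uint32_t addr, uint8_t size)
{
    int victim = 0;
    assert(c != NULL);
    c->clock++;
    for (int i = 0; i < IF_ENTRIES; i++) {
        struct if_line *l = &c->line[i];
        if (l->valid && l->cc == cc && l->addr == addr && l->size == size) {
            l->last_use = c->clock;
            return true;
        }
        /* Prefer the first invalid line; otherwise the least recently used. */
        if (!l->valid) {
            if (c->line[victim].valid) victim = i;
        } else if (c->line[victim].valid && l->last_use < c->line[victim].last_use) {
            victim = i;
        }
    }
    c->line[victim] = (struct if_line){ true, cc, addr, size, c->clock };
    return false;
}
```

Giving loads and stores the same cc value makes a store to a just-loaded address a hit, while a LOCKSET-style configuration would pass different cc values and see a miss.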
6 Reducing the Cost of Metadata Mapping
As described in Section 2.1, instruction-grain lifeguards
typically keep metadata for every byte or word in the address
space of the monitored applications, and typically, this
metadata is consulted and/or updated for each (unfiltered)
memory reference event. Because metadata accesses are so
frequent, providing a fast translation from application
addresses to metadata addresses is essential for good
lifeguard performance.
Footnote 1: For LOCKSET, we choose to invalidate the filter for
all annotation records (including lock/unlock). Consider two
redundant accesses to the same memory location m from the same
thread t. Because there is no lock/unlock in between, t’s
lockset St at the two accesses must be the same. Intersecting
m’s candidate set Sm with the same St twice does not shrink Sm
further.
[Figure 6 (diagram): the application address space (code,
global data, heap, mmaped, stack, library, unused) and its
metadata under the one-level and two-level designs.]
Figure 6. Two alternative lifeguard metadata designs. (We focus
on the more flexible two-level design.)
6.1 Lifeguard Metadata Organizations
From a survey of previous studies [7, 8, 17, 18, 21, 25, 27,
30, 35], we have identified the two possible metadata organi-
zations shown in Figure 6. In the one-level design, a single
contiguous metadata region represents a (possibly scaled) di-
rect translation of the entire application address space. While
this organization enables address translation through a simple
scale-and-offset mechanism, it presents several drawbacks.
First, sparsely-used application address spaces consume
an unnecessarily large portion of the lifeguard address space.
Second, because both extremes of an application’s virtual ad-
dress space are typically occupied, the one-level design is only
viable when metadata consume less space than normal data
(e.g., 1 taint bit representing 1 byte), limiting its
applicability. Third, the waste of metadata address space is
particularly challenging when the application and lifeguard
must share the same address space, as in many proposals
(including DISE); in the face of the challenges of getting the
application data, lifeguard metadata, lifeguard auxiliary data,
and the lifeguard code to fit into a single virtual address
space, robustness and portability considerations (see
footnote 2) often favor the two-level design [19].
The two-level design avoids these drawbacks by employ-
ing an indexing structure similar to a page table to perform the
translation between application addresses and metadata ad-
dresses. This structure is clearly more space-efficient and flex-
ible. The two-level design also requires fewer modifications
to the normal loader, runtime, and OS behavior because the
lifeguard can make better use of its available address space.
Because our goal is to support a wide range of lifeguards, we
consider the two-level design to be the baseline configuration.
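The two organizations can be contrasted with a short sketch. It assumes a 32-bit application address space, 16 level-1 bits, and one metadata byte per 4 application bytes (the layout later used in Figure 7); lazy allocation of level-2 chunks and all names here are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One-level design: metadata address = base + (app_addr >> scale). */
static uintptr_t one_level(uintptr_t base, uint32_t app_addr, unsigned scale)
{
    assert(scale < 32);
    return base + (app_addr >> scale);
}

/* Two-level design: a level-1 table of pointers to level-2 chunks that are
 * allocated only when first touched (lazy allocation is an assumption). */
#define L1_BITS 16
#define CHUNK_BYTES (1u << 14)            /* 64 KB of application data / 4 */
static uint8_t *level1[1u << L1_BITS];    /* NULL = region not used so far */

static uint8_t *two_level(uint32_t app_addr)
{
    uint32_t i1 = app_addr >> 16;
    if (level1[i1] == NULL) {
        level1[i1] = calloc(CHUNK_BYTES, 1);
        if (level1[i1] == NULL) abort();
    }
    return &level1[i1][(app_addr & 0xffffu) >> 2];
}
```

The one-level translation is a shift and an add but reserves metadata for the whole address range, whereas the two-level lookup pays an extra memory reference and only backs the regions the application actually touches.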
6.2 Lifeguard-managed Address Translation
The one negative attribute of the two-level structure is per-
formance, as the extra level of indirection requires additional
lifeguard instructions and memory references. Figure 7 shows
a representative event handler in TAINTCHECK, which com-
bines the taint of a memory location and a register. The origi-
nal C code is shown on the left along with the generated IA32
assembly instructions. Of the eight instructions, the first five
perform metadata mapping, which is over half of the
instructions in this handler! Our goal is to achieve the
advantages of the two-level design while minimizing the cost.
Footnote 2: particularly given (1) large, uncooperative
programs, or (2) tools which target multiple operating systems
and/or diverse hardware platforms.
Noting that the two-level metadata structure resembles
page-tables, a hardware TLB mechanism suggests itself as a
possible solution to the performance problem. Rather than
translating virtual addresses to physical ones, however, such
a mechanism would translate application-space virtual
addresses to lifeguard-space (metadata) virtual addresses. Note
that, because the outputs of the mechanism are lifeguard-space
virtual addresses, the lifeguard can be permitted to manage the
structure directly, minimizing (if not obviating) the need for
OS support. In fact, a software-managed TLB [13] augmented
to handle user-selectable metadata structure sizes would mini-
mize the negative performance impact of the two-level design
while preserving its flexibility.
6.3 The LMA (Load Metadata Address) Instruction
In the context of the IA32 architecture, we propose ex-
posing a TLB-like address translation mechanism, which we
call Metadata-TLB (M-TLB), through a new instruction, LMA
(Load Metadata Address), which translates an application ad-
dress directly to a metadata address. Such a mechanism en-
ables us to replace the first five instructions on the left side
of Figure 7 with a single LMA instruction on the right side,
thus reducing the handler’s instruction count by half. Figure 8
describes the complete extension to the architecture.
Figure 9 depicts an M-TLB design. A lifeguard enables LMA
support by configuring the metadata mapping with lma_config.
Because the M-TLB is software managed (by the user-space
lifeguard), a miss handler address is specified. An
application data address is broken down into three parts: the
highest part is the level1 index, the middle part is the level2
index, and the lowest part is the index into each level2
element. The LMA config register records the number of level1
and level2 bits as well as the size of level2 elements. As
shown in Figure 9, an lma instruction converts an application
data address into a metadata address using a fast hardware
lookup table (CAM). If successful, it takes one cycle. However,
if the mapping is not found, the application address is pushed
onto the stack and the software miss handler is called, which
uses lma_fill to insert the missing mapping. When the miss
handler returns, the lma instruction is re-executed.
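The translation path can be modeled in software as a rough sketch. The example values are those of the TAINTCHECK configuration in Figure 9 (16 level1 bits, 14 level2 bits, 1-byte elements); the table size, the round-robin replacement, and the C function signatures standing in for the three instructions are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MTLB_ENTRIES 16                  /* hypothetical size */

struct lma_cfg { unsigned l1_bits, l2_bits, elem_log2; };
struct mtlb_entry { bool valid; uint32_t l1_index, l2_start; };

static struct lma_cfg cfg;
static struct mtlb_entry mtlb[MTLB_ENTRIES];
static unsigned next_fill;               /* round-robin replacement: an assumption */

/* Models lma_config: record the mapping geometry and flush the M-TLB. */
static void lma_config(unsigned l1_bits, unsigned l2_bits, unsigned elem_log2)
{
    assert(l1_bits >= 1 && l2_bits >= 1 && l1_bits + l2_bits <= 32);
    cfg = (struct lma_cfg){ l1_bits, l2_bits, elem_log2 };
    for (int i = 0; i < MTLB_ENTRIES; i++) mtlb[i].valid = false;
    next_fill = 0;
}

/* Models lma_fill: what the software miss handler would execute. */
static void lma_fill(uint32_t app_addr, uint32_t l2_start)
{
    mtlb[next_fill] = (struct mtlb_entry){ true, app_addr >> (32 - cfg.l1_bits), l2_start };
    next_fill = (next_fill + 1) % MTLB_ENTRIES;
}

/* Models lma: true on a hit (metadata address written); false means the
 * miss handler would run and the lookup would then be retried. */
static bool lma(uint32_t app_addr, uint32_t *meta_addr)
{
    unsigned low_bits = 32 - cfg.l1_bits - cfg.l2_bits;
    uint32_t l1 = app_addr >> (32 - cfg.l1_bits);
    uint32_t l2 = (app_addr >> low_bits) & ((1u << cfg.l2_bits) - 1u);
    for (int i = 0; i < MTLB_ENTRIES; i++) {
        if (mtlb[i].valid && mtlb[i].l1_index == l1) {
            *meta_addr = mtlb[i].l2_start + (l2 << cfg.elem_log2);
            return true;
        }
    }
    return false;
}
```

With the Figure 9 values, filling the entry for level2 start address 0x08046000 and translating 0xb3fb703a yields 0x08047c0e, matching the figure's example.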
We make the following three design choices. First, to support
a wide range of lifeguards, our design is flexible in terms
of the number of bits in the level1 and level2 indexes and the
element size. In fact, an lma_config flushes the M-TLB, and can
be used to dynamically configure the mapping. Section 7.3
shows that such flexibility can significantly reduce M-TLB
miss rates. Second, to reduce hardware complexity, LMA only
performs address translation; it does not issue any memory
accesses for metadata. Third, LMA only obtains the starting
(byte) address of a level2 element; determining the offset of
fields within an element (e.g., which bit corresponds to the
taint of a given application byte address) is the
responsibility of the lifeguard. In our experience, this does
not incur significant overhead because lifeguards can often
use a level2 element as the most frequent metadata size, as in
Figure 7.
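As an illustration of the third choice, the lifeguard-side field selection for a 2-bit-per-byte taint encoding might look like the sketch below. The bit ordering within the element (application byte k of a word using bits 2k+1..2k) and the function names are assumptions; the source does not specify the layout.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: one metadata byte covers a 4-byte application word, and
 * application byte k of that word uses bits 2k+1..2k of the element. */
static unsigned taint_of_byte(uint8_t element, uint32_t app_addr)
{
    unsigned shift = (app_addr & 3u) * 2u;
    return (element >> shift) & 3u;
}

/* Return the element with the 2-bit field for app_addr replaced by taint. */
static uint8_t set_taint_of_byte(uint8_t element, uint32_t app_addr, unsigned taint)
{
    unsigned shift = (app_addr & 3u) * 2u;
    assert(taint <= 3u);
    return (uint8_t)((element & ~(3u << shift)) | (taint << shift));
}
```

A 4-byte access needs none of this: it reads or writes the whole element byte that LMA points to, which is why the common case stays cheap.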
void dest_reg_op_mem_4B(UINT32 src_addr /* in %eax */,
                        UINT32 dest_reg /* in %edx */)
// App instruction event: dest_reg = dest_reg op mem(src_addr)
// Handler performs:      reg_taint(dest_reg) |= mem_taint(src_addr)

  /* Original */
  map *mp = level1_index[src_addr>>16];  // mov %eax, %ecx
                                         // shr $16, %ecx
                                         // mov level1_index(,%ecx,4), %ecx
  int idx = (src_addr & 0xffff) >> 2;    // and $0xffff, %eax
                                         // shr $2, %eax
  UChar mem_taint = mp[idx];             // movzbl (%ecx,%eax,1), %eax
  reg_taint[dest_reg] |= mem_taint;      // or %al, reg_taint(%edx)
  next_lba_record();                     // nlba

  /* Applying LMA */
  UChar *p = LMA_macro(src_addr);        // LMA %eax, %ecx
  UChar mem_taint = *p;                  // mov (%ecx), %al
  reg_taint[dest_reg] |= mem_taint;      // or %al, reg_taint(%edx)
  next_lba_record();                     // nlba

Figure 7. Applying LMA to a TAINTCHECK event handler. (Two-level
metadata structure: 16-bit level1 index, 14-bit level2 index,
2-bit in-byte offset. 2-bit tainted metadata is used so that the
frequent 4-byte operations on IA32 are handled with 1-byte
metadata accesses.)

  lma_config $imm, $miss   Set LMA config register and miss
                           handler
  lma %rs, %rt             map an application address (%rs)
                           into metadata address (%rt)
  lma_fill %ra, %rb        fill M-TLB with an entry given by
                           application address (%ra) and
                           metadata address (%rb)
Figure 8. New instructions for LMA.

[Figure 9 (diagram). The LMA config register, configured at
lma_config, holds the number of level1 bits, the number of
level2 bits, and the level2 element size (00: 1-byte,
01: 2-byte, 10: 4-byte, 11: 8-byte); the remaining bits are
reserved. At lma, the upper 20 bits of the application data
address (from %rs) are ANDed with the level1 mask and looked up
in the fast lookup table, whose entries pair a level1 index
(20 bits) with a level2 start address (32 bits); a failed
lookup invokes the miss handler. All 32 bits of the address are
ANDed with the level2 mask and shifted, where
shift = 32 - level1_bits - level2_bits - level2_element_size.
The result is added to the level2 start address to form the
metadata address (to %rt).
Example values:
  application data address   0xb3fb703a
  level1 mask                0xffff0     -> 0xb3fb0
  level2 start address       0x08046000
  level2 mask                0x0000fffc  -> 0x00007038
  shift                      2           -> 0x00001c0e
  metadata address           0x08047c0e]
Figure 9. A 32-bit LMA hardware design illustrated with the
TAINTCHECK example in Figure 7.

7 Experimental Evaluations
We evaluate the performance benefits and design choices
of our framework through both a simulation and a profiling
study. In the simulation study, we perform full-system
simulations of five diverse instruction-grain lifeguards
monitoring CPU-intensive benchmarks. However, due to simulation
time constraints, reduced input sets are used. In contrast, we
use the full-scale inputs in the profiling study, and
investigate the impact of design choices on miss rates and
filtered events.
7.1 Experimental Setup
Simulation Study. We implemented LBA by extending the
Simics [36] full-system simulation platform with log record
capture and event dispatch support. As described in Sec-
tion 3, we extend our original LBA design with the capability
of delivering multiple (hardwired) events per log record. The
three proposed techniques, namely Inheritance Tracking (IT),
Idempotent Filters (IF), and Load Metadata Address (LMA),
are implemented in the event dispatch module, and are indi-
vidually configurable by the lifeguard software.
Table 2 describes the simulation setup. We model a dual-
core IA32 system with LBA, running an application on one
core and a lifeguard monitoring the application on a second
core. Because the choice of ISA may significantly affect the
mix of monitored events, we choose the currently most pop-
ular ISA, IA32, in our study. Lacking a complete, publicly
available model for out-of-order IA32 simulations, we model
in-order cores. We simulate a two-level cache hierarchy with
private L1 caches and a shared L2 cache, which contains a
64KB log buffer. The log buffer accesses are modeled as fol-
lows. To reduce L1 interference, a one-cache-line (64-byte)
record buffer is used at both cores for caching compressed log
records. The producer core writes an entire line of records to
the L2 cache when the 64-byte buffer is filled. Likewise, the
consumer core reads an entire line of records at a time.
In the simulation, we assume an 8-entry IT table (for 8
general-purpose registers), a 32-entry fully-associative cache
for the IF filter, and a 1-cycle latency for the LMA instruction.
We explore the design space in the profiling study.
Profiling Study. We instrument the benchmark executables
with PIN [14] to obtain memory access, propagation, and ad-
dress computation events. We build three modules that model
the IT, IF, and LMA mechanisms, respectively. The modules
take the events as input and collect statistics on miss rates and
filtered events. We vary the design parameters of the tech-
niques to explore the design space.
Lifeguard Implementations. We implemented five diverse
lifeguards as shown in Figure 2. In addition to the common
TAINTCHECK supported by previous lifeguard-specific hard-
ware DIFT proposals [7, 8, 30, 34], we implemented a version
of TAINTCHECK enhanced with detailed tracking. It keeps
8-byte metadata for every 4-byte application word. When a
memory location is tainted by an instruction, the enhanced
TAINTCHECK records the 4-byte “from” memory address and
the 4-byte eip. In this way, a taint propagation trail can be
reconstructed upon a security violation.
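How such a trail could be walked is sketched below. The 8-byte record follows the description above (a 4-byte “from” address and a 4-byte eip); the toy 4 KB address space, the use of from_addr == 0 for “untainted”, the TAINT_ORIGIN marker for data tainted directly by input, and the function names are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 8 bytes of metadata for every 4-byte application word. */
struct taint_rec { uint32_t from_addr; uint32_t eip; };

#define WORDS 1024                        /* toy address space: 4 KB */
#define TAINT_ORIGIN 0xffffffffu          /* hypothetical: tainted directly by input */
static struct taint_rec meta[WORDS];      /* from_addr == 0 means untainted (assumed) */

/* Model a word copy from src to dst executed at instruction address eip. */
static void record_copy(uint32_t dst, uint32_t src, uint32_t eip)
{
    assert(src / 4 < WORDS && dst / 4 < WORDS);
    if (meta[src / 4].from_addr != 0)
        meta[dst / 4] = (struct taint_rec){ src, eip };   /* taint propagates */
    else
        meta[dst / 4] = (struct taint_rec){ 0, 0 };       /* overwritten: untainted */
}

/* Walk the trail backwards from addr; store the eips visited (most recent
 * first) and return how many were stored. */
static size_t trail(uint32_t addr, uint32_t *eips, size_t max)
{
    size_t n = 0;
    while (n < max && meta[addr / 4].from_addr != 0
                   && meta[addr / 4].from_addr != TAINT_ORIGIN) {
        eips[n++] = meta[addr / 4].eip;
        addr = meta[addr / 4].from_addr;
    }
    return n;
}
```

The max bound keeps the walk finite even if copies form a cycle, which a real lifeguard would also need to guard against.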
Table 2. Simulation Setup.
Simulator description
  Simulator         Virtutech Simics 3.0.17
  Extensions        Log capture and dispatch
  Target OS         Fedora Core 5 for IA32
  Processor core    Dual-core, in-order scalar
  Cache simulation  g-cache module
Simulation parameters
  Private L1I       16KB, 64B line, 2-way assoc, 1-cycle access lat.
  Private L1D       16KB, 64B line, 2-way assoc, 1-cycle access lat.
  Shared L2         512KB, 64B line, 8-way, 10-cycle access lat., 4 banks
  Main Memory       200-cycle latency
  Log buffer        64KB, assuming 1B per compressed record [3]

Table 3. Multithreaded Benchmarks for LOCKSET.
  Benchmark               Description and Input
  blast v2.2.16 [16]      Searching a nucleotide and protein
                          database of 134K sequences
  pbzip2 v1.0.1 [22]      Parallel data compressor, compress half
                          of CPU2000’s ref input.source
  pbunzip2 v1.0.1 [22]    Decompress pbzip2’s output in parallel
  water-nq SPLASH-2 [38]  Water simulation, 343 molecules
  zchaff 2002.7.15 [23]   SAT (Boolean Satisfiability Problem)
                          solver, circuit fault analysis
We strive to make the lifeguard code as optimized as possible.
First, as shown previously in Figure 7, we use 2-bit metadata
per application byte for the common TAINTCHECK. In this way,
the frequent 4-byte operations in IA32 applications correspond
to 1-byte metadata accesses, avoiding sub-byte access costs.
MEMCHECK employs a similar scheme. Second, we optimize the
code path for the frequent (stable) cases in metadata checking.
For example, given a memory access, our LOCKSET implementation
first checks if the 32-bit metadata is one of the recently seen
stable cases (see footnote 3) before performing the expensive
lockset intersection operation. Third, we compile the
lifeguards with gcc, where event handlers are C functions with
register parameters. We examine the disassembly of the frequent
code path (typically less than 10 instructions) for the
frequent event handlers. Where we see suboptimal code, we hand
optimize the assembly code.
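The LOCKSET fast path mentioned above can be sketched as follows. Locksets are represented here as 32-bit bitmasks purely for illustration, so the "expensive" intersection is a stand-in counted by slow_calls; the small table of recently seen stable values, its size, and its invalidation on lock/unlock are assumptions about one possible implementation.

```c
#include <assert.h>
#include <stdint.h>

#define STABLE_SLOTS 4                    /* hypothetical size */
static uint32_t stable[STABLE_SLOTS];     /* metadata values with Sm ∩ St = Sm */
static unsigned n_stable, next_slot;
static unsigned slow_calls;               /* counts slow-path intersections */

/* Stand-in for the expensive lockset intersection. */
static uint32_t intersect(uint32_t sm, uint32_t st)
{
    slow_calls++;
    return sm & st;
}

/* The thread's lockset changed: previously stable cases may no longer hold. */
static void on_lock_change(void) { n_stable = 0; next_slot = 0; }

/* Returns the new candidate set for the accessed location. */
static uint32_t on_access(uint32_t sm, uint32_t st)
{
    assert(n_stable <= STABLE_SLOTS);
    for (unsigned i = 0; i < n_stable; i++)
        if (stable[i] == sm)
            return sm;                    /* fast path: metadata unchanged */
    uint32_t r = intersect(sm, st);
    if (r == sm) {                        /* remember the stable case */
        stable[next_slot] = sm;
        next_slot = (next_slot + 1) % STABLE_SLOTS;
        if (n_stable < STABLE_SLOTS) n_stable++;
    }
    return r;
}
```

Repeated accesses whose candidate set is already a subset of the thread's lockset then skip the intersection entirely until the next lock or unlock.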
Benchmarks and Inputs. We choose CPU-intensive bench-
marks to “stress test” instruction-grain monitoring. In the sim-
ulation study, we use the SPEC2000 integer benchmarks to
evaluate all but the LOCKSET lifeguard. Because of the sim-
ulation time constraints, the test inputs are used. In contrast,
we explore the design space in the profiling study with the
ref inputs for the SPEC2000 integer benchmarks. For the data
race detection lifeguard, LOCKSET, we use five multithreaded
benchmarks in the simulation study, as shown in Table 3. Each
benchmark spawns two working threads that are restricted to
run on core 1 using Linux’s sched_setaffinity call. Note
that running multiple application threads on separate cores
requires synchronizing both event delivery and lifeguard
metadata access, and is beyond the scope of the paper. All
benchmarks were run to completion in both the simulation and the
profiling study.
7.2 Performance Study using Simulations
We evaluate the performance benefits of our framework us-
ing five diverse lifeguards. Figure 10 reports lifeguard perfor-
mance with and without our proposed optimizations. The Y-
axis reports the slowdowns of benchmark runs monitored by a
lifeguard compared to benchmark runs without lifeguards. We
see that compared to the LBA baseline bar, our techniques
dramatically reduce the slowdowns for all the lifeguards,
achieving less than 51% overhead on our challenging benchmarks
for all but the MEMCHECK lifeguard.
Footnote 3: That is, Sm ∩ St = Sm, hence the metadata does not
change.
The still-poor performance of MEMCHECK is not surprising.
Like ADDRCHECK, MEMCHECK checks memory accesses. Like
TAINTCHECK, it propagates metadata. In addition, MEMCHECK
also monitors other instruction-grain events, such as address
computation for checking pointer dereference. Therefore, the
events monitored by MEMCHECK are a superset of those monitored
by ADDRCHECK and TAINTCHECK. Its slowdown is expected to be
larger than the sum of ADDRCHECK’s slowdown and TAINTCHECK’s
slowdown.
Figure 11 shows the performance improvement from applying
our three techniques one by one (LMA for metadata mapping,
IT for metadata updates, and IF for metadata checks).
The Y-axis reports the average slowdowns across the bench-
marks. As described in Figure 2, LMA is applicable to all
the lifeguards, while IT and IF are relevant to a subset of
the lifeguards. Moreover, IF is a flexible mechanism where
a lifeguard can configure which event to filter and how to
invalidate the filter. Both ADDRCHECK and LOCKSET re-
quire that memory accesses are filtered and the filter is in-
validated at high-level events. However, they differ in that
ADDRCHECK does not distinguish memory loads and stores
while LOCKSET does. Like ADDRCHECK, MEMCHECK also checks
memory accesses and therefore employs the same filter.
In addition, MEMCHECK also checks the base and index reg-
isters of address computations.
From Figure 11, we see that applying an additional tech-
nique achieves significant performance gains in all cases; the
three techniques are complementary. This is because they are
targeting different aspects of the lifeguard overhead. LMA re-
duces the average event handler lengths by replacing multiple
metadata mapping instructions with a single LMA instruction.
As shown in Figure 12, it reduces the total lifeguard dynamic
instruction count by 16.7%–49.3%. IT and IF complement
LMA by reducing the event frequency for update events and
checking events, respectively. As shown in Figure 12, IT and
IF filter out 24.9%–77.8% of the update and checking events.
7.3 Exploring Design Space using PIN Analysis
IT. We use PIN to instrument the SPEC2000 integer bench-
marks with ref input. Figure 13(a) shows the percentage of
reduced propagation events using Inheritance Tracking. IT
removes 35.8%–82.0% of the propagation events.
IF. Figure 13(b) and (c) shows the average percentage of re-
duced check events by applying Idempotent Filters on the
SPEC benchmarks with ref input while varying the number
of filter entries and the set associativity of the filter design.
Figure 13 (b) and (c) correspond to accessibility checking in
ADDRCHECK and data race checking in LOCKSET, respec-
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
bz
ip
2
cr
af
ty
eo
n
ga
p
gc
c
gz
ip
m
cf
pa
rs
er
tw
ol
f
vo
rte
x
vp
r
Av
g
s
lo
w
d
o
w
n
s
AddrCheck
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
bz
ip
2
cr
af
ty
eo
n
ga
p
gc
c
gz
ip
m
cf
pa
rs
er
tw
ol
f
vo
rte
x
vp
r
A
vg
s
lo
w
d
o
w
n
s
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
bz
ip
2
cr
af
ty
eo
n
ga
p
gc
c
gz
ip
m
cf
pa
rs
er
tw
ol
f
vo
rte
x
vp
r
A
vg
0.0
1.0
2.0
3.0
4.0
5.0
6.0
7.0
bl
as
t
pb
un
zi
p2
pb
zi
p2
w
at
er
zc
ha
ff
Av
g
0.0
2.0
4.0
6.0
8.0
10.0
12.0
bz
ip
2
cr
af
ty
eo
n
ga
p
gc
c
gz
ip
m
cf
pa
rs
er
tw
ol
f
vo
rte
x
vp
r
Av
g
LBA Baseline
LBA Optimized 
MemCheck
TaintCheck TaintCheck w/ Detailed Tracking LockSet
Figure 10. Performance benefits of our framework on five diverse lifeguards.
3
.2
3
1
.9
0
1
.0
2
7
.8
0
6
.0
5
3
.8
1
3
.2
7
3
.3
6
2
.2
9
1
.3
6
4
.2
1
2
.7
1
1
.5
1
4
.2
5
3
.2
0
1
.4
0
0.0
1.0
2.0
3.0
4.0
5.0
6.0
7.0
8.0
9.0
10.0
B
A
S
E
L
M
A
L
M
A
+
IF
B
A
S
E
L
M
A
L
M
A
+
IT
L
M
A
+
IT
+
IF
B
A
S
E
L
M
A
L
M
A
+
IT
B
A
S
E
L
M
A
L
M
A
+
IT
B
A
S
E
L
M
A
L
M
A
+
IF
a
v
e
ra
g
e
 s
lo
w
d
o
w
n
s
AddrCheck MemCheck TaintCheck TaintCheck 
w/ Detailed 
Tracking
LockSet
Figure 11. Applying our three techniques one by one.
LMA: Reduced IT: Reduced IF: Reduced
dynamic instr update events check events
ADDRCHECK 28.9%–49.3% – 38.2%–65.1%
MEMCHECK 16.7%–23.6% 24.9%–59.7% 38.2%–62.8%
TAINTCHECK 20.2%–42.4% 37.4%–74.4% –
TAINTCHECK w/ 38.9%–44.3% 37.4%–74.4% –
detailed tracking
LOCKSET 18.2%–39.0% – 43.5%–77.8%
Figure 12. Statistics on reduced instructions and events across
the benchmarks.
0
10
20
30
40
50
60
70
80
90
100
b
z
ip
2
c
ra
ft
y
e
o
n
g
a
p
g
c
c
g
z
ip
m
c
f
p
a
rs
e
r
tw
o
lf
v
o
rt
e
x
v
p
r
re
d
u
c
e
d
 u
p
d
a
te
 e
v
e
n
ts
 (
%
)
0
10
20
30
40
50
60
70
80
8 16 32 64 128 256
number of filter entries
re
d
u
c
e
d
 c
h
e
c
k
 e
v
e
n
ts
 (
%
)
fully-assoc
16-way
8-way
4-way
2-way
1-way
0
10
20
30
40
50
60
70
80
8 16 32 64 128 256
number of filter entries
re
d
u
c
e
d
 c
h
e
c
k
 e
v
e
n
ts
 (
%
)
(a) IT (b) IF for combined loads and stores (c) IF for separate loads and stores
Figure 13. IT and IF filtering for metadata updates and checks with PIN-based analysis.
[Figure 14 plots omitted.
(a) Varying number of level-1 bits: M-TLB miss rate (%), 0 to 9, vs. number of level 1 bits (20 down to 8); series: 16-max, 32-max, 64-max, 128-max, 256-max, 16-avg, 32-avg, 64-avg, 128-avg, 256-avg.
(b) Fixed-bit vs. variable-bit design (level-1 bits in parentheses): M-TLB miss rate (%), 0 to 9, per benchmark; series: 16-entry, 64-entry, 256-entry. Variable level-1 bits use up to 1% of total pages or up to a 10% space increase.]

M-TLB miss rates (%) recovered from panel (b), columns in legend order:

Fixed level-1 bits    16-entry   64-entry   256-entry
bzip2(20)             1.8832     0.9412     0.2563
crafty(20)            4.1975     0.4897     0.0181
eon(20)               0.0531     0          0
gap(20)               1.3378     0.0891     0.0386
gcc(20)               0.1299     0.0402     0.0142
gzip(20)              2.0729     0.002      0.0013
mcf(20)               8.4477     7.0698     5.2523
parser(20)            3.5168     0.5706     0.1223
twolf(20)             2.5142     1.058      0.0422
vortex(20)            1.6045     0.3225     0.1565
vpr(20)               3.843      1.3543     0.1814

Flexible level-1 bits 16-entry   64-entry   256-entry
bzip2(12)             0          0          0
crafty(10)            0          0          0
eon(10)               0          0          0
gap(11)               0.0027     0          0
gcc(12)               0          0          0
gzip(11)              (values not recoverable from the extracted text)
mcf(13)               (values not recoverable from the extracted text)
parser(12)            0.0055     0          0
twolf(11)             0          0          0
vortex(15)            0.403      0.0936     0.0128
vpr(10)               0          0          0

Figure 14. Exploring the design space of M-TLB with PIN-based analysis.
tively. The difference is that the latter treats loads and stores
separately, whereas the former does not distinguish between them.
We see that IF effectively reduces the number of check events, and
that a set-associative design with 4 or more ways works as well as
the fully-associative design.
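To make the combined versus separate distinction concrete, the following is a minimal software sketch of a set-associative filter with LRU replacement. It is not the hardware design evaluated here: the class name `IdempotentFilter`, its parameters, the modulo set indexing, and the LRU policy are all illustrative assumptions. It also assumes that "separate" means a load and a store to the same address occupy distinct filter entries.

```python
from collections import OrderedDict


class IdempotentFilter:
    """Toy model of a set-associative filter for redundant check events.

    Hypothetical sketch: names, indexing, and LRU policy are assumptions.
    """

    def __init__(self, num_entries=32, ways=4, separate_load_store=True):
        if num_entries % ways != 0:
            raise ValueError("num_entries must be a multiple of ways")
        self.num_sets = num_entries // ways
        self.ways = ways
        self.separate = separate_load_store
        # One OrderedDict per set; insertion order doubles as LRU order.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.seen = 0
        self.filtered = 0

    def check(self, addr, is_store):
        """Return True if the event must reach the lifeguard, False if filtered."""
        self.seen += 1
        # In combined mode the access type is ignored when matching entries.
        key = (addr, is_store) if self.separate else (addr, None)
        entries = self.sets[addr % self.num_sets]
        if key in entries:
            entries.move_to_end(key)  # refresh LRU position
            self.filtered += 1
            return False
        if len(entries) >= self.ways:
            entries.popitem(last=False)  # evict the least recently used entry
        entries[key] = True
        return True

    def invalidate(self):
        """Drop all entries, e.g. when metadata changes make old checks stale."""
        for entries in self.sets:
            entries.clear()
```

In this sketch, a load followed by a store to the same address is filtered in combined mode but delivered in separate mode, which is one plausible reason the two configurations can show different reduction rates.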
M-TLB. Figure 14(a) varies the number of level-1 bits in the M-TLB
from 20 down to 8 and the number of M-TLB entries from 16 to 256.
For every configuration, the maximum and the average miss rates
across all benchmarks are shown. As the number of level-1 bits
decreases, the total number of level-1 entries decreases
exponentially, which produces the dramatic drop in miss rates seen
in the figure. On the other hand, lifeguard space overhead increases
as the number of level-1 bits decreases, because small holes in the
application space can no longer be distinguished. In Figure 14(b),
we explore this trade-off and compare a fixed-level-1 design (left)
with our flexible design (right). For the latter, the level-1 bits
are chosen so that either the lifeguard space grows by less than 10%
or the lifeguard uses up to 1% of the total 32-bit address space
(assuming a 1-1 mapping from application byte to metadata byte). We
see that our flexible design dramatically reduces the M-TLB miss
rates. In most cases, the miss rates are negligible.
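The trade-off between level-1 bits and lifeguard space can be sketched in a few lines. The code below is an illustration under stated assumptions, not the selection procedure used in the experiments: it assumes 4 KB pages, a full level-2 chunk allocated for every touched level-1 region, growth measured against the 20-bit configuration, and a search for the fewest level-1 bits that meets either bound. The function names are hypothetical.

```python
ADDRESS_BITS = 32
PAGE_BITS = 12  # assumption: 4 KB pages


def split_address(addr, level1_bits):
    """Split a 32-bit address into (level-1 index, offset within the chunk)."""
    level2_bits = ADDRESS_BITS - level1_bits
    return addr >> level2_bits, addr & ((1 << level2_bits) - 1)


def metadata_bytes(used_pages, level1_bits):
    """Metadata bytes allocated under a 1-1 byte mapping.

    Assumes every level-1 region touched by the application gets a full
    chunk, so holes smaller than a chunk are not distinguished.
    Requires level1_bits <= ADDRESS_BITS - PAGE_BITS.
    """
    shift = ADDRESS_BITS - PAGE_BITS - level1_bits  # page number -> region
    chunk_bytes = 1 << (ADDRESS_BITS - level1_bits)
    regions = {page >> shift for page in used_pages}
    return len(regions) * chunk_bytes


def choose_level1_bits(used_pages, max_growth=0.10, max_fraction=0.01):
    """Pick the fewest level-1 bits (8..20) meeting either space bound."""
    baseline = metadata_bytes(used_pages, 20)  # finest granularity considered
    space_cap = max_fraction * (1 << ADDRESS_BITS)
    for bits in range(8, 21):
        size = metadata_bytes(used_pages, bits)
        if size < baseline * (1 + max_growth) or size <= space_cap:
            return bits
    return 20
```

For example, an application that touches one page in each of 64 regions spaced 64 MB apart would get 13 level-1 bits under these assumptions, since 64 chunks of 512 KB fit within 1% of the 4 GB address space while 64 chunks of 1 MB do not.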
8 Conclusion
This paper presented a hardware acceleration framework for a wide
range of instruction-grain lifeguards. Three novel techniques
(Inheritance Tracking, Idempotent Filters, and Metadata-TLBs) are
shown to significantly reduce the major sources of overhead in such
lifeguards. For all the lifeguards studied except MEMCHECK, these
techniques reduced the overheads of LBA from 3.2–4.2X down to
1.02–1.5X on challenging CPU-intensive benchmarks. Our future work
seeks to reduce these overheads further and to study performance on
less challenging memory-bound benchmarks. The negligible overheads
for our sole memory-bound benchmark (mcf) are encouraging in this
regard.
Acknowledgments. This work is supported by grants from the National
Science Foundation and from Intel. Ramachandran is supported in part
by NSF grant CCF-0514876. We thank Anastassia Ailamaki, Limor Fix,
Greg Ganger, Michelle Goodstein, Bin Lin, and Radu Teodorescu for
their contributions and inputs to the LBA project.
References
[1] D. Bruening. Efficient, transparent, and comprehensive runtime code
manipulation. PhD thesis, MIT, 2004.
[2] W. R. Bush, J. D. Pincus, and D. J. Sielaff. A static analyzer for find-
ing dynamic programming errors. Software – Practice and Experience,
30(7), 2000.
[3] S. Chen, B. Falsafi, P. B. Gibbons, M. Kozuch, T. C. Mowry, R. Teodor-
escu, A. Ailamaki, L. Fix, G. R. Ganger, B. Lin, and S. W. Schlosser.
Log-based architectures for general-purpose monitoring of deployed
code. In ASID Workshop at ASPLOS, 2006.
[4] A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler. An empirical
study of operating systems errors. In SOSP, 2001.
[5] M. L. Corliss, E. C. Lewis, and A. Roth. DISE: A programmable macro
engine for customizing applications. In ISCA, 2003.
[6] C. Cowan, C. Pu, D. Maier, J. Walpole, P. Bakke, S. Beattie, A. Grier,
P. Wagle, Q. Zhang, and H. Hinton. StackGuard: Automatic adaptive
detection and prevention of buffer-overflow attacks. In USENIX Secu-
rity, 1998.
[7] J. R. Crandall and F. T. Chong. Minos: Control data attack prevention
orthogonal to memory model. In MICRO, 2004.
[8] M. Dalton, H. Kannan, and C. Kozyrakis. Raksha: A flexible informa-
tion flow architecture for software security. In ISCA, 2007.
[9] D. Engler, B. Chelf, A. Chou, and S. Hallem. Checking system rules us-
ing system-specific, programmer-written compiler extensions. In OSDI,
2000.
[10] M. D. Ernst, J. Cockrell, W. G. Griswold, and D. Notkin. Dynami-
cally discovering likely program invariants to support program evolu-
tion. IEEE TSE, 27(2), 2001.
[11] C. Flanagan, K. R. M. Leino, M. Lillibridge, G. Nelson, J. B. Saxe, and
R. Stata. Extended static checking for Java. In PLDI, 2002.
[12] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using
sequences of system calls. J. of Computer Security, 6(3), 1998.
[13] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice-Hall, 1992.
[14] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wal-
lace, V. J. Reddi, and K. Hazelwood. Pin: Building customized program
analysis tools with dynamic instrumentation. In PLDI, 2005.
[15] S. Narayanasamy, G. Pokam, and B. Calder. BugNet: Continuously
recording program execution for deterministic replay debugging. In
ISCA, 2005.
[16] National Center for Biotechnology Information (NCBI). Basic local
alignment search tool (BLAST). ftp://ftp.ncbi.nih.gov/blast/.
[17] N. Nethercote. Dynamic binary analysis and instrumentation. PhD
thesis, U. Cambridge, 2004. http://valgrind.org.
[18] N. Nethercote and J. Seward. Valgrind: A program supervision frame-
work. Electronic Notes in Theoretical Computer Science, 89(2), 2003.
[19] N. Nethercote and J. Seward. How to shadow every byte of memory
used by a program. In VEE, 2007.
[20] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight
dynamic binary instrumentation. In PLDI, 2007.
[21] J. Newsome and D. Song. Dynamic taint analysis for automatic de-
tection, analysis, and signature generation of exploits on commodity
software. In NDSS, 2005.
[22] Parallel Bzip2. http://compression.ca/pbzip2/.
[23] Princeton Zchaff. http://www.princeton.edu/∼chaff/zchaff.html.
[24] N. Provos. Improving host security with system call policies. In
USENIX Security, 2003.
[25] F. Qin, C.Wang, Z. Li, H. Kim, Y. Zhou, and Y. Wu. LIFT: A low-
overhead practical information flow tracking system for detecting secu-
rity attacks. In MICRO, 2006.
[26] F. Qin, J. Tucek, J. Sundaresan, and Y. Zhou. Rx: Treating bugs as
allergies - a safe method to survive software failures. In SOSP, 2005.
[27] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson.
Eraser: A dynamic race detector for multi-threaded programs. ACM
TOCS, 15(4), 1997.
[28] R. Shetty, M. Kharbutli, Y. Solihin, and M. Prvulovic. Heapmon: A
helper-thread approach to programmable, automatic, and low-overhead
memory bug detection. IBM J. on Research and Development, 50(2/3),
2006.
[29] W. Shi, H.-H. S. Lee, L. Falk, and M. Ghosh. An integrated framework
for dependable and revivable architectures using multicore processors.
In ISCA, 2006.
[30] G. E. Suh, J. W. Lee, D. Zhang, and S. Devadas. Secure program exe-
cution via dynamic information flow tracking. In ASPLOS, 2004.
[31] The MITRE Corporation. Common vulnerabilities and exposures
(CVE). http://cve.mitre.org/.
[32] J. Tucek, S. Lu, C. Huang, S. Xanthos, and Y. Zhou. Triage: Diagnosing
production run failures at the user’s site. In SOSP, 2007.
[33] G.-R. Uh, R. Cohn, B. Yadavalli, R. Peri, and R. Ayyagari. Analyz-
ing dynamic binary instrumentation overhead. In WBIA Workshop at
ASPLOS, 2006.
[34] G. Venkataramani, I. Doudalis, Y. Solihin, and M. Prvulovic. Flexi-
Taint: A programmable accelerator for dynamic taint propagation. In
HPCA, 2008.
[35] G. Venkataramani, B. Roemer, Y. Solihin, and M. Prvulovic. Mem-
Tracker: Efficient and programmable support for memory access mon-
itoring and debugging. In HPCA, 2007.
[36] Virtutech Simics. http://www.virtutech.com/.
[37] J. Wilander and M. Kamkar. A comparison of publicly available tools
for dynamic buffer overflow prevention. In NDSS, 2003.
[38] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The
SPLASH-2 programs: Characterization and methodological consider-
ations. In ISCA, 1995.
[39] M. Xu, R. Bodik, and M. D. Hill. A ’Flight Data Recorder’ for enabling
full-system multiprocessor deterministic replay. In ISCA, 2003.
[40] M. Xu, R. Bodik, and M. D. Hill. A regulated transitive reduction (RTR)
for longer memory race recording. In ASPLOS, 2006.
[41] P. Zhou, R. Teodorescu, and Y. Zhou. HARD: Hardware-assisted
lockset-based race detection. In HPCA, 2007.
[42] Y. Zhou, P. Zhou, F. Qin, W. Liu, and J. Torrellas. Efficient and flexible
architectural support for dynamic monitoring. ACM TACO, 2(1), 2005.
