Practical Byte-Granular Memory Blacklisting using Califorms by Sasaki, Hiroshi et al.
Hiroshi Sasaki
Columbia University
sasaki@cs.columbia.edu
Miguel A. Arroyo
Columbia University
miguel@cs.columbia.edu
M. Tarek Ibn Ziad
Columbia University
mtarek@cs.columbia.edu
Koustubha Bhat†
Vrije Universiteit Amsterdam
k.bhat@vu.nl
Kanad Sinha
Columbia University
kanad@cs.columbia.edu
Simha Sethumadhavan
Columbia University
simha@cs.columbia.edu
Abstract
Recent rapid strides in memory safety tools and hardware
have improved software quality and security. While coarse-grained
memory safety has improved, achieving memory
safety at the granularity of individual objects remains a challenge
due to high performance overheads, which can be
between ∼1.7x and 2.2x. In this paper, we present a novel idea
called Califorms, and associated program observations, to
obtain a low overhead security solution for practical, byte-granular
memory safety.
The idea we build on is called memory blacklisting, which
prohibits a program from accessing certain memory regions
based on program semantics. State-of-the-art hardware-supported
memory blacklisting, while much faster than software
blacklisting, creates memory fragmentation (on the order
of a few bytes) for each use of the blacklisted location. In this
paper, we observe that metadata used for blacklisting can be
stored in dead spaces in a program’s data memory and that
this metadata can be integrated into the microarchitecture by
changing the cache line format. Using these observations, the
Califorms-based system proposed in this paper reduces the
performance overheads of memory safety to ∼1.02x−1.16x
while providing byte-granular protection and maintaining
very low hardware overheads.
The low overhead offered by Califorms enables always-on
memory safety for small and large objects alike, and
the fundamental idea of storing metadata in empty spaces
and in the microarchitecture can be used for other security and
performance applications.
1 Introduction
With recent interest in microarchitecture side channels, it
is important not to lose sight of more traditional software
security threats. Security is a full-system property where
both software and hardware have to be secure for a system
to be secure. Historically, program memory safety violations
have provided a significant opportunity for exploitation: for
instance, a recent report from Microsoft revealed that the
root cause of more than half of all exploits was software
memory safety violations [1]. In response to the severity of
this threat, improvements in software checking tools, such as
AddressSanitizer [2], and advances in the form of commercial
hardware support for memory safety such as Oracle’s ADI [3]
and Intel’s MPX [4] have enabled programmers to detect and
fix memory safety violations before deploying software.
†Part of this work was carried out while the author was a visiting student
at Columbia University.
Current software and hardware-supported solutions excel
at providing coarse-grained memory safety, i.e., detecting
memory access beyond arrays and malloc’d regions (struct
and class instances). However, they are not suitable for fine-
grained memory safety (i.e., detecting overflows within ob-
jects, such as fields within a struct, or members within a
class) due to the high performance overheads and/or need
for making intrusive changes to the source code [5]. For
instance, a recent work that aims to provide intra-object
overflow protection functionality incurs a 2.2x performance
overhead [6]. These overheads are problematic because they
not only reduce the number of pre-deployment tests that
can be performed, but also impede post-deployment con-
tinuous monitoring, which researchers have pointed out is
necessary for detecting benign and malicious memory safety
violations [7]. Thus, a low overhead memory safety solution
that can enable continuous monitoring and provide complete
program safety has been elusive.
The overheads stem from how current designs
store and use the metadata necessary for enforcing memory
safety. In Intel MPX [4], Hardbound [8], CHERI [9, 10], and
PUMP [11], the metadata is stored for each pointer, and each
data or code memory access through a pointer performs
checks using the metadata. Since C/C++ memory accesses
tend to be highly pointer based, the performance and energy
overheads of accessing metadata can be significant in such
systems. Furthermore, the management of metadata, especially
if it is stored in a disjoint manner from the pointer, can
also create significant engineering complexity in terms of
performance and usability. This was evidenced by the fact
that compilers like LLVM and GCC dropped support for Intel
MPX in their mainline after an initial push to integrate it into
the toolchain [4].
arXiv:1906.01838v3 [cs.CR] 10 Jun 2019
Our approach for reducing overheads is two-fold. First,
instead of checking access bounds for each pointer access, we
blacklist all memory locations that should never be accessed.
In theory, this is a strictly weaker form of security than
whitelisting but we argue that in practice, blacklisting can
be more practical because of its ease of deployment and
low overheads. Informally, deployments apply whitelisting
techniques partially to reduce overheads and be backward
compatible which reduces their security, while blacklisting
techniques can be applied more broadly due to their low
overheads. Additionally, blacklisting techniques complement
defenses in existing systems better since they do not require
intrusive changes.
Our second optimization is a novel metadata storage
scheme. We observe that by using dead memory spaces in the
program, we can store metadata needed for memory safety
for free for nearly half of the program objects. These dead
spaces occur because of language alignment requirements
and are inserted by the compiler. When we cannot find a
naturally occurring dead space, we manually insert a dead
space. The overhead due to this dead space is smaller than
traditional methods for storing metadata because of how we
represent the metadata: our metadata is smaller (one byte)
as opposed to multiple bytes with traditional whitelisting or
blacklisting memory safety techniques.
A natural question is how the dead (more commonly re-
ferred to as padding) bytes can be distinguished from normal
bytes in memory. A straightforward scheme results in one
bit of additional storage per byte to identify if a byte is a
dead byte; this scheme results in a space overhead of 12.5%.
We reduce this overhead to one bit per 64B cache line (0.2%
overhead) without any loss of precision by only reformatting
how data is stored in cache lines. Our technique, Califorms,
uses one bit of additional storage to identify if the cache line
associated with the memory contains any dead bytes. For cal-
iformed cache lines, i.e., lines which contain dead bytes, the
actual data is stored following the “header”, which indicates
the location of dead bytes, as shown in Figure 1.
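The two storage overheads quoted above follow directly from the sizes involved. As a quick check, assuming 8-bit bytes and a 64B cache line as in the text:

```latex
\[
\frac{1\ \text{bit}}{8\ \text{bits}} = 12.5\%,
\qquad
\frac{1\ \text{bit}}{64 \times 8\ \text{bits}} = \frac{1}{512} \approx 0.2\%.
\]
```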
With this support, it is easy to describe how a Califorms
based system for memory safety works. The dead bytes,
either naturally harvested or manually inserted, are used to
indicate memory regions that should never be accessed by
a program (i.e., blacklisting). If an attacker accesses these
regions, we detect this rogue access without any additional
metadata accesses as our metadata resides inline.
Our experimental results on the SPEC CPU2006 bench-
mark suite indicate that the overheads of Califorms are quite
low: software overheads range from 2 to 14% slowdown
(or alternatively, 1.02x to 1.16x performance overhead) de-
pending on the amount and location of padding bytes used.
This provides the functionality for the user/customer to tune
the security according to their performance requirements.
Hardware induced overheads are also negligible, on average
less than 1%. All of the software transformations are
performed using the LLVM compiler framework using a
front-end source-to-source transformation. These overheads
are substantially lower compared to the state-of-the-art software
or hardware supported schemes (viz., 2.2x performance
and 1.1x memory overheads for EffectiveSan [6], and 1.7x
performance and 2.1x memory overheads for Intel MPX [4]).
[Figure 1 diagram: cache lines moving between the core, the L1D, and the L2. A line containing a dead byte is kept in the Califorms format with a header; a line without dead bytes is kept in the natural format at both levels.]
Figure 1. Califorms offers memory safety by detecting accesses
to dead bytes in memory. Dead bytes are not stored beyond
the L1 data cache and are identified using a special header
in the L2 cache (and beyond), resulting in very low overhead.
The conversion between these formats happens when lines
are filled or spilled between the L1 and L2 caches. When a
line has no dead bytes, it is stored in the
same natural format across the memory system.
2 Motivation
One of the key ways in which we mitigate the overheads
for fine-grained memory safety is by opportunistically har-
vesting padding bytes in programs to store metadata. So how
often do these occur in programs? Before we answer that
question let us concretely understand padding bytes with
an example. Consider the struct A defined in Listing 1(a).
Let us say the compiler inserts a three-byte padding between
char c and int i as in Listing 1(b) because of the
C language requirement that integers be aligned to
their natural size (which we assume to be four bytes here).
Such padding is not limited to C/C++; it also occurs in
many other languages and their runtime implementations.
To obtain a quantitative estimate on the amount of paddings,
we developed a compiler pass to statically collect the padding
size information. Figure 3 presents the histogram of struct
densities for SPEC CPU2006 C and C++ benchmarks and
the V8 JavaScript engine. Struct density is defined as the
sum of the sizes of all fields divided by the total size of the
struct including the padding bytes (i.e., the lower
the struct density, the more padding bytes the struct has).
The results reveal that 45.7% and 41.0% of structs within
SPEC and V8, respectively, have at least one byte of padding.
This is encouraging since even without introducing addi-
tional padding bytes (no memory overhead), we can offer
protection for certain compound data types restricting the
remaining attack surface.
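To make the struct density definition concrete, the following is a minimal sketch that computes it for struct A from Listing 1(a) using sizeof. The helper names are hypothetical and are not part of the compiler pass described above; the actual values depend on the target ABI.

```c
#include <stddef.h>

/* struct A from Listing 1(a). */
struct A {
    char c;
    int i;
    char buf[64];
    void (*fp)(void);
    double d;
};

/* Sum of the sizes of the individual fields, excluding padding. */
size_t struct_a_field_bytes(void) {
    return sizeof(char) + sizeof(int) + sizeof(char[64]) +
           sizeof(void (*)(void)) + sizeof(double);
}

/* Padding bytes the compiler inserted to satisfy alignment. */
size_t struct_a_padding_bytes(void) {
    return sizeof(struct A) - struct_a_field_bytes();
}

/* Struct density: field bytes divided by total struct size. */
double struct_a_density(void) {
    return (double)struct_a_field_bytes() / (double)sizeof(struct A);
}
```

On a typical LP64 ABI such as x86-64 Linux, the fields add up to 85 bytes in an 88-byte struct, so the three padding bytes after c are available for opportunistic harvesting.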
Naturally, one might inquire about the safety for the rest
of the program. To offer protection for all defined compound
data types (called the full strategy), we can insert random
sized padding bytes, also referred to as security bytes, between
every field of a struct or member of a class as in
Listing 1(c). Random sized security bytes are chosen to provide
a probabilistic defense, as fixed sized security bytes can
be jumped over by an attacker once they identify the actual
size (and the exact memory layout). Additionally, by carefully
choosing the minimum and maximum sizes for insertion, we
can keep the average security byte size small (such as two or
three bytes). Intuitively, the higher the unpredictability (or
randomness) there is within the memory layout, the higher
the security level we can offer.

    struct A {
        char c;
        int i;
        char buf[64];
        void (*fp)();
        double d;
    };
    (a) Original.

    struct A_opportunistic {
        char c;
        /* compiler inserts padding
         * bytes for alignment */
        char padding_bytes[3];
        int i;
        char buf[64];
        void (*fp)();
        double d;
    };
    (b) Opportunistic.

    struct A_full {
        /* we protect every field with
         * random security bytes */
        char security_bytes_0[2];
        char c;
        char security_bytes_1[1];
        int i;
        char security_bytes_2[3];
        char buf[64];
        char security_bytes_3[2];
        void (*fp)();
        char security_bytes_4[1];
        double d;
        char security_bytes_5[2];
    };
    (c) Full.

    struct A_intelligent {
        char c;
        int i;
        /* we protect boundaries
         * of arrays and pointers with
         * random security bytes */
        char security_bytes_0[3];
        char buf[64];
        char security_bytes_1[2];
        void (*fp)();
        char security_bytes_2[3];
        double d;
    };
    (d) Intelligent.

Listing 1. Example of three security bytes harvesting strategies: (b) opportunistic uses the existing padding bytes as security
bytes, (c) full protects every field within the struct with security bytes, and (d) intelligent surrounds arrays and pointers with
security bytes.

[Figure 3 plots: histograms of the fraction of structs (y-axis, 0.0 to 0.8) versus struct density (x-axis, 0.0 to 1.0) for (a) SPEC CPU2006 C and C++ benchmarks and (b) the V8 JavaScript engine.]
Figure 3. Struct density histogram of SPEC CPU2006 benchmarks
and the V8 JavaScript engine. More than 40% of the
structs have at least one padding byte.

[Figure 4 plot: average (AVG) slowdown, on an axis from 0% to 8%, for padding sizes 1B through 7B.]
Figure 4. Average performance overhead with additional
paddings (one byte to seven bytes) inserted for every field
within structs (and classes) of SPEC CPU2006 C and C++
benchmarks.
While the full strategy provides the widest coverage, not
all of the security bytes provide the same security utility.
For example, basic data types such as char and int cannot
be easily overflowed past their bounds. The idea behind the
intelligent insertion strategy is to prioritize insertion of se-
curity bytes into security-critical locations as presented in
Listing 1(d). We choose data types which are most prone
to abuse by an attacker via overflow type accesses: (1) ar-
rays and (2) data and function pointers. In the example in
Listing 1(d), the array buf[64] and the function pointer fp
are protected with random sized security bytes. While it is
possible to utilize padding bytes present between other data
types without incurring memory overheads, doing so would
come at an additional performance overhead.
In comparison to opportunistic harvesting, the other more
secure strategies (e.g., full strategy) come at an additional
performance overhead. We analyze the performance trend in
order to decide how many security bytes can be reasonably
inserted. For this purpose we developed an LLVM pass which
pads every field of a struct with fixed size paddings. We
measure the performance of SPEC CPU2006 benchmarks
by varying the padding size from one byte to seven bytes.
The detailed evaluation environment and methodology are
described later in Section 8.
Figure 4 demonstrates the average slowdown when in-
serting additional bytes for harvesting. As expected, we can
see the performance overheads increase as we increase the
padding size, mainly due to ineffective cache usage. On av-
erage the slowdown is 3.0% for one byte and 7.6% for seven
bytes of padding. The figure presents the ideal (lower bound)
performance overhead when fully inserting security bytes
into compound data types; the hardware and software mod-
ifications we introduce add additional overheads on top of
these numbers. We strive to provide a mechanism that allows
the user to tune the security level at the cost of performance
and thus explore several security byte insertion strategies to
reduce the performance overhead in the paper.
3 Full System Overview
The Califorms framework consists of multiple components,
which we discuss in the following sections:
• Architecture Support. An ISA extension consisting of a machine
instruction called CFORM that performs califorming
(i.e., (un)setting security bytes) of cache lines, and a privileged
Califorms exception which is raised upon misuse of
security bytes (Section 4).
• Microarchitecture Design. New cache line formats that
enable low cost access to the metadata. We propose different
Califorms for the L1 cache vs. the L2 cache and beyond (Section 5).
• Software Design. Compiler, memory allocator, and operating
system extensions which insert the security bytes at
compile time and manage the security bytes via the CFORM
instruction at runtime (Section 6).
At compile time each compound data type, a struct or a
class, is examined and security bytes are added according
to a user-defined insertion policy (viz., opportunistic, full, or
intelligent) by a source-to-source translation pass. At runtime,
when compound data type instances are created dynamically
in the heap, we use
a new version of malloc that issues CFORM instructions to
set the security bytes after the space is allocated. When
the CFORM instruction is executed, the cache line format is
transformed at the L1 cache controller (assuming a cache
miss) and is inserted into the L1 data cache. Upon an L1
eviction, the L1 cache controller re-califorms the cache line
to match the Califorms format of the L2 cache.
While we add additional metadata storage to the caches,
we refrain from doing so for main memory and persistent
storage to keep the changes local within the CPU core. When
a califormed cache line is evicted from the last-level cache to
main memory, we keep the cache line califormed and store
the one additional metadata bit in spare ECC bits, similar
to Oracle’s ADI [3].¹ When a page is swapped out from
main memory, the page fault handler stores the metadata for
all the cache lines within the page into a reserved address
space managed by the operating system; the metadata is re-
claimed upon swap in. Therefore, our design keeps the cache
line format califormed throughout the memory hierarchy. A
califormed cache line is un-califormed only when the cor-
responding bytes cross the boundary where the califormed
data cannot be understood by the other end, such as writing
to I/O (e.g., pipe, filesystem or network socket). Finally, when
an object is freed, the freed bytes are califormed and zeroed
for offering temporal safety.
At runtime, when a rogue load or store accesses a cali-
formed byte the hardware returns a privileged, precise secu-
rity exception to the next privilege level which can take any
appropriate action including terminating the program.
¹ ADI stores four bits of metadata per cache line for allocation granularity
enforcement while Califorms stores one bit for sub-allocation granularity
enforcement.
Table 1. K-map for the CFORM instruction. X represents
“Don’t Care”.

                 R2, R3
Initial          X, Disallow      Unset, Allow     Set, Allow
Regular Byte     Regular Byte     Exception        Security Byte
Security Byte    Security Byte    Regular Byte     Exception
4 Architecture Support
4.1 CFORM Instruction
The format of the instruction is “CFORM R1, R2, R3”. The value
in register R1 points to the starting (cache aligned) address in
the virtual address space, denoting the start of the 64B chunk
which fits in a single 64B cache line. Table 1 represents a
K-map for the CFORM instruction. The value in register R2
indicates the attributes of said region represented in a bit
vector format (1 to set and 0 to unset the security byte). The
value in register R3 is a mask to the corresponding 64B re-
gion, where 1 allows and 0 disallows changing the state of
the corresponding byte. The mask is used to perform partial
updates of metadata within a cache line. We throw a privileged
Califorms exception when the CFORM instruction tries
to set a security byte at an existing security byte location,
or to unset a security byte at a normal byte location.
The CFORM instruction is treated similarly to a store instruction
in the processor pipeline: it first fetches the corresponding
cache line into the L1 data cache upon an L1 miss
(assuming a write-allocate cache policy). Next, it manipulates
the bits in the metadata storage to appropriately set or unset
the security bytes.²
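The K-map in Table 1 can be sketched as a small software model. The code below is only an illustration under stated assumptions: the function and type names are hypothetical, the per-line metadata is modeled as a 64-bit mask (bit i set means byte i is a security byte), and a faulting CFORM is assumed to leave the metadata unchanged, which the text does not specify.

```c
#include <stdint.h>

enum cform_result { CFORM_OK = 0, CFORM_EXCEPTION = 1 };

/* Software model of "CFORM R1, R2, R3" on one 64B line.
 * secmask:    current metadata, bit i set => byte i is a security byte.
 * set_bits:   R2, 1 to set and 0 to unset the security byte.
 * allow_bits: R3, 1 allows and 0 disallows changing that byte's state. */
enum cform_result cform_apply(uint64_t *secmask, uint64_t set_bits,
                              uint64_t allow_bits) {
    uint64_t want_set = set_bits & allow_bits;
    uint64_t want_unset = ~set_bits & allow_bits;

    /* Setting an existing security byte or unsetting a regular byte
     * raises the privileged Califorms exception (assumed all-or-nothing). */
    if ((want_set & *secmask) || (want_unset & ~*secmask))
        return CFORM_EXCEPTION;

    *secmask = (*secmask | want_set) & ~want_unset;
    return CFORM_OK;
}
```

For example, marking the three padding bytes of struct A_opportunistic at offsets 1 to 3 corresponds to set_bits = allow_bits = 0xE; issuing the same request a second time would raise the exception.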
4.2 Privileged Exceptions
When the hardware detects an access violation, it throws
a privileged exception once the instruction becomes
non-speculative. Some library functions, such as memcpy,
violate the aforementioned rules on security bytes,
so we need a way to suppress the exceptions. In order to
whitelist such functions, we manipulate the exception mask
registers and let the exception handler decide whether to
suppress the exception or not. Although privileged exception
handling is more expensive than handling user-level excep-
tions (because it requires a context switch to the kernel), we
stick with the former to limit the attack surface. We rely on
the fact that the exception itself is a rare event and would
have negligible effect on performance.
² We also investigate the possibility of using a variant of the CFORM instruction
which does not store the modified cache line into the L1 data cache, just
like the non-temporal (or streaming) load/store instructions (e.g., MOVNTI,
MOVNTQ, etc.), in Section 6.1.
[Figure 5 diagram: a 64B cache line (data bytes [0] to [63], 1B each) with 8B of additional storage holding one “security byte?” bit per data byte.]
Figure 5. Califorms-bitvector: L1 Califorms implementation
using a bit vector that indicates whether each byte is a security
byte. HW overhead of 8B per 64B cache line.
[Figure 6 diagram: the L1 hit pipeline with address calculation (tag, index, offset), address decoder, tag array and comparator, data array and aligner, plus the metadata array and Califorms checker that produce the exception signal.]
Figure 6. Pipeline diagram for the L1 cache hit operation.
The shaded components correspond to Califorms.
5 Microarchitecture Design
The microarchitectural support for our technique aims to
keep the common case fast: the L1 cache uses the straightforward
scheme of having one bit of additional storage per byte.
All califormed cache lines are transformed to the straightfor-
ward scheme at the L1 data cache controller so that typical
loads and stores which hit in the L1 cache do not have to
perform address calculations to figure out the location of
original data (which is required for Califorms of L2 cache
and beyond). This design decision guarantees that for the
common case the latencies will not be affected due to secu-
rity functionality. Beyond the L1, the data is stored in the
optimized califormed format, i.e., one bit of additional stor-
age for the entire cache line. The transformation happens
when the data is filled in or spilled from the L1 data cache
(between the L1 and L2), and adds minimal latency to the L1
miss latency. For main memory, we store the additional bit
per cache line in the DRAM ECC spare bits, thus avoiding
any cycle time impact on DRAM access and any
modifications to the DIMM architecture.
5.1 L1 Cache: Bit Vector Approach
To satisfy the L1 design goal we consider a naive (but low
latency) approach which uses a bit vector to identify which
bytes are security bytes in a cache line. Each bit of the bit
vector corresponds to one byte of the cache line and represents
its state (normal byte or security byte). Figure 5 presents
a schematic view of this implementation, califorms-bitvector.
This approach requires a 64-bit (8B) bit vector per 64B cache
line, which adds 12.5% storage overhead for just the L1-D
caches (comparable to ECC overhead for reliability).
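[Figure 7 diagram: a 64B cache line with one additional “line califormed?” bit. A 2-bit field in byte [0] encodes the number of security bytes, and the header occupies up to the first four bytes:
  00 (1 security byte):    Addr0                             (byte [0])
  01 (2 security bytes):   Addr0 Addr1                       (bytes [0] to [1])
  10 (3 security bytes):   Addr0 Addr1 Addr2                 (bytes [0] to [2])
  11 (4+ security bytes):  Addr0 Addr1 Addr2 Addr3 Sentinel  (bytes [0] to [3])
Each Addr field and the Sentinel field is 6 bits wide.]
Figure 7. Califorms-sentinel that stores a bit vector in security
byte locations. HW overhead of 1-bit per 64B cache
line.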
Figure 6 shows the L1 data cache hit path modifications for
Califorms. If a load accesses a califormed byte (which is de-
termined by reading the bit vector) an exception is recorded
to be processed when the load is ready to be committed.
Meanwhile, the load returns a pre-determined value for the
security byte (in our design the value 0 which is the value
that the memory region is initialized to upon deallocation).
The reason to return the pre-determined value is to avoid
a speculative side channel attack to identify security byte
locations and is discussed in greater detail in Section 7. On
store accesses to califormed bytes we report an exception
before the store commits.
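The L1 hit-path check described above can be modeled in a few lines. This is a sketch under assumptions rather than the hardware design: the type and function names are hypothetical, the access is assumed not to cross a cache line, and the return value stands in for the exception that hardware would record and raise at commit.

```c
#include <stdint.h>

typedef struct {
    uint8_t data[64];
    uint64_t secmask; /* califorms-bitvector: bit i set => byte i is a security byte */
} l1_line_t;

/* Load `size` bytes starting at `offset`. Security bytes read as the
 * pre-determined value 0. Returns 1 if an exception must be recorded. */
int l1_load(const l1_line_t *line, unsigned offset, unsigned size,
            uint8_t *out) {
    int violation = 0;
    for (unsigned i = 0; i < size && offset + i < 64; i++) {
        unsigned b = offset + i;
        if ((line->secmask >> b) & 1) {
            out[i] = 0; /* never expose the stored value */
            violation = 1;
        } else {
            out[i] = line->data[b];
        }
    }
    return violation;
}

/* Store `size` bytes at `offset`. A store that touches any security byte
 * is rejected before it modifies the line. Returns 1 on violation. */
int l1_store(l1_line_t *line, unsigned offset, unsigned size,
             const uint8_t *in) {
    for (unsigned i = 0; i < size && offset + i < 64; i++)
        if ((line->secmask >> (offset + i)) & 1)
            return 1;
    for (unsigned i = 0; i < size && offset + i < 64; i++)
        line->data[offset + i] = in[i];
    return 0;
}
```

Checking the whole store before writing mirrors the requirement that the exception is reported before the store commits.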
5.2 L2 Cache and Beyond: Sentinel Approach
For L2 and beyond, we take a different approach that al-
lows us to recognize whether each byte is a security byte
with fewer bits, as using the L1 metadata format throughout
the system will increase the cache area overhead by 12.5%,
which may not be acceptable. Figure 7 illustrates our pro-
posed califorms-sentinel, which has a 1-bit or 0.2% metadata
overhead per 64B cache line.
The key insight that enables these savings is the following
observation: the number of addressable bytes in a cache line
is less than what can be represented by a single byte (we
only need six bits). For example, let us assume that there is
(at least) one security byte in a 64B cache line. With
byte-granular protection, at most 63 bytes are then normal
(non-security) bytes, so their least (or most) significant six bits
take at most 63 distinct values out of the 64 possible six-bit
patterns. Therefore, we are
guaranteed to find a six-bit pattern which is not present
in any of the normal bytes’ least (or most) significant six
bits. We use this pattern as a sentinel value to represent the
security bytes within the cache line.
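The pigeonhole argument above translates directly into a search over a 64-bit used-values vector. The following is a minimal sketch; the function name and the choice of the least significant six bits are assumptions made for illustration.

```c
#include <stdint.h>

/* Returns a 6-bit pattern that does not occur in the least significant
 * six bits of any normal byte, or -1 if every pattern is in use (which
 * can only happen when the line has no security byte).
 * secmask: bit i set => byte i is a security byte and is ignored. */
int find_sentinel(const uint8_t data[64], uint64_t secmask) {
    uint64_t used = 0;
    for (int i = 0; i < 64; i++)
        if (!((secmask >> i) & 1))
            used |= (uint64_t)1 << (data[i] & 0x3F);

    for (int v = 0; v < 64; v++)
        if (!((used >> v) & 1))
            return v; /* first unused pattern */
    return -1;
}
```

With at least one security byte there are at most 63 normal bytes, so the second loop always finds a free pattern.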
If we store the six bit sentinel value as additional metadata,
the overhead will be seven bits (six bits plus one bit to specify
if the cache line is califormed) per cache line. Instead, we
propose a new cache line format which stores the sentinel
value within a security byte to reduce the metadata overhead
down to one bit per cache line. The idea is to use four different
formats depending on the number of security bytes in the
cache line, as we explain below.

1: Read the Califorms metadata for the evicted line and OR them
2: if result is 0 then
3:     Evict the line as is and set Califorms bit to 0
4: else
5:     Set Califorms bit to 1
6:     Perform following operations on the cache line:
7:         Scan the least significant 6 bits of every byte to determine the sentinel
8:         Get locations of the first 4 security bytes
9:         Store data of the first 4 bytes in locations obtained in 8:
10:        Fill the first 4 bytes based on Figure 7
11:        Use the sentinel to mark the remaining security bytes
12: end
Algorithm 1. Califorms conversion from the L1 cache
(califorms-bitvector) to L2 cache (califorms-sentinel).

1: Read the Califorms bit for the inserted line
2: if result is 0 then
3:     Set the Califorms metadata bit vector to [0]
4: else
5:     Perform following operations on the cache line:
6:         Check the least significant 2 bits of byte 0
7:         Set the metadata of byte[Addr[0-3]] to 1 based on 6:
8:         Set the metadata of byte[Addr[byte == sentinel]] to 1
9:         Set the data of byte[0-3] to byte[Addr[0-3]]
10:        Set the new locations of byte[Addr[0-3]] to zero
11: end
Algorithm 2. Califorms conversion from the L2 cache
(califorms-sentinel) to L1 cache (califorms-bitvector).
Califorms-sentinel stores the metadata into the first four
bytes (at most) of the 64B cache line. Two bits of the 0th byte
are used to specify the number of security bytes within the
cache line: 00, 01, 10 and 11 represent one, two, three, and
four or more security bytes, respectively. If there is only one
security byte in the cache line, we use the remaining six bits
of the 0th byte to specify the location of the security byte
(and the original value of the 0th byte is stored in the security
byte). Similarly, when there are two or three security bytes
in the cache line, we use the bits of the 1st and 2nd bytes
to locate them. The key observation is that we gain two
bits per security byte since we only need six bits to specify
a location in the cache line. Therefore, when we have four
security bytes, we can locate four addresses and have six bits
remaining in the first four bytes. These remaining six bits can
be used to store a sentinel value, which allows us to have
any number of additional security bytes.
Although the sentinel value depends on the actual values
within the 64B cache line, it works naturally with a write-
allocate L1 cache, which is the most commonly used cache
allocation policy in modern microprocessors. The cache line
format can be converted upon L1 cache eviction and inser-
tion (califorms-bitvector to/from califorms-sentinel), and the
sentinel value only needs to be found upon L1 cache eviction.
Also, it is important to note that califorms-sentinel supports
critical-word first delivery since the security byte locations
can be quickly retrieved by scanning only the first 4B of
the first 16B flit. Algorithms 1 and 2 describe the high-level
process used for converting from L1 to L2 Califorms and vice
versa.
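Algorithms 1 and 2 can be exercised with a small software model. The sketch below is an illustration, not the evaluated hardware: all names are hypothetical, and the exact bit packing is an assumption (the first four bytes are read as a little-endian word with the count code in bits 1:0, Addr0 to Addr3 in successive 6-bit fields, and the sentinel in bits 31:26). It also assumes that displaced header data is moved, in order, into the listed security byte slots that lie outside the header.

```c
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64

typedef struct { uint8_t data[LINE_BYTES]; uint64_t secmask; } l1_line_t;   /* califorms-bitvector */
typedef struct { uint8_t data[LINE_BYTES]; uint8_t califormed; } l2_line_t; /* califorms-sentinel */

static int is_sec(uint64_t mask, int i) { return (int)((mask >> i) & 1); }

/* First 6-bit pattern unused by any normal byte (exists when mask != 0). */
static uint8_t pick_sentinel(const uint8_t *d, uint64_t mask) {
    uint64_t used = 0;
    for (int i = 0; i < LINE_BYTES; i++)
        if (!is_sec(mask, i)) used |= (uint64_t)1 << (d[i] & 0x3F);
    uint8_t v = 0;
    while (v < 63 && ((used >> v) & 1)) v++;
    return v;
}

/* L1 -> L2 conversion, following Algorithm 1. */
void califorms_spill(const l1_line_t *in, l2_line_t *out) {
    memcpy(out->data, in->data, LINE_BYTES);
    out->califormed = (in->secmask != 0);
    if (!out->califormed) return; /* natural format */

    uint8_t sentinel = pick_sentinel(in->data, in->secmask);
    int addr[4], n = 0;
    for (int i = 0; i < LINE_BYTES; i++) {
        if (!is_sec(in->secmask, i)) continue;
        if (n < 4) addr[n++] = i;     /* first four go into the header */
        else out->data[i] = sentinel; /* the rest are marked in place */
    }
    /* Move displaced header data into listed slots outside the header. */
    for (int j = 0, t = 0; j < n; j++) {
        if (is_sec(in->secmask, j)) continue;
        while (addr[t] < n) t++;
        out->data[addr[t++]] = in->data[j];
    }
    uint32_t h = (uint32_t)(n - 1); /* 2-bit count code */
    for (int i = 0; i < n; i++) h |= (uint32_t)addr[i] << (2 + 6 * i);
    if (n == 4) h |= (uint32_t)sentinel << 26;
    for (int i = 0; i < n; i++) out->data[i] = (uint8_t)(h >> (8 * i));
}

/* L2 -> L1 conversion, following Algorithm 2. */
void califorms_fill(const l2_line_t *in, l1_line_t *out) {
    memcpy(out->data, in->data, LINE_BYTES);
    out->secmask = 0;
    if (!in->califormed) return;

    int n = (in->data[0] & 0x3) + 1;
    uint32_t h = 0;
    for (int i = 0; i < n; i++) h |= (uint32_t)in->data[i] << (8 * i);
    int addr[4];
    for (int i = 0; i < n; i++) {
        addr[i] = (int)((h >> (2 + 6 * i)) & 0x3F);
        out->secmask |= (uint64_t)1 << addr[i];
    }
    if (n == 4) { /* only the 11 format carries a sentinel */
        uint8_t sentinel = (uint8_t)(h >> 26);
        for (int i = 4; i < LINE_BYTES; i++)
            if ((in->data[i] & 0x3F) == sentinel)
                out->secmask |= (uint64_t)1 << i;
    }
    /* Restore the header bytes, then zero every security byte. */
    for (int j = 0, t = 0; j < n; j++) {
        if (is_sec(out->secmask, j)) continue;
        while (addr[t] < n) t++;
        out->data[j] = in->data[addr[t++]];
    }
    for (int i = 0; i < LINE_BYTES; i++)
        if (is_sec(out->secmask, i)) out->data[i] = 0;
}
```

The relocation loops handle the corner case where a security byte itself falls inside the header region: such a header byte has no data to save, so exactly as many listed slots remain outside the header as there are normal header bytes to relocate.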
Figure 8 shows the logic diagram for the spill module. The
circled numbers refer to the corresponding steps in Algo-
rithm 1. In the top-left corner, the Califorms metadata for
the evicted line is ORed to construct the L2 cache (califorms-
sentinel) metadata bit. The bottom-right square details the
process of determining the sentinel. We scan the least significant six bits of every
byte, decode them, and OR the outputs to construct a used-values
vector. The used-values vector is then processed by a
Find-index block to get the sentinel (line 7). The Find-index
block takes a 64-bit input vector and searches for the index of
the first zero. It is constructed using 64 shift blocks followed
by a single comparator.
The top-right corner of Figure 8 shows the logic for getting
the locations of the first four security bytes (line 8). It consists
of four successive combinational Find-index blocks (each
detecting one security byte) in our evaluated design. This
logic can be easily pipelined into four stages, if needed, to
completely hide the latency of the spill process in the pipeline.
Finally, we store the data of the first four bytes in locations
obtained from the Find-index blocks and fill the same four
bytes based on Figure 7.
Figure 9 shows the logic diagram for the fill module, as
summarized in Algorithm 2. The blue (==) blocks are con-
structed using logic comparators. The Califorms bit of the
L2 inserted line is used to control the value of the L1 cache
(califorms-bitvector) metadata. The first two bits of the L2
inserted line are used as inputs for the comparators to decide
on the metadata bits of the first four bytes as specified in
Figure 7. Only if those two bits are 11 is the sentinel value
read from the fourth byte and fed, along with the least significant six bits
of each byte, to 60 comparators simultaneously to set the
rest of the L1 metadata bits. Such parallelization reduces the
latency impact of the fill process.
5.3 Load/Store Queue Modifications
Since the CFORM instruction updates the architecture state
(writes values), it is functionally a store instruction and han-
dled as such in the pipeline. However, there is a key differ-
ence: unlike a store instruction, the CFORM instruction should
not forward the value to a younger load instruction whose
address matches within the load/store queue (LSQ) but in-
stead return the value zero. This functionality is required
to provide tamper-resistance against side-channel attacks.
Additionally, upon an address match, both load and store
instructions subsequent to an in flight CFORM instruction are
marked for Califorms exception (which is thrown when the
instruction is committed).
In order to detect an address match in the LSQ with
a CFORM instruction, first a cache line address should be
matched with all the younger instructions. Subsequently
upon a match, the value stored in the LSQ for the CFORM
instruction, which contains the mask value indicating
Figure 8. Logic diagram for Califorms conversion from the L1 cache (califorms-bitvector) to L2 cache (califorms-sentinel). The
green Find-index blocks are constructed using 64 shift blocks followed by a single comparator. The circled numbers refer to
the corresponding steps in Algorithm 1.
[Diagram blocks: 6 x 64 decoders over the 6-bit fields of the L1 cache line data (1 for used, 0 for unused), Find Index of First Bit of Value 0 (sentinel value), four masked Find Index of First Bit of Value 1 stages (1st to 4th security byte), and crossbar and combinational logic producing the L2 cache line data.]
Figure 9. Logic diagram for Califorms conversion from
the L2 cache (califorms-sentinel) to L1 cache (califorms-
bitvector), as described in Algorithm 2. The blue (==) blocks
are constructed using logic comparators.
[Diagram blocks: 1-bit L2 Califorms metadata, 2-bit header comparisons (!=00, ==10, ==11) for bytes [0] to [3], and 6-bit ==Sentinel comparisons for bytes [4] to [63], producing the L1 Califorms bitvector.]
to-be-califormed bytes, is used to confirm the final match. To
facilitate a match with a CFORM instruction, each LSQ entry
should be associated with a bit that indicates whether the entry
contains a CFORM instruction. Detecting a complete match
may take multiple cycles. However, a legitimate load/store
instruction should never be forwarded a value from a CFORM
instruction, so store-to-load forwarding from a
CFORM instruction is not on the critical path of the program
(i.e., its latency should not affect performance, and we
do not evaluate its effect in our evaluation). Alternatively, if
LSQ modifications are to be avoided, CFORM instructions
can be surrounded by memory serializing instructions
(i.e., ensuring that CFORM instructions are the only in-flight
memory instructions).
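The two-step LSQ match described in this section (a cache line address match, then confirmation against the CFORM mask) can be sketched as follows. The names `CformEntry` and `check_younger_access` are hypothetical, the mask encoding (bit i set means byte i is to be califormed) is an assumption, and accesses are assumed not to cross a cache line boundary.

```python
from dataclasses import dataclass

LINE_BYTES = 64

@dataclass
class CformEntry:
    line_addr: int  # cache-line-aligned address targeted by the CFORM
    mask: int       # assumed encoding: bit i set = byte i is to be califormed

def check_younger_access(cform, addr, size):
    """Return (forwarded_value, mark_exception) for a younger load/store."""
    # Step 1: coarse match on the cache line address.
    if (addr // LINE_BYTES) * LINE_BYTES != cform.line_addr:
        return None, False
    # Step 2: confirm the match against the to-be-califormed byte mask.
    offset = addr % LINE_BYTES
    touched = ((1 << size) - 1) << offset
    if cform.mask & touched:
        # No real value is forwarded: the access sees zero and is marked
        # for a Califorms exception at commit.
        return 0, True
    return None, False
```

In this model, an access that shares the cache line but touches no masked byte proceeds normally, which mirrors the role of the mask as the final confirmation step.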
6 Software Design
This section describes the compiler support, the memory allocator
changes and the operating system changes needed to support
Califorms.
6.1 Dynamic Memory Management
We can consider two approaches to applying security bytes:
(1) Dirty-before-use. Unallocated memory has no security
bytes. We set security bytes upon allocation and unset them
upon deallocation; or (2) Clean-before-use. Unallocated memory
remains filled with security bytes at all times. We clear
the security bytes (in legitimate data locations) upon
allocation and set them upon deallocation. Ensuring temporal
memory safety in the heap remains a non-trivial problem [1].
We therefore follow a clean-before-use approach
in the heap, so that deallocated memory regions remain
protected by califormed security bytes.³ Additionally, in order
to provide temporal memory safety, we do not reallocate
recently freed regions until the heap is sufficiently consumed
(quarantining). Compared to the heap, the security benefits
are limited for the stack, since temporal attacks on the stack
(e.g., use-after-return attacks) are much rarer. Hence, we
apply the dirty-before-use scheme on the stack.
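The clean-before-use policy with quarantining can be modeled in a few lines. This is a toy sketch, not the paper's allocator: `ToyHeap` and its methods are hypothetical names, allocation is a simple bump pointer, and the quarantine is never drained.

```python
class ToyHeap:
    """Clean-before-use heap model; True in `security` marks a security byte."""

    def __init__(self, size):
        self.security = [True] * size  # unallocated memory is fully blacklisted
        self.next_free = 0
        self.quarantine = []           # freed (start, length) regions, not reused here

    def malloc(self, n):
        start = self.next_free
        if start + n > len(self.security):
            raise MemoryError("toy heap exhausted")
        self.next_free += n
        for i in range(start, start + n):
            self.security[i] = False   # clear security bytes in data locations
        return start

    def free(self, start, n):
        for i in range(start, start + n):
            self.security[i] = True    # re-blacklist the freed region
        self.quarantine.append((start, n))

    def load(self, addr):
        if self.security[addr]:
            raise PermissionError("Califorms exception (modeled)")
        return 0
```

Because freed regions are re-blacklisted and held in quarantine, a dangling access in this model faults instead of reading a reallocated object.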
³ It is natural to use the non-temporal CFORM instruction when deallocating
a memory region: a deallocated region is not meant to be used by the program,
so polluting the L1 data cache with that memory is harmful and
should be avoided. We do not evaluate the use of non-temporal
instructions in this paper; they should provide better performance.
6.2 Compiler
Our compiler-based instrumentation infers where to place
security bytes within target objects, based on their type
layout information. The compiler pass supports three insertion
policies: the first, opportunistic, policy inserts security bytes
into existing padding bytes within the objects, and
the other two modify object layouts to introduce
randomly sized security byte spans that follow the full or
intelligent strategies described in Section 2. The first policy
aims at retaining interoperability with external code modules
(e.g., shared libraries) by avoiding type layout modification.
Where this is not a concern, the latter two policies offer
stronger security coverage, exhibiting a tradeoff between
security and performance.
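To make the opportunistic policy concrete, the following sketch locates the padding bytes that a C layout already contains, using Python's `ctypes` to stand in for the compiler's type layout information. The helper `padding_offsets` and the `Example` struct are hypothetical illustrations; the actual compiler pass works on LLVM type information, not on `ctypes`.

```python
import ctypes

def padding_offsets(struct_cls):
    """Return the byte offsets inside struct_cls not covered by any field."""
    used = set()
    for entry in struct_cls._fields_:
        field = getattr(struct_cls, entry[0])  # field descriptor with offset/size
        used.update(range(field.offset, field.offset + field.size))
    return [off for off in range(ctypes.sizeof(struct_cls)) if off not in used]

class Example(ctypes.Structure):
    # A one-byte tag followed by a four-byte value leaves alignment padding.
    _fields_ = [("tag", ctypes.c_uint8), ("value", ctypes.c_uint32)]
```

On common ABIs the three bytes after `tag` are padding, so they are candidates for security bytes without any change to the struct's layout.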
6.3 Operating System Support
We need the following support in the operating system:
• Privileged Exceptions. As the Califorms exception is
privileged, the operating system needs to handle it properly,
as with other privileged exceptions (e.g., page faults). We also
assume the faulting address is passed in an existing register
so that it can be used for reporting and investigation purposes.
Additionally, for the sake of usability and backwards
compatibility, we have to accommodate copying operations similar
in nature to memcpy. For example, a simple struct-to-struct
assignment could trigger this behavior and thus
break califormed software. Hence, to
maintain usability, we allow whitelisting functionality to
suppress the exceptions. This is done by issuing a privileged
store instruction to modify the value of the exception mask
registers before entering and after exiting the corresponding piece
of code. We discuss the implications of this design choice in
Section 7.
• Page Swaps. As discussed in Section 3, data with
security bytes is stored in main memory in a califormed
format. When a page with califormed data is swapped out
from main memory, the page fault handler needs to store
the metadata for the entire page into a reserved address
space managed by the operating system; the metadata is
reclaimed upon swap in. In practice, the kernel has enough address
space (the kernel’s virtual address space is 128TB for
a 64-bit Linux with 48-bit virtual addresses) to store the
metadata for all the processes on the system, since
the metadata for a 4KB page occupies only 8B.
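The 8B figure can be reproduced with a short calculation, assuming 64-byte cache lines and one metadata bit per line (an assumption that is consistent with the stated size, though the per-line bit count is not spelled out in this passage).

```python
PAGE_BYTES = 4096
LINE_BYTES = 64      # assumed cache line size
BITS_PER_LINE = 1    # assumed: one Califorms metadata bit per cache line

lines_per_page = PAGE_BYTES // LINE_BYTES               # 64 lines
metadata_bytes = lines_per_page * BITS_PER_LINE // 8    # 64 bits = 8 bytes
```

Under these assumptions the metadata is 1/512 of the page size, which is why the reserved kernel address space is ample.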
7 Security Discussion
7.1 Threat Model
For the security evaluation of this work, we assume a threat
model comparable to that used in contemporary related
works. We assume the victim program to have one or more
vulnerabilities that an attacker can exploit to gain arbitrary
read and write capabilities in the memory. Furthermore, we
assume that the adversary has access to the source code of
the program, therefore s/he is able to glean all source-level
information and/or deterministic compilation results from it
(e.g., find code gadgets within the program and determine
non-califormed layouts of data structures). However, s/he
does not have access to the host binary (e.g., server-side
applications). Finally, we assume that all hardware is trusted:
it does not contain, and is not subject to, bugs arising from
exploits such as physical or glitching attacks. However, given
their recent rise in relevance, we keep side-channel
attacks within the purview of our threat model.
Specifically, we consider attack vectors that seek
to leak the location and value of security bytes.
7.2 Hardware Attacks and Mitigations
• Metadata Tampering Attacks. A key feature of Califorms
as a metadata-based safety mechanism is the absence
of programmer visible metadata in the general case (apart
from a metadata bit in the page information maintained by
higher privilege software). Beyond the implications for its
storage overhead, this also means that our technique is
immune to attacks that explicitly aim to leak or tamper with the
metadata to bypass the respective defense. This, in turn,
implies a smaller attack surface so far as software maintenance
of metadata is concerned.
• Bit-granularity Attacks. Califorms’ capability of
fine-grained memory protection is the key enabler for intra-object
overflow detection. However, our byte-granular mechanism
is not enough to protect bit-fields without functionally turning
them into char bytes. This should not be a major
detraction, since security bytes can still be added around
composites of bit-fields.
• Heterogeneous Architectural Attacks. Califorms’
hardware modifications affect the memory hierarchy.
Hence, its protection is lost whenever one of its layers
is bypassed (e.g., when heterogeneous architectures or DMA
are used). Mitigating this requires that these mechanisms
always respect the security byte semantics by propagating
them along the respective memory structures and detecting
accesses to them. If the algorithm used for califorming is
also used by accelerators, then attacks through heterogeneous
components can be averted as well.
• Side-Channel Attacks. Our design takes multiple steps
to be resilient to side channel attacks. Firstly, we purposefully
avoid timing variances introduced due to our hardware mod-
ifications in order to avoid timing based side channel attacks.
Additionally, to avoid speculative execution side channels à la
Spectre [12], our design returns zero on a load to a security
byte, thus preventing speculative disclosure of metadata. We
augment this further by requiring that deallocated objects
(heap or stack) be zeroed out in software [13]. This
avoids the following attack scenario. Suppose the
attacker somehow knows that the padding locations should
contain a non-zero value (for instance, because s/he knows
that the object allocated at the same location prior to the current
object had non-zero values). While speculatively
disclosing the memory contents of the object, s/he discovers that
the padding location contains a zero instead, and
can therefore infer that the padding there contains a security byte. If
deallocations are accompanied by zeroing, however, this
inference no longer holds.
7.3 Software Attacks and Mitigations
• Coverage-Based Attacks. For califorming the padding
bytes (in an object), we need to know the precise type infor-
mation of the allocated object. This is not always possible in
C-style programs where void* allocations may be used. In
these cases, the compiler may not be able to infer the correct
type, in which case intra-object support may be skipped for
such allocations. Similarly, our metadata insertion policies
(viz., intelligent and full) require changes to the type lay-
outs. This means that interactions with external modules
that have not been compiled with Califorms support may
need (de)serialization to remain compatible. For an attacker,
such points in execution may appear lucrative because of
inserted security bytes getting stripped away in those short
periods. We note however that the opportunistic policy can
still remain in place to offer some protection.
On the other hand, for those interactions that remain obliv-
ious to type layout modifications (e.g., passing a pointer to
an object that shall remain opaque within the external mod-
ule), our hardware-based implicit checks have the benefit of
persistent tampering protection, even across binary module
boundaries.
• Whitelisting Attacks. Our concession of allowing
whitelisting of certain functions was necessary to make
Califorms more usable in common environments without
requiring significant source modifications. However, this
also creates a vulnerability window wherein an adversary
can piggyback on these functions in the source to bypass
our protection. To confine this vector, we keep the number
of whitelisted functions as small as possible.
• Derandomization Attacks. Since Califorms can be bypassed
if an attacker can guess a security byte’s location, it
is crucial that security bytes be placed unpredictably. To
carry out a guessing attack, the attacker first needs to obtain the
virtual memory address of the object s/he wants to corrupt,
and then overwrite a certain number of bytes within that
object. To learn the address of the object of interest, s/he
typically has to scan the process’ memory: the probability
of scanning without touching any of the security bytes is
$(1 - P/N)^O$, where $O$ is the number of allocated objects, $N$ is
the size of each object, and $P$ is the number of security bytes
within that object. With 10% padding ($P/N = 0.1$), when $O$
reaches 250, the attack success goes to $10^{-20}$. If the attacker
can somehow reduce $O$ to 1, which represents the ideal case
for the attacker, the probability of guessing the element of
interest is $1/7^n$ (since we insert 1–7 wide security bytes),
compounding as the number of paddings to be guessed ($= n$)
increases.
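The two probabilities in this paragraph can be evaluated directly. The sketch below is a literal transcription of the expressions $(1 - P/N)^O$ and $1/7^n$; the function names `scan_survival` and `guess_success` are hypothetical, and each object is assumed to be touched independently.

```python
def scan_survival(pad_fraction, num_objects):
    """Probability of scanning num_objects objects without touching a security byte.

    pad_fraction is P/N; objects are assumed to be independent.
    """
    return (1.0 - pad_fraction) ** num_objects

def guess_success(num_spans, max_width=7):
    """Probability of guessing num_spans span widths, each chosen from 1..max_width."""
    return (1.0 / max_width) ** num_spans
```

Evaluated literally, `scan_survival(0.1, 250)` is on the order of 1e-12, so the smaller figure quoted in the text presumably rests on assumptions beyond this simple expression. Either way, the survival probability falls exponentially with the number of objects scanned, and `guess_success(n)` shrinks by a factor of seven per additional span.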
The randomness is, however, introduced statically, akin
to the randstruct plugin introduced in recent Linux kernels,
which randomizes the layout of specified structures
(unlike Califorms, it does not offer detection of rogue
accesses) [14, 15]. The static nature of the technique may
make it prone to brute force attacks like BROP [16], which
repeatedly crash the program until the correct configuration
is guessed. This could be prevented by having multiple
versions of the same binary with different padding sizes or
simply by better logging, when possible. Another mitigating
factor is that BROP attacks require a specific type of program
semantics, namely, automatic restart-after-crash with the
same memory layout. Applications with these semantics can
be modified to spawn with a different padding layout in our
case and yet satisfy application level requirements.
8 Performance Evaluation
8.1 Hardware Overheads
• Cache Access Latency Impact of Califorms. Califorms
adds additional state and operations to the L1 data cache
and the interface between the L1 and L2 caches. The goal
of this section is to evaluate the access latency impact of
the additional state and operations described in Section 5.
Qualitatively, the metadata area overhead of L1 Califorms
is 12.5%, and the access latency should not be impacted, as
the metadata lookup can happen in parallel with the L1
tag access; the L1 to/from L2 califorms conversion should
also be simple enough that its latency can be completely
hidden. However, the metadata area overhead can increase
the L1 tag access latency, and the conversions might add a little
latency. Without loss of generality, we measure the access
latency impact of adding califorms-bitvector on a 32KB direct
mapped L1 cache in the context of a typical energy optimized
tag, data, formatting L1 pipeline with multicycle fill/spill
handling. For the implementation we use the 65nm TSMC
core library, and generate the SRAM arrays with the ARM
Artisan memory compiler. Table 2 summarizes the results
for the L1 Califorms (califorms-bitvector).
As expected, the overheads associated with the califorms-
bitvector are minor in terms of delay (1.85%) and power
consumption (2.12%). The SRAM area is the
dominant component of the total cache area (around 98%),
and the area overhead is 18.69%, higher than the qualitative
estimate of 12.5%. The results of the fill/spill modules are
reported separately in the right hand side of Table 2.
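The relationship between the 12.5% estimate and the measured area overhead can be checked with simple arithmetic. The sketch assumes a 64-byte line with one bitvector bit per byte (consistent with the 12.5% figure) and takes the gate-equivalent areas from Table 2; the variable names are illustrative only.

```python
# Qualitative estimate: one metadata bit per data byte.
bitvector_bits_per_line = 64          # assumed: 64-byte line, 1 bit per byte
data_bits_per_line = 64 * 8
metadata_overhead = bitvector_bits_per_line / data_bits_per_line

# Measured synthesis results from Table 2 (gate equivalents).
baseline_area_ge = 347_329.19
califorms_area_ge = 412_263.87
area_overhead = califorms_area_ge / baseline_area_ge - 1.0
```

The first ratio gives exactly 12.5%, while the second comes out at roughly 18.7%, matching the gap between the estimate and the synthesized design noted in the text.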
The latency impact of the fill operation is within the access
period of the L1 design. Thus, the califorming operation
can be folded completely within the pipeline stages that are
responsible for bringing cache lines from L2 to L1.
Table 2. Area, delay and power overheads of Califorms (GE represents gate equivalent). L1 Califorms (califorms-bitvector)
adds negligible delay and power overheads to the L1 cache access.

Main synthesis results and L1 overheads:
Design        Area (GE)   Delay (ns)  Power (mW)  Area (%)  Delay (%)  Power (%)
Baseline      347,329.19  1.62        15.84       -         -          -
L1 Califorms  412,263.87  1.65        16.17       18.69     1.85       2.12

Fill and spill overheads (L1 Califorms):
Module  Area (GE)  Delay (ns)  Power (mW)
Fill    8,957.16   1.43        0.18
Spill   34,561.80  5.50        0.52
The timing delay of the (less performance sensitive) spill
operation is larger than that of the fill operation (5.5ns vs.
1.4ns) as we use pure combinational logic to construct the
califorms-sentinel format in one cycle, as shown in Figure 8.
This cycle period can be reduced by dividing the operations
of Algorithm 1 (lines 7 to 11) into two or more pipeline stages.
For instance, getting the locations of the first four security
bytes (line 8) consists of four successive combinational blocks
(each detecting one security byte) in our evaluated design.
This logic can be easily pipelined into four stages. Therefore
we believe that the latency of both the fill and spill operations
can be minimal (or completely hidden) in the pipeline.
• Performance with Additional Cache Access Latency.
Our results from the VLSI implementation imply that
implementing Califorms imposes no additional L2/L3 latency.
However, this might not hold for every implementation
(e.g., depending on the target clock frequency),
so we pessimistically assume that each L2/L3 access
incurs one additional cycle of latency. To
evaluate the performance impact of this additional latency,
we perform detailed microarchitectural simulations.
We use ZSim [17] as the processor simulator and use
PinPoints [18] with Intel Pin [19] to select representative
simulation regions of the SPEC CPU2006 benchmarks with
ref inputs, compiled with Clang version 6.0.0 with “-O3
-fno-strict-aliasing” flags. We do not warm up the simulator
upon executing each SimPoint region but instead use a
relatively large interval length of 500M instructions to avoid
any warmup issues. We set MaxK used in SimPoint region
selection to 30.⁴
Table 3 shows the parameters of the processor, an Intel
Westmere-like out-of-order core that has been validated
against a real system, with performance and microarchitectural
events commonly within 10% of it [17]. We evaluate
the performance when both L2 and L3 caches incur an
additional latency of one cycle.
As shown in Figure 10 slowdowns range from 0.24%
(hmmer) to 1.37% (xalancbmk). The average performance
⁴ For some benchmark-input pairs we have seen discrepancies in the
number of instructions measured by PinPoints vs. ZSim, and thus the
appropriate SimPoint regions might not be simulated. Those inputs are:
foreman_ref_encoder_main for h264ref and pds-50 for soplex. Also,
due to time constraints, we could not complete executing SimPoint
for h264ref with sss_encoder_main input and excluded it from the
evaluation.
Table 3. Hardware configuration of the simulated system.
Core            x86-64 Intel Westmere-like OoO core at 2.27GHz
L1 inst. cache  32KB, 4-way, 3-cycle latency
L1 data cache   32KB, 8-way, 4-cycle latency
L2 cache        256KB, 8-way, 7-cycle latency
L3 cache        2MB, 16-way, 27-cycle latency
DRAM            8GB, DDR3-1333
Figure 10. Slowdown with additional one-cycle access latency
for both L2 and L3 caches.
[Bar chart. Y-axis: slowdown (0% to 1.5%). X-axis: astar, bzip2, dealII, gcc, gobmk, h264ref, hmmer, lbm, libquantum, mcf, milc, namd, omnetpp, perlbench, povray, sjeng, soplex, sphinx3, xalancbmk, AVG.]
slowdown is 0.83%, which is negligible and well within the
range of measurement error on real systems.
8.2 Software Performance Overheads
Our evaluations so far revealed that the hardware modifi-
cations required to implement Califorms add little or no
performance overhead. Here, we evaluate the overheads in-
curred by the two software based changes required to enable
intra-object memory safety with Califorms: the effect of un-
derutilized memory structures (e.g., caches) due to additional
security bytes, and the additional work necessary to issue
CFORM instructions (and the overhead of executing the in-
structions themselves).
• Evaluation Setup. We run the experiments on an Intel
Skylake-based Xeon Gold 6126 processor running at 2.6GHz
with RHEL Linux 7.5 (kernel 3.10). We omit dealII and
omnetpp since the shared libraries installed on RHEL are too
old to execute these two Califorms enabled binaries, and gcc
since it fails when executed with the memory allocator with
inter-object spatial and temporal memory safety support.
The remaining 16 SPEC CPU2006 C/C++ benchmarks are
compiled with our modified Clang version 6.0.0 with “-O3
-fno-strict-aliasing” flags. We use the ref inputs and
run to completion. We run each benchmark-input pair five
times and use the shortest execution time as its performance.
For the benchmarks with multiple ref inputs, the sum of the
Figure 11. Slowdown of the opportunistic policy, and full insertion policy with random sized security bytes (with and without
CFORM instructions). The average slowdowns of opportunistic and full insertion policies are 6.2% and 14.2%, respectively.
[Bar chart. Y-axis: slowdown (-10% to 50%). X-axis: astar, bzip2, gobmk, h264ref, hmmer, lbm, libquantum, mcf, milc, namd, perlbench, povray, sjeng, soplex, sphinx3, xalancbmk, AVG. Series: 1-3B, 1-5B, 1-7B, Opportunistic CFORM, 1-3B CFORM, 1-5B CFORM, 1-7B CFORM. Four off-scale bars are labeled 80.3%, 85.2%, 85.2% and 85.2%.]
Figure 12. Slowdown of the intelligent insert policy with random sized security bytes (with and without CFORM instructions).
The average slowdown is 2.0%.
[Bar chart. Y-axis: slowdown (-5% to 20%). X-axis: astar, bzip2, gobmk, h264ref, hmmer, lbm, libquantum, mcf, milc, namd, perlbench, povray, sjeng, soplex, sphinx3, xalancbmk, AVG. Series: 1-3B, 1-5B, 1-7B, 1-3B CFORM, 1-5B CFORM, 1-7B CFORM.]
execution times of all the inputs is used as the benchmark’s
execution time.⁵
We estimate the performance impact of executing a CFORM
instruction by emulating it with a dummy store instruction
that writes some value to the corresponding cache line’s
padding byte. Since one CFORM instruction can caliform the
entire cache line, issuing one dummy store instruction per
to-be-califormed cache line suffices. To issue the
dummy stores, we implement an LLVM pass that instruments
the code to hook into memory allocations and deallocations.
We then retrieve the type information to locate the padding
bytes, calculate the number of dummy stores and the addresses
they access, and finally emit them. Therefore, all the software
overheads we need to pay to enable Califorms are accounted
for in our evaluation.
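The rule of one dummy store per to-be-califormed cache line can be sketched as follows. The function name `dummy_store_addresses` is hypothetical, 64-byte lines are assumed, and the choice of the lowest padding address in each line as the store target is an arbitrary illustration rather than the behavior of the authors' LLVM pass.

```python
LINE_BYTES = 64  # assumed cache line size

def dummy_store_addresses(padding_addrs):
    """Pick one store address per distinct cache line that holds padding bytes."""
    first_in_line = {}
    for addr in sorted(padding_addrs):
        # Keep only the first (lowest) padding address seen in each line.
        first_in_line.setdefault(addr // LINE_BYTES, addr)
    return list(first_in_line.values())
```

Three padding bytes that share a line therefore cost a single emulated CFORM, while padding spread over three lines costs three.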
For the random sized security bytes, we evaluate three
variants: we fix the minimum size to one byte while varying
the maximum size to three, five and seven bytes (i.e., on
average, two, three and four security bytes are inserted,
respectively). In addition, in order to account
for the randomness introduced by the compiler, we gener-
ate three different versions of binaries for the same setup
(e.g., three versions of astar with random sized paddings
of minimum one byte and maximum three bytes). The error
bars in the figure represent the minimum and the maximum
execution times among 15 executions (three binaries × five
⁵ We use the arithmetic mean of the speedup (execution time of the original
system divided by that of the system with additional latency) to compute
the average; in other words, we are interested in a condition where the
workloads are not fixed and all types of workloads are equally probable on
the target system [20, 21].
runs) and the average of the execution times is represented
as the bar.
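The average span sizes quoted for the three variants (two, three and four bytes) follow if each span width is drawn uniformly between the minimum and maximum size. The uniform distribution is an assumption made for this sketch, and `mean_span` is a hypothetical helper.

```python
from fractions import Fraction

def mean_span(max_size, min_size=1):
    """Mean width of a span drawn uniformly from min_size..max_size bytes."""
    sizes = range(min_size, max_size + 1)
    return Fraction(sum(sizes), len(sizes))
```

With a minimum of one byte, maxima of three, five and seven bytes give means of two, three and four bytes, respectively.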
• Performance of the Opportunistic and Full Insertion
Policies with CFORM Instructions. Figure 11 presents the
slowdown incurred by three sets of strategies: the full insertion
policy (with random sized security bytes) without CFORM
instructions, the opportunistic policy with CFORM instructions,
and the full insertion policy with CFORM instructions. Since
the first strategy does not execute CFORM instructions, it does
not offer any security coverage, but it is shown as a reference to
break down the performance of the third strategy
(cache underutilization vs. executing CFORM instructions).
First, we focus on the three variants of the first strategy,
shown in the three leftmost bars. Different sizes of random
sized security bytes do not make a
large difference in terms of performance. The average
slowdowns of the three variants without CFORM
instructions are 5.5%, 5.6% and 6.5%, respectively. This is
consistent with the results shown in Figure 4, where the
average slowdowns with additional padding of two, three and four
bytes range from 5.4% to 6.2%. Therefore, to achieve
higher security coverage without losing performance, using
random sized security bytes with a minimum of one byte and a
maximum of seven bytes is promising. Looking at
individual benchmarks, a few,
including h264ref, mcf, milc and omnetpp, incur noticeable
slowdowns (ranging from 15.4% to 24.3%).
Next, we examine the opportunistic policy with CFORM
instructions, shown in the middle (fourth) bar. Since
this strategy does not add any additional security bytes, the
overheads are purely due to the work required to set up and
execute CFORM instructions. The average slowdown of this
policy is 7.9%. Some benchmarks encounter a
slowdown of more than 10%, namely gobmk, h264ref and
perlbench. The overheads are due to frequent allocations
and deallocations made during program execution, where
the programs have to calculate and execute CFORM
instructions upon every event (since every compound data type will
be, or was, califormed). For instance, perlbench is notorious for
being malloc-intensive, and is reported as such elsewhere [2].
Lastly, the third policy, the full insertion policy with CFORM
instructions, offers the highest security coverage in a
Califorms-based system, with the highest average slowdown of
14.0% (with random sized security bytes of at most
seven bytes). Nearly half (seven out of 16) of the benchmarks
encounter a slowdown of more than 10%, which might not
be suitable for performance-critical environments; in that case
the user might want to consider the
intelligent insertion policy discussed next.
• Performance of the Intelligent Insertion Policy with
CFORM Instructions. Figure 12 shows the slowdowns of the
intelligent insertion policy with random sized security bytes
(with and without CFORM instructions, in the same spirit as
Figure 11). First, we focus on the strategy without executing
CFORM instructions (the three bars on the left). The
performance trend is similar: the three variants with
different random sizes show little performance difference,
and the average slowdown is 0.2% with random sized
security bytes of at most seven bytes. None of the
programs incurs a slowdown of greater than
5%. Finally, with CFORM instructions (three bars on the right),
gobmk and perlbench have slowdowns of greater than 5%
(16.1% for gobmk and 7.2% for perlbench). The average
slowdown is 1.5%. Considering its security coverage and
performance overheads, the intelligent policy might be the
most practical option for many environments.
9 Related Work
Implementations of various safety mechanisms in hardware
were very popular from the 1970s to the 1990s, introducing
crucial legacy techniques such as capabilities, segmentation and
virtual memory. Subsequently, the focus shifted towards scalability
and performance until the last decade, when security saw a
revival in interest. In this section, we focus only on the latter
group of modern hardware based security techniques, and
compare them to Califorms. Previous hardware solutions
in this domain can be broadly categorized into the following
three classes: disjoint metadata whitelisting, cojoined
metadata whitelisting and inlined metadata blacklisting, as
presented in Figure 13.
• Disjoint Metadata Whitelisting. This class of techniques,
also called base and bounds, attaches bounds
metadata to every pointer, bounding the region of memory
it can legitimately dereference (see Figure 13(a)).
Hardbound [8] was the first hardware proposal to provide
spatial memory safety using this mechanism. Intel MPX [4]
is similar, but also introduces an explicit architectural interface
(registers and instructions) for managing bounds information.
Temporal safety was introduced to this scheme by
storing additional “version” information along with the
pointer metadata and verifying that no stale versions are
ever retrieved [22, 23]. BOGO [24] adds temporal safety to
MPX by invalidating all pointers to freed regions in MPX’s
lookup table. CHERI [9] revived capability based
architectures, which were introduced about 35 years ago in
commercial chips like the Intel 432 and IBM System/38. It has
similar bounds-checking guarantees, in addition to other metadata
fields pertaining to permissions, etc.⁶ PUMP [11], on the
other hand, is a general-purpose framework for metadata
propagation, and can be used for propagating pointer
bounds.
Typically, per pointer metadata is stored separately from
the pointer in a shadow memory region, in order to maintain
legacy pointer layout assumptions. Thus, although metadata
storage overhead scales according to the number of pointers
in principle, techniques generally reserve a fixed chunk of
memory for easy lookup. Owing to this disjoint nature, meta-
data access therefore requires additional memory operations,
which individual proposals seek to minimize with caching
and other optimizations. Regardless, disjoint metadata in-
troduces atomicity concerns potentially resulting in false
positives and negatives or complicating coherence designs at
the least (e.g., MPX is not thread-safe). Explicit specification
of bounds per pointer also allows bounds-narrowing in
principle, wherein pointer bounds can be tailored to protect
individual elements in a composite memory object (for instance,
when passing the pointer to an element to another
function). However, commercial compilers do not support this
feature for MPX due to the complexity of the compiler analyses
required. Furthermore, compatibility issues with untreated
modules (unprotected libraries, for instance) also introduce
real-world deployability concerns for these techniques. For
instance, MPX drops its bounds when protected pointers are
modified by unprotected modules, while CHERI does not
support this at all. MPX additionally makes bounds checking
explicit, thus introducing a marginal computational
overhead to bounds management as well.
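A minimal software model of base-and-bounds checking with bounds-narrowing is sketched below. `BoundedPointer`, `deref_ok` and `narrow` are hypothetical names, the end bound is treated as exclusive, and the model says nothing about how any specific proposal stores or looks up its metadata.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundedPointer:
    addr: int
    base: int
    end: int  # exclusive upper bound (an assumption of this model)

def deref_ok(p, size=1):
    """Whitelist check: the whole access must lie within [base, end)."""
    return p.base <= p.addr and p.addr + size <= p.end

def narrow(p, new_base, new_end):
    """Derive a pointer to a sub-object; bounds may only shrink."""
    if not (p.base <= new_base <= new_end <= p.end):
        raise ValueError("narrowed bounds must stay inside the parent bounds")
    return BoundedPointer(new_base, new_base, new_end)
```

Narrowing is what gives this class its intra-object protection: a pointer to an 8-byte field cannot be used to reach a neighboring field, even though both belong to the same object.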
• Cojoined Metadata Whitelisting. Originally introduced
in the IBM System/360 mainframes, this mechanism assigns
a “color” to memory chunks as well as pointers. As such, the
runtime check for access validity simply consists of
comparing the colors of the pointer and accessed memory (see
Figure 13(b)).
⁶ A recent version of CHERI [10], however, manages to compress metadata
to 128 bits and change pointer layout to store it with the pointer value
(i.e., implementing base and bounds as cojoined metadata whitelisting),
accordingly introducing instructions to manipulate them specifically.
Figure 13. Three main classes of hardware solutions for
memory safety: (a) disjoint metadata whitelisting (per-pointer
Begin/End bounds for a buffer), (b) cojoined metadata whitelisting
(color tags A and B shared by pointers and buffers), and
(c) inlined metadata blacklisting (tripwires on either side of a buffer).
This technique is currently commercially deployed by
SPARC ADI [3],7 which repurposes unused higher-order bits
in pointers to store the color. The color associated with
memory is stored in the ECC bits while in memory, and in
dedicated per-line metadata bits while in cache. Because of
this, metadata storage does not occupy any additional memory
in the program’s address space.8 Additionally, since metadata
bits are fetched along with the data they describe, no extra
memory operations are needed. For the same reason, it is
also compatible with unprotected modules, since the checks
are implicit as well. Temporal safety is trivially achieved by
assigning a different color when memory regions are reused.
However, intra-object protection or bounds-narrowing is
not supported as there is no means for “overlapping” colors.
Furthermore, protection also depends on the number of
metadata bits employed, since it determines the number of
colors that can be assigned. So, while color reuse allows
ADI to scale and limit metadata storage overhead, the reuse
is itself a vector that can be exploited. Another disadvantage
of this technique, specifically due to inlining metadata in
pointers, is that it only supports 64-bit architectures.
Narrower pointers would not have enough spare bits to
accommodate color information.
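The color check described above can be sketched in a few lines. This is a minimal illustration, not ADI's actual mechanism: the class and function names are hypothetical, and the bit positions (color in the top 4 bits of a 64-bit pointer) and the per-line color table are assumptions made for the example.

```python
# Minimal sketch of a color (tag) check; all names here are hypothetical.
COLOR_BITS = 4    # assumption: 4 tag bits
ADDR_BITS = 60    # assumption: color lives in the top 4 bits of a 64-bit pointer
LINE_SIZE = 64    # colors are tracked per cache line


class ColorViolation(Exception):
    """Raised when pointer color and memory color differ."""


class ColoredMemory:
    def __init__(self):
        self.line_color = {}  # cache-line index -> color

    def set_color(self, addr, size, color):
        """Color every line overlapping [addr, addr + size); return a tagged pointer."""
        assert 0 <= color < (1 << COLOR_BITS)
        first, last = addr // LINE_SIZE, (addr + size - 1) // LINE_SIZE
        for line in range(first, last + 1):
            self.line_color[line] = color
        return (color << ADDR_BITS) | addr

    def check(self, ptr):
        """Return the plain address if the colors match, else raise."""
        color, addr = ptr >> ADDR_BITS, ptr & ((1 << ADDR_BITS) - 1)
        if self.line_color.get(addr // LINE_SIZE) != color:
            raise ColorViolation(hex(addr))
        return addr


mem = ColoredMemory()
p = mem.set_color(0, 64, color=1)    # buffer A
mem.set_color(64, 64, color=2)       # buffer B, adjacent to A
print(mem.check(p + 8))              # in-bounds access passes
mem.set_color(0, 64, color=3)        # A is freed and its memory recolored
# mem.check(p) would now raise ColorViolation: the stale pointer still carries color 1
```

Because one color covers a whole cache line in this sketch, two fields of the same object that share a line cannot be told apart, which mirrors the lack of intra-object protection noted above.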
• Inlined Metadata Blacklisting. Another line of work,
also referred to as tripwires, aims to detect overflows by sim-
ply blacklisting a patch of memory on either side of a buffer,
and flagging accesses to this patch (see Figure 13(c)). This is
very similar to contemporary canary design [30], but there
are a few critical differences. First, canaries only detect
overwrites, not overreads. Second, hardware tripwires trigger
instantaneously, whereas canaries need to be periodically
checked for integrity, leaving a window between the time of
attack and the time of use. Finally, unlike hardware tripwires,
canary values can be leaked or tampered with, and thus mimicked.
7 ARM has a similar upcoming Memory Tagging [25] feature, whose
implementation details are unclear as of this work.
8 When memory is swapped, however, color bits are copied into memory
by the OS.
Proposal | Protection Granularity | Intra-Object | Binary Composability | Temporal Safety
Hardbound [8] | Byte | ✓∗ | ✗ | ✗
Watchdog [22] | Byte | ✓∗ | ✗ | ✓
WatchdogLite [23] | Byte | ✓∗ | ✗ | ✓
Intel MPX [4] | Byte | ✓∗ | ✗‡ | ✗
BOGO [24] | Byte | ✓∗ | ✗‡ | ✓
PUMP [11] | Word | ✗ | ✓ | ✓
CHERI [9] | Byte | ✗† | ✗ | ✗
CHERI concentrate [10] | Byte | ✗† | ✗ | ✗
SPARC ADI [3] | Cache line | ✗ | ✓ | ✓§
SafeMem [26] | Cache line | ✗ | ✓ | ✗
REST [27] | 8–64B | ✗ | ✓ | ✓¶
Califorms | Byte | ✓ | ✓ | ✓¶
Table 4. Security comparison against previous hardware
techniques. ∗ Achieved with bounds narrowing. † Although
the hardware supports bounds narrowing, CHERI forgoes it
since doing so compromises capability logic [28]. ‡ Execution
compatible, but protection dropped when external modules
modify pointer. § Limited to 13 tags. ¶ Allocator should
randomize allocation predictability.
SafeMem [26] implements tripwires by repurposing ECC
bits in memory to mark memory regions invalid, thus trading
off reliability for security. On processors supporting
speculative execution, however, it might be possible to
speculatively fetch blacklisted lines into the cache without
triggering a faulty memory exception. Unless these lines are
flushed immediately after, SafeMem’s blacklisting feature can
be trivially bypassed. Alternatively, REST [27] achieves the
same by storing a predetermined large random number, in the
form of a 64B token, in the memory to be blacklisted. Violations
are detected by comparing cache lines with the token when
they are fetched. REST provides temporal safety by
quarantining freed memory and not reusing it for subsequent
allocations. Compatibility with unprotected modules is easily
achieved as well, since tokens are part of the program’s
address space and all accesses are implicitly checked. However,
intra-object safety was not supported by REST owing to the
large memory overhead that such heavy usage of tokens would
entail.
Since it operates on the principle of detecting memory
accesses to security bytes, which are in turn stored along
with program data, Califorms belongs to the inlined metadata
class of defenses. However, it is different from other works
in the class in one key aspect — granularity. While both
REST and SafeMem blacklisted at the cache line granularity,
Califorms does so at the byte granularity. It is this property
that enables us to provide intra-object safety with negligible
performance and memory overheads, unlike previous work
in the area. For inter-object spatial safety and temporal safety,
we employ the same design principles as REST. Hence, our
safety guarantees are a strict superset of those provided by
previous schemes in this class (spatial safety by blacklisting
and temporal safety by quarantining).
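To make the granularity argument concrete, the sketch below estimates how large an object becomes when a blacklisted region separates each pair of adjacent fields. It is a simplification under stated assumptions: the function name is hypothetical, each tripwire is assumed to occupy exactly one granule and to be aligned to that granule, and the natural alignment of the fields themselves is ignored.

```python
def protected_size(field_sizes, granule):
    """Object size when a tripwire of `granule` bytes, aligned to `granule`,
    is placed between every pair of adjacent fields (simplified model)."""
    offset = 0
    for i, size in enumerate(field_sizes):
        offset += size
        if i < len(field_sizes) - 1:
            offset = -(-offset // granule) * granule  # round up to tripwire alignment
            offset += granule                         # the tripwire itself
    return offset


fields = [4, 8, 16]  # three fields, 28 bytes of program data
print(protected_size(fields, granule=1))   # byte-granular tripwires
print(protected_size(fields, granule=64))  # cache-line-granular tripwires
```

Under these assumptions the 28-byte object grows to 30 bytes with byte-granular tripwires and to 272 bytes with 64B ones, which illustrates why intra-object protection is impractical at cache-line granularity.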
Proposal | Metadata Overhead | Memory Overhead | Performance Overhead | Main Operations
Hardbound [8] | 0–2 words per ptr, 4b per word | ∝∼ # of ptrs and prog memory footprint | ∝∼ # of ptr derefs | 1–2 mem ref for bounds (may be cached), check µops.
Watchdog [22] | 4 words per ptr | ∝∼ # of ptrs and allocations | ∝∼ # of ptr derefs | 1–3 mem ref for bounds (may be cached), check µops.
WatchdogLite [23] | 4 words per ptr | ∝∼ # of ptrs and allocations | ∝∼ # of ptr ops | 1–3 mem ref for bounds (may be cached), check & propagate insns.
Intel MPX [4] | 2 words per ptr | ∝∼ # of ptrs | ∝∼ # of ptr derefs | 2+ mem ref for bounds (may be cached), check & propagate insns.
BOGO [24] | 2 words per ptr | ∝∼ # of ptrs | ∝∼ # of ptr derefs | MPX ops + ptr miss exception handling, page permission mods.
PUMP [11] | 64b per cache line | ∝∼ Prog memory footprint | ∝∼ # of ptr ops | 1 mem ref for tags (may be cached), fetch and chk rules; propagate tags.
CHERI [9] | 256b per ptr | ∝∼ # of ptrs and physical mem | ∝∼ # of ptr ops | 1+ mem ref for capability (may be cached), capability management insns.
CHERI concentrate [10] | Ptr size is 2x | ∝∼ # of ptrs | ∝∼ # of ptr ops | Wide ptr load (may be cached), capability management insns.
SPARC ADI [3] | 4b per cache line | ∝∼ Prog memory footprint | ∝∼ # of tag (un)set ops | (Un)set tag.
SafeMem [26] | 2x blacklisted memory | ∝∼ Blacklisted memory | ∝∼ # of ECC (un)set ops | Syscall to scramble ECC, copy data content.
REST [27] | 8–64B token | ∝∼ Blacklisted memory | ∝∼ # of arm/disarm insns | Execute arm/disarm insns.
Califorms | Byte granular security byte | ∝∼ Blacklisted memory | ∝∼ # of CFORM insns | Execute CFORM insns.
Table 5. Performance comparison against previous hardware techniques.
Proposal | Core | Caches/TLB | Memory | Software
Hardbound [8] | µop injection & logic for ptr meta, extend reg file and data path to propagate ptr meta | Tag cache and its TLB | N/A | Compiler & allocator annotates ptr meta
Watchdog [22] | µop injection & logic for ptr meta, extend reg file and data path to propagate ptr meta | Ptr lock cache | N/A | Compiler & allocator annotates ptr meta
WatchdogLite [23] | N/A | N/A | N/A | Compiler & allocator annotates ptrs, compiler inserts meta propagation and check insns
Intel MPX [4] | Unknown (closed platform [29], design likely similar to Hardbound) | Unknown | Unknown | Compiler & allocator annotates ptrs, compiler inserts meta propagation and check insns
BOGO [24] | Unknown (closed platform [29], design likely similar to Hardbound) | Unknown | Unknown | MPX mods + kernel mods for bounds page right management
PUMP [11] | Extend all data units by tag width, modify pipeline stages for tag checks, new miss handler | Rule cache | N/A | Compiler & allocator (un)sets memory, tag ptrs
CHERI [9] | Capability reg file, coprocessor integrated with pipeline | Capability caches | N/A | Compiler & allocator annotates ptrs, compiler inserts meta propagation and check insns
CHERI concentrate [10] | Modify pipeline to integrate ptr checks | N/A | N/A | Compiler & allocator annotates ptrs, compiler inserts meta propagation and check insns
SPARC ADI [3] | Unknown (closed platform) | Unknown | Unknown | Compiler & allocator (un)sets memory, tag ptrs
SafeMem [26] | N/A | N/A | Repurposes ECC bits |
REST [27] | N/A | 1–8b per L1D line, 1 comparator | N/A | Compiler & allocator (un)sets tags, allocator randomizes allocation order/placement
Califorms | N/A | 8b per L1D line, 1b per L2/L3 line | Use unused ECC bits | Compiler & allocator mods to (un)set tags, compiler inserts intra-object spacing
Table 6. Comparison of implementation complexity among previous hardware techniques.
9.1 Comparison with Califorms
Tables 4, 5, and 6 summarize the security, performance, and
implementation characteristics, respectively, of the hardware
based memory safety techniques discussed in this section.
Califorms has the advantage of requiring simpler hardware
modifications and being faster than disjoint metadata based
whitelisting systems. The hardware savings mainly stem
from the fact that our metadata resides with program data;
it does not require explicit propagation while additionally
obviating all lookup logic. This significantly reduces our
design’s implementation costs. Califorms also has lower per-
formance and energy overheads since it neither requires
multiple memory accesses, nor does it incur any significant
checking costs. However, unlike them, Califorms can be by-
passed if accesses to security bytes can be avoided (further
discussed in Section 7). This safety-vs.-complexity tradeoff is
critical to deployability and we argue that our design point
is more practical. This is because designers have to contend
with integrating these features into already complicated
processor designs without introducing additional bugs, while
also keeping the functionality of legacy software intact. This
is a hard balance to strike [4].
On the other hand, ideal cojoined metadata mechanisms
would have comparable slowdowns and similar compiler
requirements. However, practical implementations like ADI
exhibit some crucial differences from the ideal:
• It is limited to 64-bit architectures, which excludes a large
portion of embedded and IoT processors that operate on
32-bit or narrower platforms.
• It has a finite number of colors since available tag bits are
limited (ADI supports 13 colors with 4 tag bits). This is
important because reusing colors proportionally reduces
the safety guarantees of these systems in the event of a
collision.
• It operates at the coarse granularity of cache line width,
and hence, is not practically applicable for intra-object
safety.
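The effect of a small color space can be illustrated with a short calculation. The model below is an assumption made for illustration only (colors drawn uniformly and independently for each region); the function names are hypothetical and the text does not specify how ADI assigns colors.

```python
from fractions import Fraction


def collision_probability(num_colors):
    """Chance that a stray access lands on memory whose color happens to match,
    assuming colors are assigned uniformly and independently."""
    return Fraction(1, num_colors)


def expected_undetected(num_stray_accesses, num_colors):
    """Expected number of stray accesses that slip through under the same model."""
    return num_stray_accesses * collision_probability(num_colors)


p = collision_probability(13)          # 13 colors, as quoted for ADI
print(p, f"{float(p):.1%}")            # 1/13, roughly 7.7%
print(expected_undetected(26, 13))     # 2
```

With 13 colors, roughly one stray access in thirteen would go undetected under this model, and the count of missed accesses grows in proportion to the number of attempts.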
On the contrary, Califorms is agnostic of architecture
width and is, hence, better suited for deployment over a
more diverse device environment. In terms of safety, col-
lision is not an issue for our design either. Hence, unlike
cojoined metadata systems, our security does not scale in-
versely with the number of allocations in the program (see
Section 7 for a detailed discussion). Finally, our fine-grained
protection also makes us suitable for intra-object memory
safety which is a non-trivial threat in modern security [31].
10 Conclusion
Califorms is a hardware primitive which allows blacklist-
ing a memory location at byte granularity with low area
and performance overhead. A key observation behind Cali-
forms is that a blacklisted region need not store useful data
separately in most cases, since we can utilize byte-granular,
existing or added, space present between object elements to
store the metadata. This in-place compact data structure also
avoids additional operations for extraneously fetching the
metadata making it very performant in comparison. Further,
by changing how data is stored within a cache line we are
able to reduce the hardware area overheads substantially.
Subsequently, if the processor accesses a califormed byte (or
a security byte), due to programming errors or malicious
attempts, it reports a privileged exception.
To provide memory safety, we use Califorms to insert se-
curity bytes within data structures (e.g., between fields of
a struct) upon memory allocation and clear them on deal-
location. Notably, by doing so, Califorms can even detect
intra-object overflows, which is one of the prominent open
problems in memory safety, despite decades of research in
this area. We also described the necessary compiler and
software support for providing memory safety using Cali-
forms. To the best of our knowledge, this is the first hardware
primitive which makes in place byte-granular blacklisting
practical.
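The allocate, access, and deallocate flow summarized above can be mimicked in software. The following is a toy model, not the hardware design: the `Memory` class and its `cform` method are hypothetical stand-ins for the CFORM instruction, and the struct layout (an 8-byte buffer, one security byte, then a 4-byte field) is an assumption chosen for the example.

```python
class SecurityByteFault(Exception):
    """Stands in for the privileged exception raised on access to a security byte."""


class Memory:
    def __init__(self, size):
        self.data = bytearray(size)
        self.blacklisted = set()  # addresses currently marked as security bytes

    def cform(self, addrs, set_bytes):
        """Hypothetical CFORM-like operation: mark or clear security bytes."""
        if set_bytes:
            self.blacklisted |= set(addrs)
        else:
            self.blacklisted -= set(addrs)

    def store(self, addr, values):
        """Write `values` at `addr`; fault before writing if a security byte is touched."""
        if any(a in self.blacklisted for a in range(addr, addr + len(values))):
            raise SecurityByteFault(addr)
        self.data[addr:addr + len(values)] = values


# Assumed layout: buf at bytes 0..7, security byte at 8, another field at 9..12.
mem = Memory(16)
mem.cform([8], set_bytes=True)       # on allocation: insert the security byte
mem.store(0, b"A" * 8)               # fills buf exactly, no fault
try:
    mem.store(0, b"B" * 9)           # one byte too many: intra-object overflow
except SecurityByteFault:
    print("overflow detected before reaching the next field")
mem.cform([8], set_bytes=False)      # on deallocation: clear the security byte
```

The check runs before any byte is written, so the neighboring field is never modified by the faulting store in this model.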
References
[1] D. Weston and M. Miller. Windows 10 mitigation improvements. Black
Hat USA, 2016.
[2] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and
Dmitry Vyukov. AddressSanitizer: a fast address sanity checker. In
USENIX ATC ’12: Proceedings of the 2012 USENIX Annual Technical
Conference, pages 28–28, June 2012.
[3] Hardware-assisted checking using Silicon Secured Memory (SSM).
https://docs.oracle.com/cd/E37069_01/html/E37085/gphwb.html, 2015.
[4] Oleksii Oleksenko, Dmitrii Kuvaiskii, Pramod Bhatotia, Pascal Felber,
and Christof Fetzer. Intel MPX explained: a cross-layer analysis of the
Intel MPX system stack. Proceedings of the ACM on Measurement and
Analysis of Computing Systems, 2(2):28:1–28:30, 2018.
[5] Dokyung Song, Julian Lettner, Prabhu Rajasekaran, Yeoul Na, Stijn
Volckaert, Per Larsen, and Michael Franz. SoK: sanitizing for security.
In IEEE S&P ’19: Proceedings of the 40th IEEE Symposium on Security
and Privacy, May 2019.
[6] Gregory J Duck and Roland H C Yap. EffectiveSan: type and memory
error detection using dynamically typed C/C++. In PLDI ’18: Proceed-
ings of the 39th ACM SIGPLAN Conference on Programming Language
Design and Implementation, pages 181–195, June 2018.
[7] Kostya Serebryany, Evgenii Stepanov, Aleksey Shlyapnikov, Vlad
Tsyrklevich, and Dmitry Vyukov. Memory tagging and how it im-
proves C/C++ memory safety. arXiv.org, February 2018.
[8] Joe Devietti, Colin Blundell, Milo M K Martin, and Steve Zdancewic.
HardBound: architectural support for spatial safety of the C program-
ming language. In ASPLOS XIII: Proceedings of the 13th International
Conference on Architectural Support for Programming Languages and
Operating Systems, pages 103–114, March 2008.
[9] Jonathan Woodruff, Robert N M Watson, David Chisnall, Simon W
Moore, Jonathan Anderson, Brooks Davis, Ben Laurie, Peter G Neu-
mann, Robert Norton, and Michael Roe. The CHERI capability model:
revisiting RISC in an age of risk. In ISCA ’14: Proceedings of the 41st
International Symposium on Computer Architecture, pages 457–468,
June 2014.
[10] Jonathan Woodruff, Alexandre Joannou, Hongyan Xia, Anthony
Fox, Robert Norton, David Chisnall, Brooks Davis, Khilan Gudka,
Nathaniel W Filardo, A Theodore Markettos, Michael Roe, Peter G
Neumann, Robert Nicholas Maxwell Watson, and Simon Moore. CHERI
concentrate: practical compressed capabilities. IEEE Transactions on
Computers, pages 1–1, April 2019.
[11] Udit Dhawan, Catalin Hritcu, Raphael Rubin, Nikos Vasilakis, Silviu
Chiricescu, Jonathan M Smith, Thomas F Knight, Jr, Benjamin C Pierce,
and Andre DeHon. Architectural support for software-defined meta-
data processing. In ASPLOS ’15: Proceedings of the 20th International
Conference on Architectural Support for Programming Languages and
Operating Systems, pages 487–502, March 2015.
[12] Paul Kocher, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg,
Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz,
and Yuval Yarom. Spectre attacks: exploiting speculative execution.
In IEEE S&P ’19: Proceedings of the 40th IEEE Symposium on Security
and Privacy, May 2019.
[13] Alyssa Milburn, Herbert Bos, and Cristiano Giuffrida. SafeInit: compre-
hensive and practical mitigation of uninitialized read vulnerabilities.
In NDSS ’17: Proceedings of the 2017 Network and Distributed System
Security Symposium, pages 1–15, February 2017.
[14] Introduce struct layout randomization plugin. https://lkml.org/lkml/
2017/5/26/558, May 2017.
[15] Randomizing structure layout. https://lwn.net/Articles/722293/, May
2017.
[16] Andrea Bittau, Adam Belay, Ali Mashtizadeh, David Mazières, and
Dan Boneh. Hacking blind. In IEEE S&P ’14: Proceedings of the 35th
IEEE Symposium on Security and Privacy, pages 227–242, May 2014.
[17] Daniel Sanchez and Christos Kozyrakis. ZSim: fast and accurate mi-
croarchitectural simulation of thousand-core systems. In ISCA ’13:
Proceedings of the 40th International Symposium on Computer Architec-
ture, pages 475–486, June 2013.
[18] Harish Patil, Robert Cohn, Mark Charney, Rajiv Kapoor, Andrew Sun,
and Anand Karunanidhi. Pinpointing representative portions of large
Intel® Itanium® programs with dynamic instrumentation. MICRO-37:
Proceedings of the 37th IEEE/ACM International Symposium on Microar-
chitecture, pages 81–92, December 2004.
[19] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser,
Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazel-
wood. Pin: building customized program analysis tools with dynamic
instrumentation. In PLDI ’05: Proceedings of the 26th ACM SIGPLAN
Conference on Programming Language Design and Implementation,
pages 190–200, June 2005.
[20] Lieven Eeckhout. Computer architecture performance evaluation meth-
ods. Morgan & Claypool Publishers, 1st edition, 2010.
[21] Lizy Kurian John. More on finding a single number to indicate overall
performance of a benchmark suite. ACM SIGARCH Computer Archi-
tecture News, 32(1):3–8, March 2004.
[22] Santosh Nagarakatte, Milo M K Martin, and Steve Zdancewic. Watch-
dog: hardware for safe and secure manual memory management and
full memory safety. In ISCA ’12: Proceedings of the 39th International
Symposium on Computer Architecture, pages 189–200, June 2012.
[23] Santosh Nagarakatte, Milo M K Martin, and Steve Zdancewic. Watch-
dogLite: hardware-accelerated compiler-based pointer checking. In
CGO ’14: Proceedings of the 12th IEEE/ACM International Symposium
on Code Generation and Optimization, pages 175–184, February 2014.
[24] Tong Zhang, Dongyoon Lee, and Changhee Jung. BOGO: buy spa-
tial memory safety, get temporal memory safety (almost) free. In
ASPLOS ’19: Proceedings of the 24th International Conference on Archi-
tectural Support for Programming Languages and Operating Systems,
pages 631–644, April 2019.
[25] ARM A64 instruction set architecture for ARMv8-A architecture
profile. https://static.docs.arm.com/ddi0596/a/DDI_0596_ARM_a64_
instruction_set_architecture.pdf, 2018.
[26] Feng Qin, Shan Lu, and Yuanyuan Zhou. SafeMem: exploiting ECC-
memory for detecting memory leaks and memory corruption during
production runs. In HPCA ’05: Proceedings of the IEEE 11th International
Symposium on High Performance Computer Architecture, pages 291–302,
February 2005.
[27] Kanad Sinha and Simha Sethumadhavan. Practical memory safety with
REST. In ISCA ’18: Proceedings of the 45th International Symposium on
Computer Architecture, pages 600–611, June 2018.
[28] Brooks Davis, Khilan Gudka, Alexandre Joannou, Ben Laurie,
A Theodore Markettos, J Edward Maste, Alfredo Mazzinghi, Ed-
ward Tomasz Napierala, Robert M Norton, Michael Roe, Peter Sewell,
Robert N M Watson, Stacey Son, Jonathan Woodruff, Alexander
Richardson, Peter G Neumann, Simon W Moore, John Baldwin, David
Chisnall, James Clarke, and Nathaniel Wesley Filardo. CheriABI: enforcing
valid pointer provenance and minimizing pointer privilege in
the POSIX C run-time environment. In ASPLOS ’19: Proceedings of the
24th International Conference on Architectural Support for Programming
Languages and Operating Systems, pages 379–393, April 2019.
[29] Junjing Shi, Qin Long, Liming Gao, Michael A. Rothman, and Vin-
cent J. Zimmer. Methods and apparatus to protect memory from
buffer overflow and/or underflow, April 2018. International patent
WO/2018/176339.
[30] Crispin Cowan, Calton Pu, Dave Maier, Heather Hinton, Jonathan
Walpole, Peat Bakke, Steve Beattie, Aaron Grier, Perry Wagle, and Qian
Zhang. StackGuard: automatic adaptive detection and prevention of
buffer-overflow attacks. In USENIX Security ’98: Proceedings of the 7th
USENIX Security Symposium, pages 1–15, January 1998.
[31] Kangjie Lu, Chengyu Song, Taesoo Kim, and Wenke Lee. UniSan:
proactive kernel memory initialization to eliminate data leakages. In
CCS ’16: Proceedings of the 23rd ACM SIGSAC Conference on Computer
and Communications Security, pages 920–932, October 2016.
Appendices
A CALIFORMS Variants
Here we present two other variants of califorms-bitvector
(designed for the L1 cache) which have less storage over-
head (but with additional complexity) compared to the one
presented in Section 5.1.
• Califorms-4B. The first variant has 4B of additional storage
per 64B cache line. This variant stores the bit vector
within a security byte (illustrated in Figure 14). Since a single
byte bit vector (which can be stored in one security byte) can
represent the state for 8B of data, we divide the 64B cache
line into eight 8B chunks. If there is at least one security byte
within an 8B chunk, we use one of those bytes to store the
bit vector which represents the state of the chunk. For each
chunk, we need to add four additional bits of storage. One
bit to represent whether the chunk is califormed (contains a
security byte), and three bits to specify which byte within
the chunk stores the bit vector. Therefore, the additional
storage is 4B (4-bit × 8 chunks) or 6.25% per 64B cache line.
The figure highlights the chunk [0] being califormed where
the corresponding bit vector is stored in byte [1].
• Califorms-1B. We can further reduce the metadata overhead
by restricting where we store the bit vector within the
chunk (illustrated in Figure 15). The idea is to always store
the bit vector in a fixed location (the 0th byte in the figure,
or the header byte; similar idea used in califorms-sentinel). If
the 0th byte is a security byte this works without additional
modification. However if the 0th byte is not a security byte,
we need to save its original value somewhere else so that
we can retrieve it when required. For this purpose, we use
one of the security bytes (the last security byte is chosen in
the figure). This way we can eliminate three bits of metadata
per chunk to address the byte which contains the bit vector.
Therefore, the additional storage is 1B or 1.56% per 64B cache
line. Similar to Figure 14, the figure highlights chunk
[0] being califormed (where the corresponding bit vector is
stored in the first byte) and the original value of byte [0]
stored in the last security byte, byte [7], within the chunk.
Figure 14. Califorms-bitvector that stores a bit vector inside
security byte locations. The additional metadata (4 bits per
8B) specifies if the corresponding chunk contains a security
byte, and if it does, where in the chunk the bit vector is
stored. HW overhead of 4B per 64B cache line.
Figure 15. Califorms-bitvector that stores a bit vector in the
header (0th) byte of the chunk. If the header byte is normal
data (not a security byte), its original value is stored in the
last security byte. The additional metadata (1 bit per 8B)
specifies if the corresponding chunk contains a security byte.
HW overhead of 1B per 64B cache line.
• Evaluation. We perform the same VLSI evaluation shown
in Section 8.1 for the two additional califorms-bitvector
variants introduced in this section. Table 7 presents the
results. As we can see, califorms-bitvector with 4B and 1B
overheads incur 47% and 20% of extra delay, respectively, upon
L1 hit compared to the califorms-bitvector with 8B overhead
(49% and 22% additional delay compared to the baseline L1
data cache without Califorms). Also, both variants add almost
the same overheads upon spill and fill operations (compared to
the califorms-bitvector with 8B overhead), which are 9% delay
and 30% energy for spill, and 34% delay and 17% energy for fill
operations. Our evaluation reveals that califorms-bitvector
with 1B overhead outperforms the one with 4B overhead
in terms of both additional storage and access latency/energy.
This is because fixing the location of the header byte allows
faster lookup of the bit vector in the security byte.
Califorms-bitvector with 1B overhead can be a good alternative
(to the one presented in Section 5) in domains where the area
budget is tighter and/or performance is less critical, e.g.,
embedded or IoT systems.
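The Califorms-1B encoding and the storage figures quoted for both variants can be sketched as follows. This is an illustrative model only: the function names are hypothetical, the bit order of the bit vector (bit i describes byte i) is an assumption, and security bytes are zeroed on decode merely as a placeholder since they carry no program data.

```python
CHUNK = 8  # bytes per chunk; a 64B line holds eight chunks


def encode_chunk(data, security):
    """Return (califormed_bit, stored_bytes) for one 8B chunk.
    `security` is the set of byte indices that are security bytes."""
    assert len(data) == CHUNK
    if not security:
        return 0, bytes(data)
    stored = bytearray(data)
    if 0 not in security:
        stored[max(security)] = data[0]        # park byte [0] in the last security byte
    stored[0] = sum(1 << i for i in security)  # header byte holds the bit vector
    return 1, bytes(stored)


def decode_chunk(califormed, stored):
    """Recover (program_data, security_indices) from the stored form."""
    if not califormed:
        return bytes(stored), set()
    security = {i for i in range(CHUNK) if stored[0] >> i & 1}
    data = bytearray(stored)
    if 0 not in security:
        data[0] = stored[max(security)]        # restore the original byte [0]
    for i in security:
        data[i] = 0                            # placeholder value for security bytes
    return bytes(data), security


def storage_overhead(bits_per_chunk, line_bytes=64):
    """Extra bytes per line and the same figure as a percentage of the line."""
    extra_bytes = bits_per_chunk * (line_bytes // CHUNK) / 8
    return extra_bytes, 100 * extra_bytes / line_bytes


chunk = bytes([0xAA, 0, 0x11, 0x22, 0x33, 0x44, 0x55, 0])
flag, stored = encode_chunk(chunk, {1, 7})  # bytes [1] and [7] are security bytes
print(flag, stored.hex())
print(decode_chunk(flag, stored) == (chunk, {1, 7}))
print(storage_overhead(4), storage_overhead(1))
```

The last line reproduces the 4B (6.25%) and 1B (about 1.56%) per-line figures from the per-chunk metadata widths of 4 bits and 1 bit.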
B Handling SIMD/Vector Instructions
As we have discussed in Section 6.2, precise loads and stores
(along with the whitelisting capability) allow us to detect
access violations upon memory instructions. However, there
are certain classes of instructions where issuing precise
memory instructions may noticeably degrade performance. One
class we can imagine is the SIMD/vector instructions, where
vector loads read a very wide (e.g., 512 bits for Intel AVX-512)
word into the SIMD/vector register with a single instruction.
For such instructions we can (1) operate the same way as with
regular loads by issuing precise loads (e.g., by using vector
gather instructions with appropriate masks), (2) issue wide
vector loads as is and trigger an exception whenever the
vector load touches a security byte; this may introduce false
positives, but in reality data structures used by SIMD/vector
instructions are unlikely to contain security bytes, or (3)
add one bit per byte in the SIMD/vector registers so that we
can propagate the security byte information, and trigger an
exception whenever SIMD/vector instructions operate on
a security byte. Investigating these alternatives is left for
future work.
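The difference between options (1) and (2) can be illustrated with a small model. The functions below are hypothetical and use a per-byte mask for simplicity; real masked or gather instructions typically operate on wider elements, so this is a sketch of the idea rather than of any particular instruction set.

```python
def wide_load_faults(addr, width, security_bytes):
    """Option (2): a wide load faults if any byte it touches is a security byte."""
    return any(a in security_bytes for a in range(addr, addr + width))


def masked_load_faults(addr, mask, security_bytes):
    """Option (1): only the bytes enabled in `mask` are accessed and checked."""
    return any(on and (addr + i) in security_bytes for i, on in enumerate(mask))


security_bytes = {40}                 # one security byte inside the 64B window
needed = [True] * 32 + [False] * 32   # the program only uses bytes 0..31
print(wide_load_faults(0, 64, security_bytes))        # True: false positive
print(masked_load_faults(0, needed, security_bytes))  # False: precise check
```

In this example the unmasked 64-byte load reports a violation even though the program never uses the security byte, which is the false positive mentioned for option (2).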
Table 7. Area, delay and power overheads of the three L1 Califorms (GE represents gate equivalent). The top two rows are
presented in Table 2 and are shown here again for reference. Califorms-bitvector with 4B and 1B overheads incur 47% and 20%
extra delay, respectively, upon L1 hit compared to califorms-bitvector with 8B overhead. Also the two Califorms add 9% delay
and 30% energy upon spill and 34% delay and 17% energy upon fill.
Columns 2-4 are the main synthesis results, columns 5-7 the L1 overheads, columns 8-10 the fill overheads, and columns 11-13 the spill overheads.
Design | Area (GE) | Delay (ns) | Power (mW) | L1 Area (%) | L1 Delay (%) | L1 Power (%) | Fill Area (GE) | Fill Delay (ns) | Fill Power (mW) | Spill Area (GE) | Spill Delay (ns) | Spill Power (mW)
Baseline | 347,329.19 | 1.62 | 15.84 | - | - | - | - | - | - | - | - | -
Califorms-8B | 412,263.87 | 1.65 | 16.17 | 18.69 | 1.85 | 2.12 | 8,957.16 | 1.43 | 0.18 | 34,561.80 | 5.50 | 0.52
Califorms-4B | 370,972.35 | 2.42 | 17.95 | 6.80 | 49.38 | 11.00 | 9,770.04 | 1.92 | 0.21 | 35,775.36 | 5.99 | 0.68
Califorms-1B | 356,694.82 | 1.98 | 16.00 | 2.69 | 22.22 | 1.06 | 10,223.28 | 1.94 | 0.22 | 35,958.24 | 5.99 | 0.67
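The L1 hit delay percentages quoted in the caption can be re-derived from the Delay (ns) column of the main synthesis results. The snippet below is only an arithmetic check; the dictionary and function names are hypothetical, and the values are copied from Table 7.

```python
# L1 access delay in ns, copied from the "Delay (ns)" column of Table 7.
L1_DELAY_NS = {
    "Baseline": 1.62,
    "Califorms-8B": 1.65,
    "Califorms-4B": 2.42,
    "Califorms-1B": 1.98,
}


def extra_delay_pct(design, reference):
    """Extra delay of `design` relative to `reference`, rounded to a whole percent."""
    return round((L1_DELAY_NS[design] / L1_DELAY_NS[reference] - 1) * 100)


for design in ("Califorms-4B", "Califorms-1B"):
    print(design,
          extra_delay_pct(design, "Califorms-8B"),  # versus the 8B variant
          extra_delay_pct(design, "Baseline"))      # versus the baseline L1
```

The results are 47% and 20% relative to Califorms-8B, and 49% and 22% relative to the baseline, matching the figures in the text.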
