University of Central Florida

STARS
Electronic Theses and Dissertations, 2020-

FPGA-Augmented Secure Crash-Consistent Non-Volatile Memory
Yu Zou
University of Central Florida

Part of the Computer Engineering Commons, and the Information Security Commons

This Doctoral Dissertation (Open Access) is brought to you for free and open access by STARS. It has been accepted
for inclusion in Electronic Theses and Dissertations, 2020- by an authorized administrator of STARS. For more
information, please contact STARS@ucf.edu.

STARS Citation
Zou, Yu, "FPGA-Augmented Secure Crash-Consistent Non-Volatile Memory" (2021). Electronic Theses and
Dissertations, 2020-. 792.
https://stars.library.ucf.edu/etd2020/792

FPGA-AUGMENTED SECURE CRASH-CONSISTENT NON-VOLATILE MEMORY

by

YU ZOU
B.S. Beihang University, 2015
M.S. University of Florida, 2017

A dissertation submitted in partial fulfilment of the requirements
for the degree of Doctor of Philosophy
in the Department of Electrical and Computer Engineering
in the College of Engineering and Computer Science
at the University of Central Florida
Orlando, Florida

Summer Term
2021
Major Professor: Mingjie Lin

© 2021 Yu Zou


ABSTRACT

Emerging byte-addressable Non-Volatile Memory (NVM) technology, although promising superior memory density and ultra-low energy consumption, poses unique challenges to achieving persistent data privacy and computing security, both of which are critically important to embedded and IoT applications. Specifically, to successfully restore NVMs to their working states after unexpected system crashes or power failures, maintaining and recovering all the necessary security-related metadata can severely increase memory traffic, degrade runtime performance, exacerbate the write endurance problem, and demand costly hardware changes to off-the-shelf processors. In this thesis, we summarize and expand upon two of our works, ARES and HERMES, to design a new FPGA-assisted, processor-transparent security mechanism that aims to efficiently and effectively achieve all three aspects of a security triad (confidentiality, integrity, and recoverability) in modern embedded computing. Given the growing prominence of CPU-FPGA heterogeneous computing architectures, ARES leverages the FPGA’s hardware reconfigurability to offload performance-critical and security-related functions to the programmable hardware without the microprocessor’s involvement. In particular, recognizing that the traditional Merkle tree caching scheme cannot fully exploit the FPGA’s parallelism due to its sequential and recursive function calls, ARES proposed a new Merkle tree cache architecture and a novel Merkle tree scheme that flatten and reorganize the computation in the traditional Merkle tree verification and update processes to fully exploit the parallel cache ports and to fully pipeline time-consuming hashing operations. To further optimize the throughput of Bonsai Merkle tree (BMT) operations, HERMES proposed an optimally efficient dataflow architecture that processes multiple outstanding counter requests simultaneously. Specifically, HERMES explored and addressed three technical challenges in exploiting the task-level parallelism of BMT and proposed a speculative execution approach with both low latency and high throughput.

For Mom


ACKNOWLEDGMENTS

Thank my advisor Dr. Mingjie Lin for the four years of advising and cooperation. From an entry-level graduate student to a PhD candidate capable of digging deeply into a research field,
he patiently helped me countless times, e.g. with how to review the literature, how to establish a knowledge
system, and how to come up with valuable research ideas. Without him, I could not have gone so far.
Thank my academic committee members, Dr. Amro Awad, Dr. Jun Wang, Dr. Rickard Ewetz, and
Dr. Wei Zhang for being willing to serve as my academic committee members. Thank Dr. Amro
Awad for always providing some interesting insights from a new perspective which is different
from hardware design.
Thank DARPA for funding me to work as a research assistant and funding the project. This project
is very valuable and I believe all our deliveries for this project will have a great impact in the
hardware security research field.
Thank all my teammates, Rakin, Sanjay, Kazi, Mazen, Chance, and Nick, for all the efforts put into
this project. All teammates worked hard to cooperate on the project and everyone should share the
success.
Thank my girlfriend, Yiming, for always supporting me whenever I felt frustrated, failed, and lost.
She taught me how to manipulate the work-life balance, to relax myself at the right time, and to
learn how to take the responsibility. Love her forever. Thank my family for all the supports so far.
Appreciate every experience on this academic journey. It is tough, but it is worthy.


TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
CHAPTER 1: INTRODUCTION
CHAPTER 2: RESEARCH BACKGROUND AND RELATED WORK
    Bonsai Merkle Tree: Standard Usage And Cache Support
    Security Metadata Recovery
    Target Threat Model
CHAPTER 3: ARES: PERSISTENTLY SECURE NVM
    Motivation And Contribution
    Challenges
    Methodology
        Cache-Augmented Bonsai Merkle Tree
            Novel BMT Cache Architecture in ARES
            Modified BMT Cache Control Policy for ARES
            New Merkle Tree Scheme for ARES BMT Cache
        FPGA-Accelerated Metadata Recovery
    Architecture Design And Hardware Implementation
        ARES System Overview
        NVM Memory Encryption
        Counter Cache with Recovery Support
        Memory Verification Unit
        Metadata Recovery Unit
    Experiment And Result Analysis
        Experimental Setup
        Resource Utilization
        Performance Analysis
        Recovery Performance
    Conclusion
CHAPTER 4: HERMES: HIGH-THROUGHPUT SPECULATIVE BMT OPERATION
    Motivation And Contribution
    Challenges
    Methodology
    Architecture Design And Hardware Implementation
        System Overview
        Speculative Buffer
        Parallel Verification Unit
        Stage Logic
        Benchmark
    Experiment And Result Analysis
        Experiment Setup
        Resource Utilization
        Counter Access Bandwidth
        Counter Access Latency
        Speculative Buffer Performance
        BMT Cache Performance
        Latency Overhead of BMT on Real Applications
    Conclusion
APPENDIX: PERFORMANCE MODELING OF ARES
    Memory Latency
        Parallel Counter Recovery
        Parallel Merkle Tree Rebuilding

LIST OF FIGURES

Figure 1.1: AES counter-mode encryption with split-counter scheme Awad et al. (2016); Ye et al. (2018).

Figure 1.2: Challenges of deploying secure persistent NVMs.

Figure 1.3: Conventional vs. FPGA-augmented secure architecture.

Figure 2.1: An example 2-level Bonsai Merkle tree Zubair and Awad (2019). (Root level and counter level are excluded.)

Figure 3.1: The most commonly used Merkle tree scheme Gassend et al. (2003) utilizes a cache and a recursive algorithm to reduce performance overhead of memory integrity verification. The algorithm is friendly to von-Neumann devices, such as processors, with sequential execution. Spatial computing devices, such as FPGA, exploit instruction-level and data-level parallelism relying on deploying all functionalities on a “flat empty paper” and coordinating with control signals. The theoretically-endless recursive function call is unfriendly to control logic design and flat-deploying hardware modules, since the datapath is indeterministic, consequently not able to exploit FPGA potentials.

Figure 3.2: A comparison between the traditional MT scheme and the ARES MT scheme when interacting with the cache. (a) Flattening the traditional recursive MT scheme generates a sequential execution that hides all the parallelization opportunities and consequently wastes clock cycles. (b) ARES decomposes each routine into sub-operations and rearranges operations into barrier-synchronized stages such that all the non-dependent operations within the same stage can be executed in parallel or pipelined. RD: read from DRAM; CR: read from cache; CW: write to cache.

Figure 3.3: Comparison of read miss in traditional cache and ARES MT cache. (a) Traditional cache. (b) ARES MT cache.

Figure 3.4: Cache partitioning for Merkle tree. Different levels can have caches of different sizes according to their different reuse distances.

Figure 3.5: (a) Counter read (at least one level hits in the cache). (b) Counter read (all levels miss in the cache). (c) Counter write.

Figure 3.6: Parallel Merkle tree reconstruction exploiting both task-level parallelism (simultaneously rebuilding sub-trees) and data-level parallelism (pipelining the reconstruction of a single MT node).

Figure 3.7: Parallel counter recovery. To recover a single counter, a counter block is read and each minor counter block is recovered. For each minor counter, the associated data block and ECC are read and decrypted with all the possible counters to recover. After all the minor counters are recovered, the recovered counter is written back to memory.

Figure 3.8: System overview of ARES.

Figure 3.9: Counter cache supporting counter recovery.

Figure 3.10: Memory Verification Unit

Figure 3.11: Metadata Recovery Unit

Figure 3.12: Total execution time of benchmarking applications.

Figure 3.13: Memory access bandwidth in 8-byte granularity.

Figure 3.14: Memory access bandwidth in 64-byte granularity.

Figure 3.15: Average latency of data read.

Figure 3.16: Breakdown of average read latency when the counter hits within the cache. AES and HMAC are executed in parallel when reading data.

Figure 3.17: Impact of Osiris persisting threshold on runtime performance.

Figure 3.18: Impact of Osiris persisting threshold on number of writes to persistent memory.

Figure 3.19: Impact of recovery parallelism on recovery time.

Figure 3.20: Impact of Osiris persisting threshold on recovery time.

Figure 4.1: (a) Blocking BMT implementation: C1 and C2 are updated sequentially. (b) HERMES BMT exploiting task-level parallelism: C1 and C2 updates are overlapped.

Figure 4.2: Computing latency of different security primitives used in ARES. In ARES, the latency of SHA1 is 160 clock cycles, such that updating a 5-level BMT requires at least 800 clock cycles, without considering NVM access latency.

Figure 4.3: An example 2-level Bonsai Merkle tree.

Figure 4.4: (a) Challenge #1: B2 is not verified yet, so it is not allowed to 1) persist B2 to NVM after merging the update from the lower level or 2) send the calculated hash over the updated B2′ to the upper level; otherwise security rules are violated. C4 is not verified, thus the update passed from the 3rd level cannot be safely merged. (b) Challenge #2: Even though B2 is cached and trusted, A1 cannot merge the update since C4 is not cached. (c) Challenge #3: A counter update chain always ends at the root level, while a verification chain terminates early when a node is cached.

Figure 4.5: (a) Solution #1: A speculative buffer (SB) and a parallel verification module (PV) are added to verify all levels in parallel and speculatively buffer unverified BMT nodes. (b) Solution #2: PV adaptively skips parallel verification if all levels are either cached or pre-buffered in SB. (c) Solution #3: Each level propagates the operation and caching status of the current level to the upper level such that a verification chain can terminate early when a node is cached. PV adaptively adjusts verification logic according to the operation and caching status of levels.

Figure 4.6: HERMES system overview. By decoupling stages using data buffers, stages work independently and asynchronously. The system is elastic such that MT stages can be flexibly extended when processing a Bonsai Merkle tree of arbitrary height.

Figure 4.7: Speculative buffer. V: dirty, U: update, CL: clean, DT: dirty, D: data, I: index of the 64-byte BMT node at current level.

Figure 4.8: Resource layout of HERMES. BMT caches are marked red, the pipelined BMT controller is marked blue, the benchmark module is marked yellow, and all other components are marked green.

Figure 4.9: Normalized total resource utilization of different implementations.

Figure 4.10: Bandwidth comparison of HERMES, ARES (baseline), and an insecure system at different access strides.

Figure 4.11: Latency analysis of HERMES and ARES on different caching status. In the figure, MTL indicates that the required BMT nodes at/above level L are already cached, while all the BMT nodes below level L are not cached yet (L is zero-based).

Figure 4.12: Speculative buffer hit rate at different access strides.

Figure 4.13: Cache hit rate of lower BMT levels at different access strides.

LIST OF TABLES

Table 2.1: Recovery/Runtime Failure Due To Failed Metadata Persistence Freij et al. (2020)

Table 3.1: Comparison with Other Existing Works

Table 3.2: Traditional Write-Back Cache vs. ARES Merkle Tree Cache

Table 3.3: Common Sources And Fixes of ECC Mismatch Ye et al. (2018)

Table 3.4: Metadata Ranges for Protecting 128MB Data.

Table 3.5: Composition of ARES Benchmarking Suite (Memory-Intensive Applications).

Table 3.6: Resource Utilization of Hardware Components

Table 4.1: Each level’s update behavior depends on caching status of current level and all lower levels.

Table 4.2: Cache configuration and metadata setup used in the experiment.

Table 4.3: Resource Utilization of Different Hardware Components

CHAPTER 1: INTRODUCTION

Propelled by the rapid surge of artificial intelligence (AI) and machine learning (ML), modern embedded computing systems (e.g., IoT and edge devices) are becoming increasingly important platforms for running data-intensive applications due to their ubiquity and portability. However, unlike most high-performance computing (HPC) systems equipped with abundant computing resources, embedded computing systems usually have constrained energy, computing power, and memory/storage space, and therefore critically demand low-latency and energy-efficient memory devices. As such, newly emerging NVM devices are increasingly adopted as de facto memory due to their high memory density, near-zero idle power, non-volatility, and high scalability. For example, in 2017, Intel Optane persistent memory (PMem) devices were successfully integrated into its cloud servers in order to accelerate in-memory database applications, machine learning algorithms, and scientific computing Intel (2020b). More recently, in 2020, Samsung Lee et al. (2020) announced a byte-addressable NVM module of up to 512 GB capacity with a DIMM form factor that eliminates the overhead of filesystems in modern operating systems. With the increasing computing power of embedded systems such as mobile devices, and with more performance-critical tasks being offloaded to them, these emerging low-power NVM technologies are good alternatives to DRAM for adoption in embedded systems. However, despite its tremendous performance benefits, an NVM-based computing system, due to its non-volatility, limited write endurance, and tight interplay between the CPU and the NVM device, faces security threats against its data confidentiality and data integrity from multiple fronts, which poses unique challenges to its deployment. So far, the majority of these security issues for NVM devices have not been extensively studied.
Memory attacks broadly fall into two categories: memory confidentiality attacks and memory integrity attacks. A memory confidentiality attack is passive: an attacker attempts to observe critical information transferred on the memory bus or stored in the memory. In contrast, a memory integrity attack is active: an adversary (a) attempts to arbitrarily modify an existing memory block (spoofing attack), (b) swaps memory blocks (splicing attack), or (c) replays a previously recorded data transaction on the memory bus (replay attack). Memory is also vulnerable to side-channel attacks, e.g. power attacks and timing analysis attacks. These attacks are orthogonal to this work, since this work only focuses on physical memory attacks.

Figure 1.1: AES counter-mode encryption with split-counter scheme Awad et al. (2016); Ye et al. (2018).

One prevailing security countermeasure against data confidentiality attacks is data

encryption/decryption, such as the well-studied AES (Advanced Encryption Standard) counter-mode scheme (AES-CTR) Suh et al. (2003); Ye et al. (2018); Gassend et al. (2003); Rogers et al. (2007); Alwadi et al. (2019); Zubair and Awad (2019). As shown in Figure 1.1, each counter value is first fetched from a counter cache/memory, and an initialization vector (IV) is generated. The 128-bit AES block cipher encrypts the IV with the private key to generate a one-time pad (OTP). During AES decryption, the ciphertext is read from memory and XORed with the generated OTP to obtain the plaintext. By overlapping ciphertext retrieval with the AES block cipher, AES decryption adds very little latency Rogers et al. (2007). AES encryption follows a similar process. Because AES-CTR strictly prohibits reusing the same IV with the same key to prevent known-plaintext attacks, during AES encryption the associated IV is incremented before the OTP is generated and XORed with the plaintext, which guarantees the uniqueness of the IV. Other AES encryption schemes and counter layout schemes are also in use. For example, Intel SGX stores 8 counters per cacheline Intel (2020d), and Intel TME uses AES-XTS instead of AES-CTR Intel (2020a). To foil memory integrity attacks, most secure processor architectures employ data authentication algorithms, such as combining a message authentication code (MAC) with a Bonsai Merkle tree (BMT), in order to authenticate data from untrusted memory. Specifically, to prevent memory spoofing and splicing attacks from changing data contents and data addresses, a MAC is generated over the tuple (data, addr). Both the data and the MAC are stored in memory. When reading a data block from memory, a MAC is computed first and compared with the stored MAC. Hash-based MAC (HMAC) is a common type of MAC involving a cryptographic hash function, e.g. SHA1. Throughout this work, we calculated HMAC using SHA1 Eastlake and Jones (2001) and truncated the 160-bit output to 64 bits. Additionally, a memory integrity tree, such as BMT, can effectively counter the replay attack by maintaining a hierarchical hash tree over the AES counters and persisting the top node, the root, within the secure boundary.

Figure 1.2: Challenges of deploying secure persistent NVMs.

Despite their

conceptual simplicity, all these memory security countermeasures have multiple negative ramifications for the performance of NVM devices. In fact, as depicted in Figure 1.2, designing and implementing a secure and persistent system for NVM requires judiciously considering various conflicting dimensions: security, recoverability, performance/lifetime, and hardware cost. More specifically, in order to maintain data security, the metadata generated by security mechanisms can cause extra traffic that degrades NVM runtime performance and shortens device lifetime. Additionally, because data in NVMs persist across crashes/reboots, security metadata generated/used in various security schemes are required to be persistent and recoverable after the system boots up (crash consistency). Furthermore, due to its manufacturing process, an NVM device typically can only sustain a limited number of write operations to each cell before wearing out. For example, Intel Optane persistent memory is specified to last only 5 years, 24/7, for a total of 350 PB of lifetime writes Intel (2020c). Therefore, to reduce the total number of writes and improve runtime performance, security metadata are normally cached with the write-back policy. However, due to the volatility of the cache, cached contents including security metadata are not crash consistent. Finally, all of the security and recovery functionalities require significant hardware changes to existing off-the-shelf processors. For example, both Intel and AMD integrated security features only into their newest high-end processors, such as Intel’s SGX and AMD’s SME, SEV, and SEV-SNP. In fact, all these protection schemes require either a processor-specific software change or a hardware change to the processor.
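As a concrete illustration of the MAC check described earlier in this chapter, the following Python sketch computes an HMAC-SHA1 tag over the tuple (data, addr) and truncates it to 64 bits. It is a minimal software sketch, not the ARES hardware implementation: the function names, the 64-byte block size, and the 8-byte little-endian address encoding are assumptions made for the example.

```python
import hashlib
import hmac
import struct

BLOCK_SIZE = 64  # bytes per protected data block (assumed cacheline size)
TAG_SIZE = 8     # 160-bit SHA1 output truncated to 64 bits


def compute_tag(key: bytes, data: bytes, addr: int) -> bytes:
    """Return a 64-bit HMAC-SHA1 tag over the tuple (data, addr)."""
    if len(data) != BLOCK_SIZE:
        raise ValueError("data must be exactly one block")
    # Hypothetical encoding: the block followed by an 8-byte little-endian address.
    message = data + struct.pack("<Q", addr)
    return hmac.new(key, message, hashlib.sha1).digest()[:TAG_SIZE]


def verify_block(key: bytes, data: bytes, addr: int, stored_tag: bytes) -> bool:
    """Recompute the tag on a read and compare it with the stored one."""
    return hmac.compare_digest(compute_tag(key, data, addr), stored_tag)


if __name__ == "__main__":
    key = b"k" * 16
    block = bytes(range(BLOCK_SIZE))
    tag = compute_tag(key, block, 0x1000)

    spoofed = bytes([block[0] ^ 0xFF]) + block[1:]
    print(verify_block(key, block, 0x1000, tag))    # untouched block
    print(verify_block(key, spoofed, 0x1000, tag))  # spoofing: contents changed
    print(verify_block(key, block, 0x1040, tag))    # splicing: block moved
```

In this sketch the untouched block verifies, while the spoofed and spliced blocks do not. A stale (data, tag) pair recorded earlier at the same address would still verify, which is the replay case that the text assigns to the integrity tree.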
To overcome all the aforementioned memory security challenges, we proposed an FPGA-assisted processor-transparent protection mechanism for secure persistent NVM devices. While most memory threat models assume a secure boundary that only involves the processor (including caches) and consequently treat all external components as untrusted, as shown in Figure 1.3(a), our ARES security architecture slightly extends its secure boundary to also include its associated FPGA reconfigurable fabric, as depicted in Figure 1.3(b). The proposed secure architecture not only significantly accelerates the security protection of the system but also ensures reliable metadata recovery, therefore achieving a persistently secure NVM subsystem. More importantly, the FPGA-augmented computing system requires no significant change to the processor design and its associated software stack, thus realizing a processor-transparent secure computer architecture for persistent memories.

Figure 1.3: Conventional vs. FPGA-augmented secure architecture. (a) Conventional secure processor: the microarchitecture/hardware needs to be modified. (b) FPGA-augmented secure architecture: the FPGA device hosts the MEU (data confidentiality), MVU (data integrity), and MRU (metadata recovery); it is processor-transparent and no hardware change is needed.

There are at least three distinctive advantages of the FPGA-augmented methodology. Firstly, the FPGA-augmented protection scheme, being processor-oblivious, is a completely non-invasive methodology, which provides all three aspects of the security triad (confidentiality, integrity, and recoverability) to an embedded computing system. In other words, the security provisioning of ARES requires no hardware modification to existing microprocessors. Secondly, by leveraging the FPGA’s potential, including parallelism and pipelining, time-consuming security primitives, such as AES and BMT, can be efficiently accelerated in hardware. The metadata recovery phase can be accelerated by the FPGA as well, which greatly shortens the system reboot process and improves system availability. Thirdly, the conducted design space exploration can be easily applied to processor IC design to improve upon existing secure processors, such as Intel SGX and AMD SEV-SNP, because it is a hardware-in-the-loop study of security primitives rather than the simulation-based evaluation commonly used in the computer architecture community. In this work, we discuss and expand upon two of our representative works, ARES and HERMES, which innovatively utilize the FPGA as transparent middleware to efficiently protect NVM along all four dimensions. Specifically, we claim the following contributions for this series of research works:

• To the best of our knowledge, this research is the first hardware-in-the-loop study of an embedded computing platform with persistently secure NVM. In particular, all the performance evaluations were experimentally conducted with an FPGA-based hardware implementation. A performance modelling approach was also proposed to analytically investigate performance bottlenecks, which provides valuable insights for designing and implementing next-generation embedded computing systems equipped with NVM devices.
• By reasonably relaxing the threat model to include an FPGA within the secure boundary and offloading all security protection and metadata recovery functionalities to the FPGA, we designed an FPGA-assisted and processor-transparent protection scheme that realizes a persistently secure memory subsystem without requiring any software stack modification.
• We explored several design challenges from the hardware implementation aspect and proposed a series of optimizations to reduce the performance overhead of the protection scheme. In particular, we discussed and addressed several technical challenges of exploiting the task-level parallelism of the Bonsai Merkle tree.


CHAPTER 2: RESEARCH BACKGROUND AND RELATED WORK

Bonsai Merkle Tree: Standard Usage And Cache Support

Figure 2.1: An example 2-level Bonsai Merkle tree Zubair and Awad (2019). (Root level and counter level are excluded.) The root is stored in a persistent register inside the secure and persistent boundary; the Merkle tree nodes and the 64-byte encryption counters are stored in NVM; each hash maps 64 bytes to 8 bytes.

Bonsai Merkle tree (BMT), as one type of data integrity tree, has been extensively adopted to protect the integrity of encryption counters, such as those used in AES-CTR. Rogers et al. (2007) proved that protecting only the counters is sufficient to protect the whole system’s integrity as long as the HMAC is calculated over a tuple of three elements, (data, address, counter). Other integrity trees previously studied include the SGX-style tree Costan and Devadas (2016), the Parallelizable Authentication Tree (PAT) Hall and Jutla (2005), and the Tamper-Evident Counter Tree (TEC-Tree) Elbaz et al. (2007). In this work, we focus on the BMT due to its effectiveness and versatility. To illustrate its working mechanism, we depict a simple BMT in Figure 2.1, where each leaf node is a 64-byte counter cacheline and each Merkle tree (MT) node is a concatenation of eight 64-bit hash values, each calculated over a child node using a standard secure hash function such as SHA1. Such hashing is done hierarchically up to the top level of the BMT, where only a single 64-bit hash value, the root, is computed. The root is stored within the secure and crash-consistent boundary, e.g. an on-chip persistent register, while other nodes are stored in off-chip memory.
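As a rough illustration of the construction just described, the following Python sketch builds the tree bottom-up from 64-byte counter blocks, packs eight 64-bit hashes into each 64-byte tree node, and derives the 64-bit root. The function names, the zero padding of partially filled nodes, and the use of the first 8 bytes of a plain SHA1 digest as the 64-bit hash are assumptions made for the example, not details taken from ARES.

```python
import hashlib

NODE_SIZE = 64                   # bytes per counter block and per tree node
HASH_SIZE = 8                    # each hash is truncated to 64 bits
ARITY = NODE_SIZE // HASH_SIZE   # eight child hashes fit in one node


def hash64(block: bytes) -> bytes:
    """64-bit hash of a block (assumed: first 8 bytes of SHA1)."""
    return hashlib.sha1(block).digest()[:HASH_SIZE]


def build_tree(counter_blocks):
    """Return (levels, root).

    levels[0] holds the counter blocks and levels[k] holds the tree nodes
    at level k; root is the 64-bit hash of the single top node.
    """
    if not counter_blocks or any(len(b) != NODE_SIZE for b in counter_blocks):
        raise ValueError("expected a non-empty list of 64-byte counter blocks")
    levels = [list(counter_blocks)]
    current = levels[0]
    while len(current) > 1:
        hashes = [hash64(block) for block in current]
        parents = []
        for i in range(0, len(hashes), ARITY):
            node = b"".join(hashes[i:i + ARITY])
            # Zero padding for a partially filled node is an assumption.
            parents.append(node.ljust(NODE_SIZE, b"\x00"))
        levels.append(parents)
        current = parents
    return levels, hash64(current[0])


if __name__ == "__main__":
    counters = [i.to_bytes(NODE_SIZE, "little") for i in range(64)]
    levels, root = build_tree(counters)
    print([len(level) for level in levels])  # [64, 8, 1]
```

With 64 counter blocks the sketch yields two tree levels (8 nodes, then 1 node) between the counters and the root, matching the 2-level example of Figure 2.1. Changing any counter block changes the root.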
One drawback of using BMT for integrity verification is the performance overhead. In general, when verifying a counter node, this counter node and all its upper-level tree nodes need to be read from main memory. Specifically, an N-level tree requires (N + 1) cacheline-size (64-byte) memory reads and a parallel hashing, consuming (N + 1) × T_mem + T_hash in total, where T_mem denotes the latency of a single memory access and T_hash denotes the latency of a single hashing operation. When updating a counter node, all the upper-level nodes are read from memory, updated sequentially from bottom to top, and written back, which consumes (2N + 1) × T_mem + (N + 1) × T_hash clock cycles per update. As a result, without novel computing techniques, operating a BMT over counter values for memory integrity incurs a large volume of extra memory traffic and severely increases the access latency of counter values. Therefore, there has been extensive work on optimizing BMT performance. Gassend et al. (2003) proposed a cached Merkle tree. Since the cache is within the secure boundary, a cached tree block can also be used as a root. Therefore, during verification, if a tree block at a certain level is cached, there is no need to access higher-level tree blocks. When a tree block is evicted from the cache, the immediate upper-level tree block needs to be updated. The complete scheme is concisely described as a recursive algorithm, and we refer readers to the original paper Gassend et al. (2003).
Szefer et al. Szefer and Biedermann (2014) proposed an idea of unbalanced Merkle tree by placing
frequently accessed blocks closer to the root to reduce latency and Zou et al. Zou and Lin (2019)
proposed a systematic way to adjust the skewness of a BMT with Huffman encoding. Freij Freij
et al. (2020) proposed a pipelined Merkle tree by overlapping updates of different levels. However,
all the studies related to BMT so far were based on simulation instead of hardware-in-the-loop
evaluation. In some systems, the adopted BMT schemes Gassend et al. (2003); Suh et al. (2003);
Ye et al. (2018); Yang et al. (2020); Freij et al. (2020) contain recursive function calls, e.g. as
shown in Figure 3.1, such that even though they are efficient when verified with simulators, these
algorithms are challenging to be deployed on real hardware or not as performant as estimated.
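The two cost expressions given earlier in this section for an uncached BMT can be evaluated directly. The sketch below is a minimal model; the function names and the example latencies (100 cycles per memory access, 80 cycles per hash) are hypothetical values chosen only for illustration, not measurements from this work.

```python
def verify_latency(n_levels: int, t_mem: int, t_hash: int) -> int:
    """(N + 1) memory reads followed by one parallel hashing step."""
    return (n_levels + 1) * t_mem + t_hash


def update_latency(n_levels: int, t_mem: int, t_hash: int) -> int:
    """(2N + 1) memory accesses and (N + 1) sequential hashing steps."""
    return (2 * n_levels + 1) * t_mem + (n_levels + 1) * t_hash


# Hypothetical numbers: a 4-level tree, T_mem = 100 cycles, T_hash = 80 cycles.
v = verify_latency(4, 100, 80)   # 5 * 100 + 80
u = update_latency(4, 100, 80)   # 9 * 100 + 5 * 80
```

Under these assumed latencies an update costs more than twice a verification, because the hashes of an update form a dependency chain while the hashes of a verification do not.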

Security Metadata Recovery

Table 2.1: Recovery/Runtime Failure Due To Failed Metadata Persistence Freij et al. (2020)

Ctr (64-byte) | HMAC (64-bit) | Root (64-bit) | Outcome
✓ | ✓ | × | BMT verification failure
✓ | × | ✓ | MAC verification failure
× | ✓ | ✓ | Wrong encryption, BMT & MAC failure

When using memory encryption and memory integrity verification to protect persistent memories, the security metadata accompanying each data block must also be stored crash-consistently in memory, including a 64-bit HMAC, a 64-byte counter block, and a 64-bit BMT root. If any of these three metadata items fails to persist crash-consistently, the protected data can no longer be used correctly or the system becomes even more vulnerable, as shown in Table 2.1. However, constrained by its physical media, persistent memory has limited write endurance and long write latency. To improve the runtime performance and lifespan of NVMs, all security metadata are typically cached. Without effective metadata recovery, if such a secure system crashes or reboots, its cached metadata will be lost and its security scheme will fail.
Osiris Ye et al. (2018) first proposed a scheme to recover counters. Essentially, Osiris persists a cached counter block into memory whenever it has been updated N times within the counter cache. To quickly recover each counter, Osiris encrypts the 64-bit ECC and the 64-byte data together, such that the ECC and the data match only when the correct counter is used in decryption. After the counters are recovered, the BMT is rebuilt from bottom to top. Further, Osiris-Plus Ye et al. (2018) eliminates counter metadata write-back by utilizing the recovery scheme, and recovers data in-flight to totally avoid counter writes. Similarly, cc-NVM Yang et al. (2019) and ShieldNVM Yang et al. (2020) adopted the same idea to persist counters. Instead of repurposing the ECC, they used HMACs to recover counters, since the HMAC is calculated over (addr, data, counter) when using a BMT. However, according to our own Verilog implementation on hardware, the latencies of HMAC and AES encryption are 404 clock cycles and 12 clock cycles, respectively. We therefore estimate that their counter recovery should be much slower than that of Osiris, even though the authors claimed a faster recovery time based on gem5 simulation. Anubis Zubair and Awad (2019) and Phoenix Alwadi et al. (2019) focus on quickly recovering the SGX-style tree. SuperMem Zuo et al. (2019) did not consider memory integrity verification, and used a write-through cache to guarantee crash consistency.

Target Threat Model

Traditionally, as shown in Figure 1.3(a), the chip boundary of a microprocessor is considered a natural security boundary. This means that only the processor and its associated caches are trusted, while the rest of the computer, including the system memory attached to the last-level cache through the system memory bus as well as the storage subsystem, is deemed to be insecure¹. Therefore, an attacker can potentially obtain sensitive information stored in memory or transferred through off-chip interconnects during program execution, or directly eavesdrop on NVM I/O interconnects (i.e., data confidentiality attacks). In a more sophisticated data integrity attack, an adversary can maliciously tamper with valuable data (spoofing), swap the data contents at two different memory locations (splicing), or replay prior data versions (replay), either in main physical memory or on the memory data bus. In order to mitigate all these security attacks, conventional microprocessor architects have added specifically designed security primitives at the microarchitecture level to create a trusted execution environment (TEE).

¹ We also assume that the core part of the operating system software is secure.

CHAPTER 3: ARES: PERSISTENTLY SECURE NVM

Motivation And Contribution

ARES aims at implementing persistently secure NVM devices that achieve all three security metrics (confidentiality, integrity, and recoverability) in a hardware-efficient and processor-transparent way, by offloading security primitives and metadata recovery functionalities to an FPGA.
ARES slightly expands the security boundary by including an FPGA device between the LLC (Last-Level Cache) and the NVM device, as shown in Figure 1.3(b). To effectively counter potential security risks, ARES constructs three hardware engines, the MEU (Memory Encryption Unit), the MVU (Memory Verification Unit), and the MRU (Metadata Recovery Unit), to provide data confidentiality, integrity, and recoverability, respectively.
The prevailing technology trend, especially the extensive adoption of heterogeneous SoCs in the computing industry, strongly motivates adopting the ARES architecture. For example, the Zynq UltraScale+ FPGA devices recently released by Xilinx, which tightly integrate microprocessors (ARM) and programmable logic (FPGA), have been widely utilized in 5G wireless, next-generation ADAS, and the industrial Internet-of-Things. Meanwhile, Intel has also released a collection of Xeon-FPGA heterogeneous platforms for data centers, machine learning, and autonomous driving. All these commercial heterogeneous SoCs not only package processors and FPGAs within the same die, but also connect them with dedicated high-bandwidth data buses, therefore promising ultra-high performance and impressive ease of use. In the traditional threat model, typically only processors are trusted because memory I/O via external memory buses is vulnerable to malicious attacks. However, because modern heterogeneous SoCs house both FPGAs and processors within the same chip die and do not expose their connections externally, ARES can reasonably include FPGAs in the secure boundary, as depicted in Figure 1.3(b).

Table 3.1: Comparison with Other Existing Works

Existing Works | Encryption | Authentication/Integrity | Recoverability | Evaluation Approach | Full Memory Protection
Academic works: Ye et al. (2018), Zuo et al. (2019), Yang et al. (2020), Freij et al. (2020), Liu et al. (2019) | AES-CTR | BMT Rogers et al. (2007) | Yes | Simulation | Yes
Intel SGX Costan and Devadas (2016) | AES-XTS | Intel SGX Tree | No | ASIC | No
Vig et al. (2019), Vig et al. (2018), Vig et al. (2017), Vig et al. (2021), Werner et al. (2017) | Yes | TEC Tree Elbaz et al. (2007) | No | FPGA | Yes
FAST Zou and Lin (2019) | AES-CTR | BMT | No | HW-SW Co-Simulation | Yes
ARES (this work) | AES-CTR | BMT (HW-Optimized) | Parallelized | FPGA | Yes
The ARES memory security architecture aims at achieving persistently secure non-volatile memory equipped with effective metadata recoverability while requiring no native hardware support from the processor side. Table 3.1 summarizes the comparison with other existing works and highlights the proposed ARES. In the table, we only list academic works relying on AES-CTR and BMT, of which a typical example is Osiris Ye et al. (2018), although there are many works relying on other integrity trees, e.g., Phoenix Alwadi et al. (2019) and Anubis Zubair and Awad (2019). All these previous works were evaluated in simulation and did not consider hardware friendliness when designing their schemes. Instead, ARES focuses on optimizing different security primitives, especially BMT and recovery, from the perspective of hardware implementation. Intel SGX is an industrial processor technology integrating both confidentiality and authentication, relying on AES-XTS and the Intel SGX tree. However, the architecture focuses on DRAM and only protects a small memory region (enclave). There are also other works based on FPGA platforms, such as Vig et al. (2019, 2018, 2017, 2021); Werner et al. (2017); Zou and Lin (2019), although none of them considered recoverability.
Specifically, we claim the following contributions:

• To the best of our knowledge, this work is the first hardware-in-the-loop study of an embedded computing platform with persistently secure NVM. In particular, all of ARES' performance evaluations were experimentally conducted with an FPGA-based hardware implementation. We also investigated ARES' performance bottlenecks with analytical performance modelling, which provides valuable insights for designing and implementing next-generation embedded computing systems equipped with NVM devices.
• ARES, through reasonably relaxing its threat model to include an FPGA within its secure boundary and offloading all security protection and metadata recovery functionalities to the FPGA, offers FPGA-assisted and processor-transparent protection to realize a persistently secure memory subsystem without requiring any software stack modification.
• More specifically, to improve ARES’ runtime performance, we have proposed a novel Merkle
tree cache structure and a hardware-friendly caching scheme, both of which can jointly enable aggressive pipelining of time-consuming hash functions. Compared to naively flattening
the traditional sequential and recursive Merkle tree scheme, the total latency can be shortened by up to 1.4 times and the system throughput can be improved by up to 2.6 times.
Furthermore, to accelerate metadata recovery and to improve system availability, we have
proposed an innovative parallel counter recovery and Merkle tree reconstruction scheme that
fully leverages FPGA’s inherent hardware parallelism. As a result, the metadata recovery
time of ARES is shortened by 1.8 times over the baseline without the parallel recovery.

Challenges

Despite the conceptual simplicity of ARES, implementing it with heterogeneous SoCs can encounter multiple challenges. While implementing encryption/decryption hardware engines with reconfigurable logic has been extensively investigated Chodowiec and Gaj (2003); Good and Benaissa (2005), similar work on achieving data integrity and recovery with FPGA logic is relatively scarce. One key reason is that the key methodology for achieving data integrity and recoverability, the Merkle tree (MT) and its associated caching scheme, is in general not very amenable to hardware implementation. More specifically, there are at least three difficulties.

Figure 3.1: The most commonly used Merkle tree scheme Gassend et al. (2003) utilizes a cache and a recursive algorithm to reduce the performance overhead of memory integrity verification. The algorithm is friendly to von Neumann devices with sequential execution, such as processors. Spatial computing devices, such as FPGAs, exploit instruction-level and data-level parallelism by deploying all functionalities on a “flat empty paper” and coordinating them with control signals. The theoretically endless recursive function call is unfriendly to control logic design and to the flat deployment of hardware modules, since the datapath is nondeterministic, and consequently it cannot exploit the FPGA's potential.

Firstly, the traditional write-back/write-through caching policy is not compliant with the requirements of a Trusted Execution Environment (TEE) when integrated with a Merkle tree. All the data residing within the TEE must be verified. When reading a block that misses in the cache under the traditional caching policy, the missed cacheline is read from memory, put into the cache, and returned to the upper logic. Since the cacheline comes directly from the external memory and is not verified, the traditional read miss policy violates the TEE's requirement. There are two approaches to resolve the violation: (1) integrating the MT inside the cache controller, and (2) designing a cache with a new caching policy.

[Figure 3.2 diagram. (a) Recursively and sequentially verifying tree nodes and counter: RdVerifyCtr, RdVerifyMT(L), ..., RdVerifyMT(2), RdVerifyMT(1) run one after another as control flow/function calls, leaving wasted cycles and unexploited parallelizable operations. (b) Parallel verification of tree nodes and counter: operations without dependency execute in parallel within barrier-synchronized stages, with pipelined hashing and parallel verification.]

Figure 3.2: A comparison between the traditional MT scheme and ARES MT scheme when interacting with cache. (a) Flattening traditional recursive MT scheme generates a sequential execution hiding all the parallelization opportunities and consequently wasting clock cycles. (b) ARES decomposes each routine into sub-operations and rearranges operations into barrier-synchronized stages such that all the non-dependent operations within the same stage can be executed in parallel/pipeline. RD: read from DRAM; CR: read from cache; CW: write to cache.

All the prior simulation-based works Suh et al. (2003); Ye et al. (2018); Zubair and Awad (2019); Gassend et al. (2003) adopted the first approach, tightly coupling MT logic with cache control logic. This tightly coupled architecture is not ideal for hardware design and implementation.
Secondly, the traditional Merkle tree caching scheme, being highly recursive and sequential, is not hardware-friendly. Although the content integrity of secure memory is commonly protected with an MT-based scheme, its runtime performance is typically unsatisfactory. As such, caching is commonly utilized to improve the MT's runtime performance and to reduce its memory traffic. Unfortunately, the existing crash-consistent MT caching scheme is not hardware-friendly due to its recursive algorithm. Figure 3.1 illustrates an example of such recursion when reading and verifying data from an external memory with a software implementation. One straightforward implementation is to simply flatten this recursive scheme into a non-recursive one in hardware by renaming recursive functions. However, doing so incurs data dependencies between stages, therefore preventing the efficient utilization of different hardware modules. In software, this sequential dependence can be resolved using speculative execution, as mentioned in Figure 3.1. However, implementing speculative execution on pure FPGA hardware greatly increases logic complexity. Furthermore, tackling potentially unbounded recursive calls in hardware usually proves to be quite cumbersome. Another approach to improving the MT scheme's runtime performance is to exploit deep pipelining for high throughput. However, as shown in Figure 3.2, because the processing stages are accessed sequentially due to the algorithmic dependency of the MT scheme, these time-consuming hash functions cannot be pipelined even though each hashing unit can process a new input per clock cycle.
Thirdly, the conventional metadata recovery process is prohibitively time-consuming because it needs to recover counters and completely rebuild the whole Merkle tree from the leaf level to the root level. At the basic level, ARES adopts the same recovery strategy as Osiris Ye et al. (2018), which requires recovering all the cached-but-lost counters and rebuilding the whole Merkle tree. To recover each counter block, all 64 associated data blocks are traversed, and for each data block all the possible counters need to be tried to find the correct one. To rebuild each Merkle tree node, all its child nodes are first read from memory, and a hash value is computed for each of them. Overall, such a recovery process has a long total run time and becomes a key bottleneck of metadata recovery, which further degrades the system availability. For HPC systems and cloud service providers, system availability is a strict requirement. To meet the common availability target of 99.999% (the five-nines rule), a system is only allowed to be down for a maximum of 5.25 minutes per year Zubair and Awad (2019). For instance, each minute of Amazon cloud service downtime costs 70 thousand dollars Forbes (2013). Thus, it is important to accelerate metadata recovery to improve the system availability.
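The downtime budget quoted above follows from simple arithmetic, sketched below. The function name is hypothetical, and a 365-day year is assumed.

```python
def max_downtime_minutes_per_year(availability: float, days_per_year: float = 365.0) -> float:
    """Allowed downtime per year, in minutes, for a given availability target."""
    return (1.0 - availability) * days_per_year * 24 * 60


five_nines = max_downtime_minutes_per_year(0.99999)   # about 5.26 minutes
```

Every recovery after a crash is charged against this budget, which is why the recovery time itself matters for availability.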

Methodology
[Figure 3.3 diagram. (a) Traditional cache: 1. read request; 2. miss; 3. read from memory and put into cache; 4. hit; 5. eviction; 6. read response. (b) ARES MT cache: 1. read request; 2. miss; 3. directly return without putting into cache; 4. no eviction; 5. read response.]

Figure 3.3: Comparison of read miss in traditional cache and ARES MT cache. (a) Traditional cache. (b) ARES MT cache.

ARES devised three key strategies to effectively address the above three technical challenges. Firstly, ARES decouples the BMT from its cache such that each works independently. To this end, we have designed a new caching policy whose read miss and write miss behaviors differ from those of the traditional cache. Essentially, when a read misses in the cache, the cacheline is directly returned to the upper logic without being put into the cache, as shown in Figure 3.3. When a cacheline write misses in the cache, the dirty flag of the cacheline is set according to the Merkle tree operation, i.e., verification or update, instead of always being set to dirty as in the traditional cache. Secondly, ARES proposes a split caching scheme that associates each Merkle tree level with a separate cache to parallelize cache accesses. Furthermore, leveraging the split caching architecture, ARES proposes a new hardware-friendly Merkle tree scheme that avoids control dependencies among recursive functions by reorganizing computation and pipelining time-consuming hashing operations, without violating the principle that only verified data/metadata can be brought into the trusted environment. Thirdly, ARES decomposes a target MT into multiple sub-trees such that different portions of the MT can be rebuilt in parallel in order to maximally leverage task-level parallelism. Similarly, counter recovery is also accelerated by parallelizing recovery modules. Furthermore, to exploit the reconfigurable computing power of modern FPGA logic, ARES uses a dedicated, fully pipelined data path for each metadata recovery.

Cache-Augmented Bonsai Merkle Tree
One crucial requirement of the Trusted Execution Environment (TEE) standard is that any data item residing within a secure boundary must be integrity-verified before its entry. Therefore, in ARES, a running program's state is constantly monitored by the memory integrity verification module in order to prevent an attacker from tampering with its off-chip untrusted memory. Specifically, in the case of the Merkle tree cache, all data blocks must be verified before being written into the cache. Consequently, the traditional cache with the write-back/write-through policy cannot be directly utilized in a Merkle tree. In ARES, we propose a new BMT (Bonsai Merkle Tree) caching scheme, which enables reordering all Merkle tree memory accesses and their associated hashing operations in order to minimize data and control dependencies. This decoupling of operational dependency is critically important for our ARES BMT caching scheme to be readily pipelined, as shown in Figure 3.2. However, such operation reordering in a BMT caching scheme can potentially violate the TEE's security requirement, i.e., only integrity-verified data can be written into the cache. To mitigate this issue, we designed a new ARES BMT cache that is both architecturally different from a conventional cache and controlled with a modified write-back caching policy. Specifically, when a read miss occurs in the cache, instead of transferring data from lower-level memory into the cache and then returning it, the ARES BMT cache directly returns the data to its upper logic without polluting the TEE with unverified data. As a result, the runtime performance of the ARES integrity verification process can be significantly improved while still strictly following the TEE's constraint.

Novel BMT Cache Architecture in ARES

[Figure 3.4 diagram: Merkle tree metadata in NVM is reached through an interconnect by per-level caches (MT0 Cache, MT1 Cache, ..., MTN Cache) that are accessed in parallel, separately from other metadata. Different levels have caches of different sizes, and there are no inter-level cache evictions.]

Figure 3.4: Cache partitioning for the Merkle tree. Different levels can have caches of different sizes according to their different reuse distances.

The TEE standard stipulates that only verified metadata can be written into a secure boundary. In ARES, the BMT cache is partitioned into independent sub-caches: each tree level is equipped with a separate cache of a different size. There are three main advantages of this design approach. First, splitting the cache guarantees that, when traversing along a tree branch, putting a tree node of a certain level into the cache will not evict other on-branch nodes. This guarantee greatly eases the design of the ARES Merkle tree caching scheme. Second, splitting the Merkle tree cache by levels readily enables parallel cache access. In contrast, for a unified cache to simultaneously process cache requests from multiple levels, miss status handling registers (MSHRs) are required to track in-flight requests. Unfortunately, implementing MSHRs with FPGA fabric consumes a huge number of registers and multiplexers to compare all the entries in parallel, which is quite detrimental to timing closure. Partitioning the cache, however, eliminates the need for MSHRs, making the design more resource-efficient and timing-friendly.

Modified BMT Cache Control Policy for ARES

Table 3.2: Traditional Write-Back Cache vs. ARES Merkle Tree Cache

Case | Traditional Write-Back Cache | ARES Merkle Tree Cache
Read Miss | 1. Read data from memory; 2. Put data into cache; 3. Return data to upper logic; 4. Write evicted dirty block to memory | 1. Read data from memory; 2. Return data to upper logic
Write Miss (Verification) | 1. Put data into cache, mark dirty; 2. Write evicted dirty block to memory | 1. Put data into cache, mark clean; 2. Write evicted dirty block to memory
Write Miss (Update) | 1. Put data into cache, mark dirty; 2. Write evicted dirty block to memory | 1. Put data into cache, mark dirty; 2. Write evicted dirty block to memory
Write Hit | Update the hit cacheline and mark dirty | Update the hit cacheline and mark dirty
Read Hit | Return the hit cacheline to upper logic | Return the hit cacheline to upper logic

In order not to violate the TEE requirement, we implemented a new caching policy specifically designed for the ARES Merkle tree cache. The differences between the standard write-back policy and the ARES MT policy are shown in Table 3.2. The ARES caching policy and the standard write-back caching policy differ in three cases:

1. On a read miss, the ARES MT cache does not put the data into the cache since the block is not verified yet. Once the data is verified by the upper logic, it is written into the cache, consequently generating two kinds of write misses, as explained in items (2) and (3).
2. During the verification process, the uncached data read directly from memory needs to be written back to the cache, which incurs a write miss. Since the data is only used for verification without being modified, the cacheline is marked clean such that it does not need to be written back when it is later evicted from the cache.
3. In contrast, during the Merkle tree update process, each uncached MT node read directly from memory is updated using its child nodes, so this type of write miss marks the cacheline dirty. The dirty cacheline requires a write-back when it is later evicted from the cache.
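The three policy differences listed above can be captured in a small software model. The sketch below is illustrative only: the class name `AresMTCache`, the `update` flag, and the LRU replacement policy are assumptions that the text does not specify, and a Python dictionary stands in for the untrusted memory.

```python
from collections import OrderedDict


class AresMTCache:
    """Toy model of the ARES MT cache policy (LRU replacement is an assumption)."""

    def __init__(self, capacity, memory):
        self.capacity = capacity
        self.memory = memory            # dict: address -> block (untrusted memory)
        self.lines = OrderedDict()      # address -> (block, dirty)

    def read(self, addr):
        """Return (block, hit). A read miss does NOT fill the cache."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return self.lines[addr][0], True
        return self.memory[addr], False  # unverified data bypasses the cache

    def write(self, addr, block, update):
        """Write a verified block; update=False is a verification write-back."""
        if addr in self.lines:
            self.lines[addr] = (block, True)      # write hit: update, mark dirty
            self.lines.move_to_end(addr)
            return
        if len(self.lines) >= self.capacity:      # write miss: may evict a line
            victim, (vblock, vdirty) = self.lines.popitem(last=False)
            if vdirty:
                self.memory[victim] = vblock      # only dirty victims are written back
        self.lines[addr] = (block, update)        # clean if verification, dirty if update
```

In this model a block only enters `lines` through `write`, that is, after the upper logic has verified it, which mirrors the requirement that unverified data never reside in the cache.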

ARES assumes that all cache accesses are at a granularity of 64 bytes, so updating part of a cacheline requires a read-modify-write. Support for in-cache partial updates is unnecessary since, in a Merkle tree, a hash function always takes a whole 64-byte cacheline as input.

New Merkle Tree Scheme for ARES BMT Cache
To maintain crash consistency, the root node is required to reflect all the counter updates and to be always up-to-date, such that the root can be used as a pivot to verify the success of metadata recovery and to guarantee memory integrity across system crashes. The traditional MT scheme, as listed in Algorithm 3, was widely used in prior works. Although this scheme seems reasonable and easy to implement in gem5 simulation, it is inefficient, if not infeasible, when naively implemented with FPGA logic in real hardware. Specifically, first, the traditional MT scheme is a recursive chain, which is only applicable to processors that use the von Neumann architecture. For spatial computing devices such as FPGAs, recursion is not feasible. Even though there are many C-to-RTL high-level synthesis tools, how to deal with recursive C code is still an extremely difficult research problem in the high-level synthesis field. Second, even if this recursive algorithm can be inelegantly flattened into separate modules by function renaming, the scheme is almost impossible to implement with FPGA hardware because speculative execution is heavily utilized. Finally, if the recursion is forcibly flattened without using speculative execution, the resulting scheme, shown in Figure 3.2, becomes a sequential algorithm that cannot exploit the FPGA's potential, such as pipelining and parallelism. Therefore, a new cache-aware Merkle tree scheme needs to be designed from scratch with the goal of exploiting FPGA pipelining in mind.

Algorithm 1 ARES Merkle Tree Counter Read
    L: number of internal Merkle tree levels
    hit[L]: cache hit status of internal Merkle Tree blocks
    block[L + 1]: an array of temporary registers
    mask[L + 1]: a mask indicating which level needs to be verified

    function PARALLELVERIFICATION
        verify block in parallel
        if all the levels masked by mask are verified then
            return 1
        end if
    end function

    function READCOUNTER
        // counter level: counter always needs to be verified
        read counter from memory and put into block[L]
        mask[L] ← 1
        // internal level
        for l ← L to 0 do
            if there is at least one hit from hit[l] to hit[L − 1] then
                mask[l] ← 0
            else
                read the l-th level tree block from cache and put into block[l]
                mask[l] ← 1
            end if
        end for
        if ParallelVerification then
            for l ← L − 1 to 0 do
                if mask[l] is 1 then
                    write block[l] into cache
                end if
            end for
            return counter
        end if
    end function

Algorithm 2 ARES Merkle Tree Counter Write
    L: number of internal Merkle tree levels
    hit[L]: cache hit status of internal Merkle Tree blocks
    block[L + 1]: an array of temporary registers
    mask[L + 1]: a mask indicating which level needs to be verified

    function SEQUENTIALUPDATE
        verify block in parallel
        if all the levels are verified then
            sequentially update each level in block
            return 1
        end if
    end function

    function WRITECOUNTER
        read all the internal levels from cache and put into block
        if SequentialUpdate then
            for l ← L − 1 to 0 do
                write block[l] into cache
            end for
            write counter to memory
            return
        end if
    end function
Given the partitioned design of the ARES BMT cache, we designed a new Merkle tree scheme with modified counter read and write operations. More specifically, as shown in Algorithm 1, when reading a counter value from the ARES BMT, the counter value is first read from main memory and written into a temporary register allocated within the ARES BMT controller. Subsequently, the cache controller queries the hit status of all the upper-level tree blocks within each level's cache. For each level, if there is a cache hit among the lower levels, there is no need to read this level from the cache and verify it, since the cached lower level can be used as a root; otherwise, this level is read from the cache and written into a temporary register. Consequently, a mask is generated according to the hit status of the levels and input to a parallel verification unit, which verifies the blocks in the temporary registers in parallel. The counter is returned to the upper logic only when the verification passes. As mentioned in Section 3, the controller needs to write all the read blocks that missed in the cache back into the cache such that they can be cached for future use. As such, after the verification, all the temporary registers involved in the verification are written back to the cache in parallel.
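The counter read flow described above can be modeled in software as follows. This is a simplified sketch following the prose description, not the RTL design: all function names are hypothetical, truncated SHA-1 stands in for the hash unit, each per-level cache is an unbounded Python dictionary (so evictions are not modeled), and the independent checks that hardware would evaluate in parallel are simply collected in a list.

```python
import hashlib

ARITY = 8  # child hashes per 64-byte tree node


def h64(block: bytes) -> bytes:
    """64-bit digest of a block (truncated SHA-1; the hash choice is an assumption)."""
    return hashlib.sha1(block).digest()[:8]


def slot(node: bytes, k: int) -> bytes:
    """k-th 8-byte child hash stored in a tree node."""
    return node[8 * k:8 * k + 8]


def build(counters):
    """Return (root, mem, L); mem[(l, i)] is node i of internal level l (0 = top)."""
    layers, children = [], list(counters)
    while len(children) > 1:
        hashes = [h64(c) for c in children]
        children = [b"".join(hashes[i:i + ARITY]).ljust(64, b"\0")
                    for i in range(0, len(hashes), ARITY)]
        layers.append(children)
    layers.reverse()
    mem = {(l, i): n for l, nodes in enumerate(layers) for i, n in enumerate(nodes)}
    return h64(layers[0][0]), mem, len(layers)


def read_counter(idx, counters, mem, caches, root, L):
    counter = counters[idx]                     # untrusted read from memory
    node_idx, i = [0] * L, idx
    for l in range(L - 1, -1, -1):              # index of the path node at each level
        i //= ARITY
        node_idx[l] = i
    block, mask = [None] * L, [0] * L
    for l in range(L - 1, -1, -1):              # bottom-up: stop at the first cached level
        if node_idx[l] in caches[l]:
            block[l] = caches[l][node_idx[l]]   # trusted copy: acts as the root
            break
        block[l] = mem[(l, node_idx[l])]        # read miss: the cache is not filled yet
        mask[l] = 1
    # Independent checks; hardware could evaluate all of them in parallel.
    checks = [h64(counter) == slot(block[L - 1], idx % ARITY)]
    for l in range(L):
        if mask[l]:
            expected = root if l == 0 else slot(block[l - 1], node_idx[l] % ARITY)
            checks.append(h64(block[l]) == expected)
    if not all(checks):
        raise ValueError("integrity verification failed")
    for l in range(L):
        if mask[l]:
            caches[l][node_idx[l]] = block[l]   # write miss (verification): clean line
    return counter
```

A typical use is to call `build` once over all counter blocks, create one empty dictionary per level, and then call `read_counter`. After the first read of a branch, later reads under the same parent stop at the cached level and never touch the higher levels in memory.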
Similarly, as shown in Algorithm 2, when writing a counter value into the ARES BMT, the controller first queries the hit status in the same way. All the levels are read from the cache and stored into temporary registers regardless of whether they are cached, since all the levels will be updated. Before sequentially updating each level, uncached levels that are read directly from memory need to be verified first, while the verification can be skipped if all levels are cached. For hardware simplicity, all the levels except the counter level are verified if at least one level is uncached, even though only the uncached levels require verification. Since verification is done in parallel, verifying all the levels adds no latency overhead. After verification is finished, all the tree blocks in the temporary registers are sequentially updated and written back to memory. The counter is also persisted back to memory if it is evicted from the counter cache.

[Figure 3.5 diagram: a simplified binary tree with root R and nodes A1, B1, B2, C1, C2, shown with the split cache and memory for three cases. (a) rd C2 and rd B2, where B2 hits in the cache, so there is no need to read A1; parallel verification. (b) rd C2, rd B2, and rd A1, where all levels miss; parallel verification, then wr B2 and wr A1. (c) rd B2 and rd A1; parallel verification and sequential update, then wr A1, wr B2, and wr C2.]

Figure 3.5: (a) Counter read (at least one level hits in the cache). (b) Counter read (all levels miss in the cache). (c) Counter write.

Examples are shown in Figure 3.5, where we use a simplified binary tree. In the examples, we focus on operations on C2 and its upper-level nodes B2, A1, and R. When reading a counter (C2) with at least one level (B2) already in the cache, as shown in Figure 3.5 (a), there is no need to read A1, since the cached B2 can be used as the root. The caching status of each level is also input to the verification unit to mask the levels that do not require verification. When no blocks are cached, as shown in Figure 3.5 (b), both A1 and B2 are first read from memory without being put into the cache. After parallel verification, both A1 and B2 are written into the cache. When writing a counter, as shown in Figure 3.5 (c), all the upper-level blocks are read from the cache. If at least one level is not cached (B2), all the levels are verified in parallel and then sequentially updated. The updated A1 is then updated within the cache, while the updated B2 is written into the cache as a write miss.
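The read cases in Figure 3.5 (a) and (b) can be sketched in a few lines of Python. This is only an illustrative model, not the ARES controller: the function name `plan_counter_read`, the list-based path, and the use of node labels as cache keys are assumptions made for the example.

```python
def plan_counter_read(path, cached):
    """Plan a counter read for a counter whose upper levels are `path`.

    `path` lists the upper-level tree blocks from the counter's parent up to
    (but excluding) the on-chip root R. `cached` is the set of blocks that
    currently sit in the split caches. Returns the blocks that must be read
    from memory and verified, plus the block that acts as the trust anchor.
    """
    to_fetch = []
    for node in path:
        if node in cached:
            # A cached block is already trusted, so it can serve as the root
            # and nothing above it needs to be read.
            return to_fetch, node
        to_fetch.append(node)
    # No level is cached: every level is fetched and the real root R anchors
    # the verification.
    return to_fetch, "R"


# Figure 3.5 (a): B2 is cached, so A1 is never read.
print(plan_counter_read(["B2", "A1"], cached={"B2"}))
# Figure 3.5 (b): nothing is cached, so B2 and A1 are both read and verified.
print(plan_counter_read(["B2", "A1"], cached=set()))
```

The returned fetch list corresponds to the levels that are left unmasked at the verification unit; in case (b) those blocks would also be written into their caches once verification succeeds.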


FPGA-Accelerated Metadata Recovery
For mission-critical applications, system availability is a strict requirement. As such, in ARES, after the system reboots from an unexpected crash, a metadata recovery phase is required before normal operation resumes. All the data blocks are sequentially traversed, and for each 64-byte data block, the accompanying 64-bit ECC, the 64-bit major counter, and the 7-bit minor counter are retrieved from memory. Note that during recovery no metadata access may go through the cache, since the metadata is not yet trusted. The major counter and the minor counter first compose an initialization vector (IV). The ECC and the data block are concatenated as a 576-bit tuple ($ECC_{cipher}$, $data_{cipher}$) and input to AES-CTR decryption using the IV to obtain $ECC_{plain}$ and $data_{plain}$, respectively. A new ECC is calculated over the data plaintext, $ECC_{new} = ECC(data_{plain})$, and compared with the decrypted $ECC_{plain}$. A correct counter is found only when the two ECCs match. There are three possible reasons for a mismatch between the two ECCs, as shown in Table 3.3.

Table 3.3: Common Sources and Fixes of ECC Mismatch, Ye et al. (2018)

| Error Type | Fix |
|---|---|
| Error on Data | Using $ECC_{plain}$ to recover $data_{plain}$ if the error is recoverable, e.g. single bit failure. |
| Error on ECC | Unrecoverable, replacing the device. |
| Stale/Wrong IV | A stale IV is used. Incrementing the counter, regenerating the IV, and trying again. |

Since AES-CTR does not propagate errors, i.e., an error in the i-th bit of the ciphertext results in an error only in the i-th bit of the plaintext, Osiris does not change the ECC's ability to fix the first two error types. If the mismatch has the third cause, i.e., a stale IV was used, the minor counter is incremented and a second decryption is conducted using the new IV. Due to the stop-loss mechanism used in the counter cache, the counter is guaranteed to fall within the range [CurrentCounter, CurrentCounter+N), where N is referred to as the Osiris persisting threshold (OPT), indicating how often counters are persisted back to memory, and CurrentCounter is the counter value found in NVM after the system reboots. The reasoning behind this recovery process is that, whereas a minor counter is incremented by 1 on each use for encryption, the stop-loss mechanism persists the counter in the cache every N increments. Therefore, the difference between the counter value in the volatile cache and the counter value persisted in NVM, which is the value observed after the system reboots, is bounded by N. After all the counters at the leaf level are recovered, the whole Merkle tree is rebuilt from bottom to top. For each 64-byte parent block, eight 64-byte child blocks are read from memory and each is hashed to fill the parent block. This process is iterated up to the root node. After the Merkle tree is rebuilt, the new root node is compared with the root kept within the secure boundary, both to detect tampering that went unnoticed across the system crash and to verify the success of the whole recovery process.
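The trial-decryption loop described above can be sketched as follows. This is a toy model under stated assumptions, not the ARES or Osiris implementation: the pad is derived with SHA-256 as a stand-in for AES-CTR, the 64-bit "ECC" is a truncated hash rather than a real error-correcting code, and the names `pad`, `ecc`, `encrypt`, and `recover_counter` are hypothetical.

```python
import hashlib

KEY = b"illustrative-key"  # hypothetical key, for the example only


def pad(counter: int, nbytes: int = 72) -> bytes:
    """Stand-in for the 576-bit AES-CTR pad derived from the key and the IV."""
    out = b""
    block = 0
    while len(out) < nbytes:
        seed = KEY + counter.to_bytes(16, "big") + block.to_bytes(4, "big")
        out += hashlib.sha256(seed).digest()
        block += 1
    return out[:nbytes]


def ecc(data: bytes) -> bytes:
    """Stand-in 64-bit check value (a truncated hash, not a correcting code)."""
    return hashlib.sha256(b"ecc" + data).digest()[:8]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt(data: bytes, counter: int) -> bytes:
    """Encrypt the concatenated (ECC, data) tuple with the pad for `counter`."""
    return xor(ecc(data) + data, pad(counter))


def recover_counter(cipher: bytes, stored: int, n: int):
    """Try every counter in [stored, stored + n) until the two ECCs match."""
    for candidate in range(stored, stored + n):
        plain = xor(cipher, pad(candidate))
        ecc_plain, data_plain = plain[:8], plain[8:]
        if ecc(data_plain) == ecc_plain:
            return candidate
    return None  # no candidate matched: data/ECC error or tampering


data = bytes(range(64))
cipher = encrypt(data, counter=13)  # the cache held 13, NVM still holds 10
print(recover_counter(cipher, stored=10, n=4))
```

With a persisting threshold of 4 the stale value 10 is advanced until the ECC check passes; shrinking the search window below the true gap makes the recovery fail, which mirrors why the counter must stay within N of its persisted copy.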
Figure 3.6: Parallel Merkle tree reconstruction exploiting both task-level parallelism (simultaneously rebuilding sub-trees) and data-level parallelism (pipelining the reconstruction of a single MT node).

Algorithm 3 Traditional Merkle Tree Caching Scheme
function READANDCHECK(blk)
    read data from memory
    H_new = HASH(data)
    return H_new to start speculative execution
    H_old = READ(parentBlk)
    if H_new ≠ H_old then
        raise exception
    end if
end function
function READ(blk)
    if blk is root then
        return root
    else
        if blk is cached then
            return data
        else
            data = READANDCHECK(blk)
            put data into cache
            return data
        end if
    end if
end function
function WRITE(blk, data_new)
    data_old = READ(blk)
    modify blk to the new value data_new in cache
    H_new = HASH(data_new)
    PARTIALWRITE(parentBlk, H_new)
end function
function PARTIALWRITE(blk, update)
    if blk is root then
        root = update
    else
        data_old = READ(blk)
        partially overwrite single 64-bit hash with update and get data_new
        WRITE(blk, data_new)
    end if
end function

Parallel counter recovery. When recovering counters, each counter block is traversed, and for each 7-bit minor counter all the possible counters within the range [CurrentCounter, CurrentCounter+N) are tried one by one to decrypt the associated data and ECC. In Osiris, no mechanism records which counters need recovery, so all the counters must be read from memory. When recovering a large NVM device, for example 512 GB, traversing all the counter blocks and consequently all the data blocks consumes a large amount of time. Thus it is necessary to accelerate the recovery process. ARES exploits FPGA parallelism to recover multiple counters concurrently by instantiating multiple recovery pipelines. Each counter recovery pipeline is responsible for a single counter block, as shown in Figure 3.7.
Parallel Merkle tree rebuilding. Without parallelization, rebuilding a Merkle tree from bottom to top is a time-consuming process. In ARES, we fully utilize the FPGA's parallelism to reconstruct a Merkle tree in parallel. Essentially, we split a Merkle tree into multiple sub-trees and a top-level tree, such that each sub-tree can be rebuilt independently to exploit task-level parallelism.

Figure 3.7: Parallel counter recovery. To recover a single counter block, the block is read and each of its minor counters is recovered. For each minor counter, the associated data block and ECC are read and decrypted with all the possible counter values until the correct one is found. After all the minor counters are recovered, the recovered counter block is written back to memory.

Depending on the number of available parallel Merkle tree rebuilding engines, the sub-trees are distributed round-robin to the engines. Each engine has a separate port to access the memory so that the engines can work independently. Figure 3.6 shows an example of splitting a Merkle tree into 8 sub-trees and 1 top-level tree and rebuilding the 8 sub-trees in parallel. For each sub-tree, the recovery process starts from the leaf level. When recovering a single parent MT node, each of its 8 child nodes is sequentially read from memory and hashed, and the hash value is filled into the parent MT node. After all 8 child nodes are traversed and hashed, the filled parent block is written back to memory. The process of retrieving a child MT node and calculating its hash value is fully pipelined in the design, so that a new child node can be read and hashed every clock cycle. Thus, by parallelizing and pipelining, multiple child MT nodes can be read from memory and hashed simultaneously, greatly improving the Merkle tree rebuilding performance.
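The sub-tree decomposition can be illustrated with a small software model. It is a sketch under assumptions rather than the hardware design: the tree is 8-ary, each 64-bit hash is a truncated SHA-1 digest, the engines are simulated by a plain loop instead of running concurrently, and the function names are hypothetical. The point it shows is that rebuilding the sub-trees independently and then the top tree gives the same root as a monolithic bottom-up rebuild.

```python
import hashlib

ARITY = 8  # eight 64-bit child hashes fill one 64-byte parent block


def h64(block: bytes) -> bytes:
    """64-bit hash of a block (SHA-1 truncated to 8 bytes for illustration)."""
    return hashlib.sha1(block).digest()[:8]


def build_parent_level(children):
    """Hash every group of eight children into one 64-byte parent block."""
    return [b"".join(h64(c) for c in children[i:i + ARITY])
            for i in range(0, len(children), ARITY)]


def rebuild(blocks):
    """Rebuild bottom-up until a single block remains, and return it."""
    while len(blocks) > 1:
        blocks = build_parent_level(blocks)
    return blocks[0]


def rebuild_split(leaves, n_subtrees=8, engines=2):
    """Rebuild sub-trees (assigned round-robin to engines), then the top tree."""
    size = len(leaves) // n_subtrees
    subtrees = [leaves[i * size:(i + 1) * size] for i in range(n_subtrees)]
    roots = [None] * n_subtrees
    for engine in range(engines):
        # Engine e handles sub-trees e, e + engines, e + 2 * engines, ...
        for idx in range(engine, n_subtrees, engines):
            roots[idx] = rebuild(subtrees[idx])
    return rebuild(roots)


leaves = [i.to_bytes(2, "big") * 32 for i in range(512)]  # 512 counter blocks
print(rebuild_split(leaves) == rebuild(leaves))  # True
```

Because each sub-tree only reads its own leaves, the per-engine loops have no data dependencies on each other, which is what allows the hardware engines to run them concurrently.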

Figure 3.8: System overview of ARES.

Architecture Design And Hardware Implementation

ARES System Overview
The overall system architecture of ARES, as depicted in Figure 3.8, consists of three key modules for NVM security: the Memory Encryption Unit (MEU), the Memory Verification Unit (MVU), and the Metadata Recovery Unit (MRU), which provide data confidentiality, data integrity, and metadata recovery, respectively. In our hardware implementation, we used a Xilinx MicroBlaze (uBlaze) as the embedded processor and mounted ARES after the L1 cache. uBlaze is configured to use a write-back cache with a 64-byte access granularity. Furthermore, since the Xilinx DRAM controller IP does not expose ECC to user logic, to correctly emulate a real system in which ECC and data are written or read simultaneously, we used a second DRAM as separate ECC storage so that data and ECC accesses are done in parallel. Specifically, the memory is split into four consecutive regions: data, HMACs, counters, and Merkle tree nodes. All memory accesses are at cacheline granularity (64 bytes). Each HMAC cacheline contains eight 64-bit MACs and covers a 512B data range. Each counter cacheline contains 64 minor counters and covers a 4KB data range. Each ECC cacheline contains eight ECCs and covers a 512B data range.

NVM Memory Encryption
ARES adopts the same memory encryption scheme as Osiris Ye et al. (2018). Rather than encrypting only the data with counter-mode AES, ARES encrypts both the 64-bit ECC and the 64-byte data. Thus, instead of generating a 512-bit OTP to XOR with the plaintext/ciphertext, ARES generates a 576-bit OTP. In encryption, the ECC and data are concatenated and XOR'ed with the OTP, and in decryption, the ciphertexts of the ECC and data are concatenated and XOR'ed with the OTP to obtain the plaintexts. Thus, given a counter, the plaintext and ciphertext of an (ECC, data) tuple should satisfy $Dec_{ctr}(ECC_{cipher}) = ECC(Dec_{ctr}(data_{cipher}))$. AES-CTR does not propagate errors, i.e., an error in the i-th bit of a ciphertext results in an error only in the i-th bit of the decrypted plaintext; thus the original functionality of the ECC bits is not affected.
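The non-propagation property follows directly from the XOR structure and can be checked with a short Python sketch. The pad here is generated with SHA-256 purely as a stand-in for the AES-derived OTP (an assumption of this example), and `pad`, `flip_bit`, and `differing_bits` are hypothetical helper names; the result does not depend on how the pad is produced.

```python
import hashlib


def pad(counter: int, nbytes: int = 72) -> bytes:
    """Stand-in 576-bit pad; ARES derives the real one with AES from the IV."""
    out = b""
    block = 0
    while len(out) < nbytes:
        out += hashlib.sha256(counter.to_bytes(16, "big")
                              + block.to_bytes(4, "big")).digest()
        block += 1
    return out[:nbytes]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def flip_bit(buf: bytes, i: int) -> bytes:
    """Return a copy of `buf` with bit i flipped (bit 0 is the MSB of byte 0)."""
    out = bytearray(buf)
    out[i // 8] ^= 0x80 >> (i % 8)
    return bytes(out)


def differing_bits(a: bytes, b: bytes):
    """List the bit positions at which `a` and `b` differ."""
    return [8 * k + j
            for k, (x, y) in enumerate(zip(a, b))
            for j in range(8)
            if (x ^ y) & (0x80 >> j)]


plain = bytes(range(72))            # 8-byte ECC followed by 64-byte data
otp = pad(counter=5)
cipher = xor(plain, otp)            # encryption
corrupted = flip_bit(cipher, 100)   # single-bit fault in the stored ciphertext
decrypted = xor(corrupted, otp)     # decryption with the same pad
print(differing_bits(plain, decrypted))  # [100]
```

Since a single-bit fault in the ciphertext stays a single-bit fault in the plaintext, an ECC computed over the plaintext can still correct it after decryption.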

Counter Cache with Recovery Support
The counter cache in ARES differs from a traditional cache in that it supports recovery. As shown in Figure 3.9, we designed a direct-mapped cache for hardware resource efficiency. Besides the 64-byte data, each cache entry contains an extra counter recording how many times the cacheline has been updated within the cache. When the persisting threshold is N, the possible values of this counter field lie within the range [0, N). Every N updates within the cache, i.e. when the counter overflows, the data is persisted back into memory and the counter is reset to 0. Unlike an evicted cacheline, a persisted cacheline is written directly into memory without the need to update the downstream Merkle tree. This guarantees that the persisted data block is successfully written back to memory before a system crash. When a counter read miss or write hit occurs, the counter cache transfers the address of the counter being accessed to the memory verification unit. The memory verification unit returns a response to the counter cache controller only when the verification/update is successful. If the verification fails, e.g. because data has been tampered with, the required lower-level access does not return a response, so the counter cache controller hangs. This behavior is desired, since once an unnoticed modification is detected there is no point in continuing to use the polluted device. The hanging of the counter cache guarantees that there is no subsequent use of the device.

[Figure annotations: a read miss requires MT verification; a write hit and an eviction require an MT update.]
Figure 3.9: Counter cache supporting counter recovery.
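The per-entry update counter described above can be modeled with a small Python class. This is a behavioral sketch only, assuming a single cacheline and ignoring tags, eviction, and the Merkle tree; the class and attribute names are hypothetical.

```python
class CounterLine:
    """One counter-cache entry with a stop-loss update counter."""

    def __init__(self, threshold: int):
        self.threshold = threshold   # persisting threshold N
        self.value = 0               # counter value held in the volatile cache
        self.persisted_value = 0     # last value written to persistent memory
        self.updates = 0             # update counter, always within [0, N)
        self.persist_writes = 0      # number of writes issued to memory

    def increment(self):
        self.value += 1
        self.updates += 1
        if self.updates == self.threshold:
            # The update counter overflows: persist the line and reset it.
            self.persisted_value = self.value
            self.persist_writes += 1
            self.updates = 0


def persist_writes_for(total_updates: int, threshold: int) -> int:
    line = CounterLine(threshold)
    for _ in range(total_updates):
        line.increment()
        # The cached value never runs N or more ahead of the persisted copy.
        assert 0 <= line.value - line.persisted_value < threshold
    return line.persist_writes


for n in (2, 4, 8, 16):
    print(n, persist_writes_for(64, n))
```

In this model the number of persisting writes is the number of updates divided by N, so doubling N halves the writes as long as the line stays in the cache, while the gap that recovery has to search stays below N.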

Memory Verification Unit
The memory verification unit consists of a Merkle tree engine (MTE), which contains a fully pipelined SHA1 engine with a latency of 31 clock cycles and an initiation interval of 1, and multiple separate caches. Each MT cache is a direct-mapped cache without a victim buffer, due to the reuse distance of MT blocks. Each cache is associated with one Merkle tree level and is sized according to the hit rate of that level. A lower level has a lower hit rate and requires a larger cache, while a higher level has a higher hit rate and a small cache is sufficient. An MT controller is responsible for issuing addresses to each cache and collecting the hit status of each required tree block within the corresponding cache. These hit statuses are concatenated and input to the MTE. The MTE leverages the SHA1 module to verify all the levels in a pipelined fashion.

Figure 3.10: Memory Verification Unit
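A back-of-the-envelope model shows why pipelined verification scales better than sequential verification. It uses the 31-cycle latency and initiation interval of 1 quoted above and counts hash cycles only; memory reads, cache lookups, and control overhead are deliberately ignored, and the function names are hypothetical.

```python
HASH_LATENCY = 31          # clock cycles for one SHA1 pass through the pipeline
INITIATION_INTERVAL = 1    # a new input can enter the pipeline every cycle


def sequential_hash_cycles(levels: int) -> int:
    """Hash cycles when each level waits for the previous hash to finish."""
    return levels * HASH_LATENCY


def pipelined_hash_cycles(levels: int) -> int:
    """Hash cycles when all levels are fed back-to-back into one pipeline."""
    if levels == 0:
        return 0
    return HASH_LATENCY + (levels - 1) * INITIATION_INTERVAL


for levels in range(1, 6):
    print(levels, sequential_hash_cycles(levels), pipelined_hash_cycles(levels))
```

For five levels the model gives 155 cycles sequentially against 35 cycles pipelined, i.e. each additional level costs one extra cycle instead of a full hash latency.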

Metadata Recovery Unit
For security metadata recovery, throughout our implementation, we assume that the ARES system does not crash during outstanding AXI transactions, so that all on-chip transactions are guaranteed to finish atomically. Additionally, memory write operations are assumed to be atomic, i.e., all write operations are guaranteed to be already persisted when the system crashes. In practice, this atomicity can be guaranteed by using a Write Pending Queue (WPQ) and Asynchronous DRAM Refresh (ADR) Edirisooriya et al. (2018); Rudoff (2016), which collects all memory write operations into a small buffer and maintains enough power to flush the contents of the WPQ. How to guarantee the atomicity or persistence of on-chip or off-chip transactions is out of the scope of this work, and we assume that all the above assumptions hold. Specifically, throughout the work, we used a DRAM to emulate a persistent memory, since there is no off-the-shelf persistent memory compatible with the FPGA board we are using.

Figure 3.11: Metadata Recovery Unit

Architecturally, the metadata recovery unit consists of multiple parallel counter recovery engines (CTREs) and multiple parallel Merkle tree rebuilding engines (MTREs). On receiving a start signal from the processor, each CTRE traverses its data blocks and recovers the major and minor counters. Each CTRE contains an AES module to decrypt the ciphertext and the associated ECC bits for each minor counter. For each minor counter, there are N possible values to try, and the N trials are pipelined, sharing the same AES instance. P CTREs work in parallel in the design. After the counters are recovered, P MTRE instances recover sub-trees in parallel. Each MTRE contains a SHA1 instance for calculating the hash value of a child block and updating the parent block. After the sub-trees are reconstructed, one MTRE instance is reused to build the top tree. The sub-trees are distributed round-robin to the P parallel recovery instances. For simplicity, we assume the number of sub-trees is a multiple of P.


Experiment And Result Analysis

Experimental Setup
To evaluate the hardware efficiency and the run-time performance of the proposed Merkle tree scheme and the accelerated metadata recovery, we implemented a complete ARES system on a Xilinx U200 data center accelerator card equipped with four 16GB DDR4 devices, each supporting a data rate of up to 2400 MT/s. Since the Xilinx DDR4 controller IP does not expose ECC bits to the user logic, we used one DRAM to store data blocks, counters, HMACs, and Merkle tree blocks and a second DRAM to store ECC bits separately. With two DRAMs, a 64-byte data block and its associated 64-bit ECC can be accessed independently and simultaneously, thus accurately emulating the timing of atomically accessing a data block and its ECC bits through a DRAM controller that exposes the ECC interface. A Xilinx MicroBlaze soft processor, containing a 32 KB Level-1 data cache (L1 cache) and no Level-2 cache, is instantiated as the embedded processor operating at 100MHz. The ARES subsystem is mounted after the L1 cache so that all requests to lower-level memory are protected. Both the counter cache and the Merkle tree cache are developed in Verilog, while AES encryption, HMAC, and the Merkle tree scheme are developed with Vivado HLS. To enhance hardware portability, the different memory protection functionalities are wrapped as separate AXI-compliant IPs and connected through the AXI interconnect. In addition, the Merkle tree cache for each level requires a separate port connecting to DRAM. The metadata recovery also requires a separate port for each instance to avoid port conflicts. Because the Xilinx AXI interconnect IP allows each slave device (the DDR in this design) to connect to at most 16 master ports, the current design can only support a Merkle tree of 5 levels, which protects a data range of 128MB in total. This limit can be eliminated by replacing the current AXI-based design with an RTL design, and we leave this to future work. The number of Merkle tree levels and the memory size required to store each type of metadata are listed in Table 3.4. According to the DDR4 specification, a DDR4-2400 device has a bandwidth of up to 19.2GB/s (2400 MT/s over a 64-bit data bus). Therefore, according to our performance model in the APPENDIX, the parallelism should not be larger than 3 before the DRAM bandwidth saturates, and consequently we set the parallelism to 2 in the experiments unless otherwise mentioned. In the experiments, the counter recovery parallelism $P_{ctr}$ and the MT recovery parallelism $P_{mt}$ are set to the same value P.
In addition to ARES, we also implemented and tested an unprotected system (insecure) and a baseline protection subsystem (baseline) using the traditional non-optimized Merkle tree scheme. To keep the experimental comparisons fair, all three implementations follow the same configuration, such as cache sizes, and their clock frequencies are all set to 100MHz. To make our benchmarking comprehensive, we measured seven representative applications from three different benchmark suites, which are categorized as memory-intensive in prior literature Pallister et al. (2013). Similar to the performance measurement used in STREAM McCalpin (1995), we ran each benchmark ten times and collected the average performance over the last nine runs, with the first run used to warm up the caches.

Table 3.4: Metadata Ranges for Protecting 128MB Data.

| Data Range | HMAC Range | Counter Range | # MT Levels | MT Range | ECC Range |
|---|---|---|---|---|---|
| 128 MB | 16 MB | 2 MB | 5 | 292.56 KB | 16 MB |
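The entries of Table 3.4 can be reproduced from the memory layout with a short calculation. The sketch below assumes one 64-bit MAC and one 64-bit ECC per 64-byte line, one 64-byte counter block per 4KB of data, and an 8-ary Merkle tree whose topmost block is counted in the MT range; the function name and the dictionary keys are hypothetical.

```python
def metadata_ranges(data_bytes, line=64, mac_bytes=8, ecc_bytes=8,
                    counter_cover=4096, arity=8):
    """Return the size in bytes of each metadata region and the MT depth."""
    data_lines = data_bytes // line
    hmac = data_lines * mac_bytes            # one 64-bit MAC per data line
    ecc = data_lines * ecc_bytes             # one 64-bit ECC per data line
    counter_blocks = data_bytes // counter_cover
    counters = counter_blocks * line         # one 64-byte block per 4KB of data
    nodes, levels, mt_blocks = counter_blocks, 0, 0
    while nodes > 1:
        nodes = -(-nodes // arity)           # ceiling division: parents needed
        mt_blocks += nodes
        levels += 1
    return {"hmac": hmac, "ecc": ecc, "counters": counters,
            "mt_levels": levels, "mt": mt_blocks * line}


MB = 1024 * 1024
r = metadata_ranges(128 * MB)
print(r["hmac"] // MB, r["counters"] // MB, r["ecc"] // MB,
      r["mt_levels"], round(r["mt"] / 1024, 2))
# prints: 16 2 16 5 292.56
```

The 32768 counter blocks are covered by 4096 + 512 + 64 + 8 + 1 = 4681 tree blocks, which is where the 292.56 KB figure comes from.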

Resource Utilization
We first evaluated the resource utilization of the different security primitives used in ARES. As shown in Table 3.6, ARES occupies only 13.3% of the LUT, 6.3% of the FF, and 5.5% of the BRAM resources. Among the hardware components, the Merkle tree consumes the most hardware resources due to its resource-intensive SHA1 module. Compared to the total hardware resources available on a mainstream FPGA device, the resource utilization of ARES is minor, so the majority of logic resources can still be used for other tasks such as computing acceleration. Overall, given its high security assurance and its relatively low hardware usage, ARES proves to be highly cost-effective for most embedded computing applications.

Table 3.5: Composition of ARES Benchmarking Suite (Memory-Intensive Applications).

| Applications | Benchmark Suite | Problem Size |
|---|---|---|
| STREAMCOPY | STREAM McCalpin (1995) | 16 MB |
| STREAMSCALE | STREAM McCalpin (1995) | 16 MB |
| STREAMADD | STREAM McCalpin (1995) | 16 MB |
| STREAMTRIAD | STREAM McCalpin (1995) | 16 MB |
| FIR | DSPStone Zivojnovic (1994) | 128 MB |
| IIR | DSPStone Zivojnovic (1994) | 12 MB |
| BINARYSEARCH | BEEBS Pallister et al. (2013) | 128 MB |

Modifications: floating-point operations were changed to integer operations; for BINARYSEARCH, the key to search is randomly generated.

Table 3.6: Resource Utilization of Hardware Components

| Hardware Module | LUT (1182240) | FF (2364480) | BRAM (2160) |
|---|---|---|---|
| HMAC | 6205 (0.5%) | 6133 (0.3%) | 0 (0.0%) |
| AES | 36997 (3.1%) | 13038 (0.6%) | 0 (0.0%) |
| Merkle Tree | 39826 (3.4%) | 72267 (3.1%) | 95 (4.4%) |
| Counter Recovery | 36001 (3.0%) | 13053 (0.6%) | 0 (0.0%) |
| MT Rebuilding | 19364 (1.6%) | 15377 (0.7%) | 0 (0.0%) |
| Counter Cache | 3391 (0.3%) | 4019 (0.2%) | 8.5 (0.4%) |
| MT Cache (Total) | 15791 (1.3%) | 24938 (1.1%) | 16 (0.7%) |
| Total | 157575 (13.3%) | 148825 (6.3%) | 119.5 (5.5%) |

Performance Analysis
We experimented with an insecure system, a baseline, and an ARES prototype running seven benchmarking applications selected from three well-studied benchmarking suites. As shown in Figure 3.12, ARES outperforms the baseline on all benchmarking applications. The performance improvement of ARES over the baseline depends on the specific memory access pattern, because there are multiple caches at different levels within both the ARES and baseline systems. In the original Merkle tree scheme, the number of hashing operations depends on the caching status of each Merkle tree level, and all the hash functions are executed sequentially. In contrast, the ARES Merkle tree scheme verifies all the Merkle tree levels in a pipelined fashion, so less performance overhead is incurred. A deeper analysis is given later in this section. Across all the applications, the performance of ARES is on average 19 times lower than that of the insecure system. The performance overhead of ARES over an insecure system is mainly determined by the caching behavior of the program, since overhead is incurred only when accessing the lower-level memory. For instance, according to our measurements, the BINARYSEARCH application incurs only 13 accesses to lower-level memory while all other accesses are served from the cache; therefore the performance of ARES is only 5 times lower than that of the insecure system. In contrast, the IIR application is compute-intensive, so the less powerful MicroBlaze spends a large portion of its time on computation and the memory access overhead of ARES is masked by the computation. Another important factor determining the overhead of ARES is the cache implementation. Currently in ARES, both the counter cache and the Merkle tree cache are implemented as direct-mapped caches, which typically have a lower hit rate than set-associative caches. The L1 cache of our MicroBlaze processor is also a direct-mapped cache. Different cache configurations heavily affect the system performance overhead; however, the study of different cache configurations is out of the scope of this work, and we leave it to future work. Compared to the baseline with a non-optimized Merkle tree scheme, ARES outperforms it by 1.4 times. The performance improvement of ARES comes from the parallel verification; therefore, the deeper a Merkle tree is, the more ARES benefits.

[Bar chart; y-axis: normalized execution time; series: Insecure System, ARES, Baseline; x-axis: the seven benchmark applications.]
Figure 3.12: Total execution time of benchmarking applications.

[Bar chart; y-axis: bandwidth (MB/s); series: Insecure System, ARES, Baseline; x-axis: seq-write, seq-read, rand-write, rand-read.]
Figure 3.13: Memory access bandwidth in 8-byte granularity.

[Bar chart; y-axis: bandwidth (MB/s); series: Insecure System, ARES, Baseline; x-axis: seq-write, seq-read, rand-write, rand-read.]
Figure 3.14: Memory access bandwidth in 64-byte granularity.
The total time of running an application is spent on both computation and memory accesses. Given that MicroBlaze operates at a relatively low clock frequency, the computation itself occupies a large portion of the total execution time of each application. Therefore, we ran another experiment that measures system performance with only sequential/random read/write memory accesses, to isolate the performance overhead of memory accesses. As shown in Figures 3.13 and 3.14, we tested all three systems with two access granularities: 8-byte accesses and 64-byte accesses. ARES currently does not support partial accesses; therefore, when a cacheline is evicted from the L1 cache with only part of it marked dirty, the ARES controller internally issues a data read request first, modifies the returned data block, and then issues a data write request, hence incurring two data accesses in total. As shown in Figures 3.13 and 3.14, reading/writing data at a granularity of 64 bytes typically yields higher memory bandwidth than accessing in 8 bytes. According to our measurements, the only exception is sequential read: the read-only benchmark does not require writing back dirty data cachelines, and, due to the sequentiality of the access pattern, the caching behavior of each cache is the same, thus producing the same performance. Overall, our measurements indicate that ARES memory protection decreases the overall system performance by 13 times on average. However, compared against the baseline implementation with the traditional Merkle tree scheme, ARES is faster for all applications, by up to 2.6 times.
To further investigate the impact of the Merkle tree caching scheme on a single memory access, we collected more detailed data by measuring the average data access latency with the 64-byte memory benchmark. Since there are multiple caches, i.e. the counter cache and the MT4∼MT0 caches, the latency of a single data access varies. For example, when reading a data block, if its associated counter block is already in the counter cache, no verification is needed and the Merkle tree verification is not activated. If the counter block misses in its cache, it must be verified first, so its associated last-level Merkle tree block (level-4 MT block) is needed. If the needed MT block is cached, the higher levels are not necessary, while if it misses in the cache, an upper-level block must be read. As a result, the latency of a data access highly depends on the caching status of each metadata cache. In our experiments, we measured the average data access latency under 7 common cases: counter miss, counter hit, and MT4∼MT0 hit. The counter miss represents the scenario in which the desired counter block is not in the cache. The counter hit represents the scenario in which a counter hits in the cache, so that no counter verification is required on a counter read (assuming there is no dirty block eviction). An MT4∼MT0 hit denotes the lowest Merkle tree level that hits in the cache. Due to the hierarchical structure of the Merkle tree, a higher-level block has better data locality than its child level. In other words, if an MTx block hits in the cache, MT(x − 1) is very likely already in its associated cache. Therefore, in our experiments, an MTx hit refers to the case in which all the upper levels MTx · · · MT0 hit in the cache, while all the lower levels MT4 · · · MT(x + 1) miss in the cache.

[Bar chart; y-axis: latency (clock cycles); series: ARES, Baseline; x-axis: counter hit, MT4 hit, MT3 hit, MT2 hit, MT1 hit, MT0 hit, counter miss.]
Figure 3.15: Average latency of data read.

As shown in Figure 3.15, the average read latency of the baseline increases as more Merkle tree levels need to be verified. From MTx to MT(x − 1), an extra level must be verified, which incurs an extra hashing operation. As shown in Figure 3.15, as the lowest hit level moves toward the root, requiring more and more hashing operations, the read overhead of ARES increases very slowly compared to the baseline. Because the baseline Merkle tree scheme verifies Merkle tree levels sequentially, the verification overhead is linearly proportional to the number of levels to verify. In contrast, since ARES verifies all the levels in parallel and the hash unit can accept a new input every clock cycle, the latency grows sublinearly with the number of levels to verify. This performance difference can also be predicted from our performance model. According to our performance model in the APPENDIX, the change of data read latency with respect to L for ARES ($T_{read}$) and the baseline ($T'_{read}$) is:

$$\frac{\Delta T'_{read}}{\Delta L} = T_{hash} + T_{memrd} + T_{mt\,hit} \approx T_{hash} \tag{3.1}$$

$$\frac{\Delta T_{read}}{\Delta L} = \frac{B}{BW_{memrd}} \approx 0 \tag{3.2}$$

where $T_{memrd}$ and $T_{mt\,hit}$ denote the latency of a memory read and of a Merkle tree cache read hit, respectively, and $BW_{memrd}$ denotes the memory read bandwidth. Equations 3.1 and 3.2 explain that when L increases, i.e. when more levels need to be verified, the data read latency of ARES is almost constant while the latency of the baseline is linear in L. However, as shown in Figure 3.15, even on a counter cache hit, which is the scenario with the shortest latency, the data access latency is still 445 clock cycles. To reveal the performance bottleneck, we further broke down the data read overhead in the counter-hit scenario by measuring the latency of the different components, as shown in Figure 3.16. When a requested counter hits in the cache, the average latency is 445 clock cycles, of which 404 clock cycles are spent on calculating the HMAC of the data. This long latency arises largely because the HMAC algorithm consists of two nested SHA1 operations.

[Bar chart; y-axis: latency (clock cycles); bars: AES, HMAC, Metadata Access.]
Figure 3.16: Breakdown of average read latency when counter hits within the cache. AES and HMAC are executed in parallel when reading a data.

[Bar chart; y-axis: time (s); x-axis: Osiris persisting threshold N = 2, 4, 8, 16; series: Sequential Write, Random Write.]
Figure 3.17: Impact of Osiris persisting threshold on runtime performance.

[Bar chart; y-axis: counter writes (×10^6); x-axis: Osiris persisting threshold N = 2, 4, 8, 16; series: Sequential Write, Random Write.]
Figure 3.18: Impact of Osiris persisting threshold on number of writes to persistent memory.

[Bar chart; y-axis: time (ms); x-axis: recovery parallelism P; values: 1230.7 (P = 1), 668.8 (P = 2), 387.4 (P = 4).]
Figure 3.19: Impact of recovery parallelism on recovery time.

[Bar chart; y-axis: time (ms); x-axis: Osiris persisting threshold N; values: 416.6 (N = 2), 668.8 (N = 4), 1172.6 (N = 8), 2173.8 (N = 16).]
Figure 3.20: Impact of Osiris persisting threshold on recovery time.
The Osiris persisting threshold N, defined in Section 3, affects the counter cache performance
since a counter block must be persisted every N updates within the cache. Therefore, N determines
the number of counter writes and in turn affects the lifetime of an NVM device. We measured
the impact of N on the overall system performance and on the number of counter persists for the
memory benchmarks, as shown in Figures 3.17 and 3.18. When N increases, the overall runtime performance
is barely affected because counter persistence is not the key performance bottleneck. However,
N greatly affects the number of counter write-backs and consequently the NVM lifetime. As shown in
Figure 3.18, when N increases from 2 to 16, the number of counter persists in a sequential write
application is reduced by half at each step. In contrast, for a random write application, the number of
counter persists is typically very small, because a counter is persisted only after it has been updated
N times, which implicitly requires that the counter reside in the counter cache for at least
N writes. In a random application, because the counter access pattern is arbitrary, a
counter cacheline rarely stays within the cache long enough to be updated N times. In this scenario, when
increasing N from 2 to 4, the number of counter persists is reduced almost to zero. Therefore,
the benefit of increasing the Osiris persisting threshold N highly depends on the memory access
pattern. An application with low locality, such as a graph application, may not benefit from
a higher N, but for applications with good data locality, increasing N can greatly extend the
NVM lifetime even though the runtime performance is largely unaffected.
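The dependence on access pattern can be illustrated with a toy model. The sketch below is not the measured system: it assumes a small LRU counter cache, counts only threshold-triggered persists (eviction write-backs are ignored), and uses made-up sizes for the cache and address space.

```python
from collections import OrderedDict
import random


def count_threshold_persists(block_ids, n, cache_lines=64):
    """Count persists triggered by reaching n updates while a block stays cached."""
    cache = OrderedDict()  # counter block id -> updates since last persist
    persists = 0
    for blk in block_ids:
        pending = cache.pop(blk, 0) + 1
        if pending == n:
            persists += 1
            pending = 0
        cache[blk] = pending               # re-insert as most recently used
        if len(cache) > cache_lines:
            cache.popitem(last=False)      # evict least recently used block
    return persists


# Sequential writes: 64 consecutive data lines share one counter block.
sequential = [i // 64 for i in range(6400)]
# Random writes over a large (hypothetical) space of counter blocks.
rng = random.Random(0)
scattered = [rng.randrange(1 << 20) for _ in range(6400)]
```

In this model the sequential trace halves its persist count each time N doubles, while the scattered trace almost never accumulates N updates before eviction.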

Recovery Performance
As shown in Figures 3.19 and 3.20, we measured the time to recover all counters and Merkle tree nodes
after a system crash. To precisely emulate the system crash, in our experiments we first ran a
sequential write application such that the metadata cache was completely filled with dirty cachelines.
Subsequently, we intentionally cleared all cache contents, including both the counter cache and the MT
caches, without writing them back to main memory, in order to emulate a system crash where
all the metadata within the volatile caches is lost. Afterwards, the recovery module is activated to
traverse counters and Merkle tree nodes to locate and recover all stale metadata. As mentioned in
the previous section, the total recovery time depends on both the Osiris persisting threshold N and
the parallelism factor P.
Impact of recovery parallelism: We first measured how the recovery time varies as the recovery
parallelism P changes. As shown in Figure 3.19, when P increases from 1 to 2, the recovery
time is shortened due to the increased recovery parallelism. The recovery performance can also be
accurately predicted from the performance model in Equation .10 and Equation .12. In these
performance models, both the counter recovery time Tctr and the Merkle tree recovery time Tmt are inversely
related to the parallelism factor. However, when increasing the parallelism factor from 2 to 4, the
overall speedup decreases because the Merkle tree rebuilding is bounded by the memory bandwidth.
As shown in Equation .10 and Equation .13, when the parallelism factor P reaches
4, the total peak memory bandwidth required is 23 GB/s (64B×100MHz×4), which
exceeds the full DDR4 bandwidth (19 GB/s); therefore, the performance of the Merkle tree recovery
process is bounded by memory bandwidth.
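The bandwidth bound can be checked with a short calculation. The sketch below simply evaluates 64B × 100MHz × P; reading the quoted 23 GB/s as units of 2^30 bytes per second is an assumption made here to reconcile the rounding.

```python
def peak_bandwidth_bytes_per_s(parallelism, node_bytes=64, clock_hz=100_000_000):
    """Peak bandwidth if every recovery lane consumes one 64-byte node per cycle."""
    return node_bytes * clock_hz * parallelism


GIB = 2 ** 30
DDR4_LIMIT = 19  # full DDR4 bandwidth quoted in the text, in GB/s

required_p2 = peak_bandwidth_bytes_per_s(2) / GIB  # about 11.9, under the limit
required_p4 = peak_bandwidth_bytes_per_s(4) / GIB  # about 23.8, over the limit
```

With P = 2 the demand stays below the DDR4 limit, while P = 4 exceeds it, which is consistent with the diminishing speedup reported between those two settings.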
Impact of Osiris persisting threshold: When recovering a counter ctr, all the candidate counters within the
range [ctr, ctr + N) are tried in order to find the correct one, such that the counter recovery time doubles
when N doubles. As shown in Figure 3.20, when N increases from 2 to 16, the recovery time
increases greatly. Since the Osiris persisting threshold only affects the counter recovery time while
the Merkle tree rebuilding is not affected, the total recovery time is sublinear in N. Additionally, as confirmed
by the performance model described in Equation .10 and Equation .12 (derivation details can be
found in APPENDIX ), the counter recovery time Tctr is linear in the Osiris persisting threshold
N while the Merkle tree rebuilding time Tmt does not depend on N. Overall, our measurements of
other performance metrics also closely matched our analytical model, hence the model can
be utilized as an effective estimation tool.
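The candidate search over [ctr, ctr + N) can be sketched as follows. This is a simplified illustration: the validity check is modelled as an HMAC over the data and the candidate counter, which is an assumption for this example, and `tag` and `recover_counter` are hypothetical helper names.

```python
import hashlib
import hmac


def tag(key: bytes, data: bytes, ctr: int) -> bytes:
    """Integrity tag binding a data line to one counter value (assumed check)."""
    return hmac.new(key, data + ctr.to_bytes(8, "little"), hashlib.sha1).digest()


def recover_counter(key, data, stored_tag, stale_ctr, n):
    """Try every candidate in [stale_ctr, stale_ctr + n); return (counter, tries)."""
    for tries, candidate in enumerate(range(stale_ctr, stale_ctr + n), start=1):
        if hmac.compare_digest(tag(key, data, candidate), stored_tag):
            return candidate, tries
    return None, n  # no candidate matched: recovery fails for this line


key = b"k" * 16
line = b"d" * 64
true_tag = tag(key, line, 13)  # the line was last written with counter 13
```

In the worst case all N candidates are tried, which is why the counter recovery time grows linearly with N.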

Conclusion

With cyberattacks rapidly surging, ARES offers a promising methodology to bolster the persistent
security and trustworthiness of today's embedded computing and IoT platforms, especially those
equipped with emerging byte-addressable NVM technology as their main memory. Overall,
our comprehensive study of ARES, with both hardware prototyping and analytical modelling, has
demonstrated that the ARES scheme is not only feasible by fully leveraging the inherent logic
reconfigurability of modern heterogeneous SoCs, but also provides impressive security assurance
with only reasonable hardware and performance overheads.

CHAPTER 4: HERMES: HIGH-THROUGHPUT SPECULATIVE BMT OPERATION

Motivation And Contribution
[Figure 4.1 timing diagram. Rows: Ctr Lvl, MT1 Lvl, MT0 Lvl, Root Lvl; horizontal axis: Time. Each counter update chain is W(C) H(C) U(A) H(A) U(B1) H(B1) U(R). (a) No overlapping between C1 and C2 updates. (b) C1 and C2 updates are overlapped, with the second chain starting after an initial interval.]

Figure 4.1: (a) Blocking BMT implementation: C1 and C2 are updated sequentially. (b) HERMES
BMT exploiting task-level parallelism: C1 and C2 updates are overlapped.

Even though ARES improves BMT performance over a naively flattened BMT implementation,
each BMT operation is blocking and sequential, such that a new incoming counter request has to
wait for the completion of the previous one before being processed. Consequently, counter operation
bandwidth is dominated by the latency of BMT operations. In particular, when updating a counter,
all higher-level BMT nodes must be updated, such that the counter update bandwidth is severely
limited by the time-consuming BMT update. As shown in Figure 4.2, when persisting data to
NVM that is protected by a 5-level BMT, the latency of sequentially updating the BMT is 800
clock cycles. As such, the blocking and sequential BMT update of ARES limits the bandwidth
[Figure 4.2 bar chart. Vertical axis: Latency (clock cycles). Bars: AES 12, HMAC 404, 5-Level BMT 800.]

Figure 4.2: Computing latency of different security primitives used in ARES. In ARES, the latency
of SHA1 is 160 clock cycles, such that updating a 5-level BMT requires at least 800 clock cycles,
without considering NVM access latency.

of counter update, consequently decreasing data persistence bandwidth. Without novel computing
optimizations, the crash-consistent security primitives for NVM severely decrease the available memory
bandwidth, such that they remain impractical for use in a real system. To further improve
counter operation performance, we designed a dataflow architecture such that multiple incoming
counter requests can be processed in flight.
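[Figure 4.3 diagram. Root (8B) is stored in a persistent register inside the secure and persistent boundary. Merkle tree nodes (B1, A1, A2, ...) and encryption counters (C1, C2, ..., 64B each) are stored in NVM. Each parent holds an 8B hash of its child. Hash: 64B->8B.]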

Figure 4.3: An example 2-level Bonsai Merkle tree.

We use a 2-level BMT as an example to illustrate the performance overhead of the BMT used in
prior FPGA-assisted secure memory works and to demonstrate the performance of our proposed
dataflow optimization. Challenges and solutions are discussed in detail in later sections. Each
leaf BMT node is a counter cacheline containing 64 counters (we assume the split-counter scheme),
and each counter is associated with a single 64-byte data cacheline. As shown in Figure 4.3, two
counter update requests C1 and C2 arrive sequentially. When updating C1, A1, B1 and the root
R are sequentially updated by calculating a new hash value of the lower-level BMT node and
overwriting the associated 8-byte hash within the upper-level BMT cacheline. Similarly, when updating C2, A2, B1 and R
are sequentially updated. For simplicity, we assume A1, A2 and B1 are all already cached in each
level's BMT cache, such that no verification of BMT nodes is required. According to our synthesis results
for a fully optimized Vivado HLS implementation, a single SHA1 takes 24 clock cycles, and
consequently each counter update requires 72 clock cycles (three SHA1 operations) before a counter update
is transformed into a root persistence. ARES only focuses on latency optimization, and the BMT operation for
each counter is blocking, such that C2 has to wait until the C1 update is completed, as shown in
Figure 4.1 (a). In the example, each counter write operation (W) consumes 20 clock cycles and each
update (U) consumes 0 clock cycles, such that each 64-byte counter update takes 92 clock cycles and the counter
update bandwidth is up to about 0.7 bytes per clock cycle (roughly 70MB/s at a 100MHz clock frequency).
As such, even though BMT latency is optimized, the counter update bottlenecks system throughput,
especially in write-heavy applications where counters are frequently
updated.
By exploring the update pattern of the BMT, the proposed HERMES optimizes BMT throughput in
addition to the latency optimization adopted in ARES. Essentially, HERMES designs an architecture
in which an upper level depends only on the immediate lower level, such that multiple outstanding
counter updates can be processed simultaneously, as shown in Figure 4.1 (b). Following the same
system setup, the HERMES BMT counter update bandwidth is about 1.5 bytes per clock cycle (roughly 145MB/s at 100MHz), where the counter update
throughput is only dependent on the latency of the counter level (BMT leaf level). Analytically,
given an L-level BMT covering a total memory of size $4 \times 8^{L}$ KB, the counter update bandwidth of
ARES is $\frac{64\,\mathrm{B}}{(L+1) \times T_{hash} + T_{write}}$, while the bandwidth of HERMES is
$\frac{64\,\mathrm{B}}{T_{hash} + T_{write}}$, such that the counter update bandwidth is increased by
$\frac{(L+1) \times T_{hash} + T_{write}}{T_{hash} + T_{write}} \approx L + 1$.
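As a quick numerical check of the two expressions, the sketch below evaluates them with the values of the 2-level example (L = 2, a 24-cycle hash, a 20-cycle counter write, a 100MHz clock). The function names are illustrative only.

```python
def ares_bandwidth(levels, t_hash, t_write, clock_hz=100_000_000, update_bytes=64):
    """Blocking BMT: one 64-byte update per (L+1)*T_hash + T_write cycles, in B/s."""
    cycles = (levels + 1) * t_hash + t_write
    return update_bytes * clock_hz / cycles


def hermes_bandwidth(t_hash, t_write, clock_hz=100_000_000, update_bytes=64):
    """Pipelined BMT: one 64-byte update per T_hash + T_write cycles, in B/s."""
    cycles = t_hash + t_write
    return update_bytes * clock_hz / cycles


ares = ares_bandwidth(levels=2, t_hash=24, t_write=20)   # 92 cycles per update
hermes = hermes_bandwidth(t_hash=24, t_write=20)         # 44 cycles per update
speedup = hermes / ares
```

For these values the speedup is 92/44, or about 2.1, which is below L + 1 = 3 because the write latency is not negligible relative to the hash latency; the approximation tightens as T_write shrinks.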

Towards this high-level goal, we made the following contributions in this work:
• Conducted a deep analysis of the technical challenges of designing a dataflow architecture on
FPGAs, which relies on exploiting spatial computing capabilities.
• Proposed three efficient solutions to these challenges such that the BMT can be operated
speculatively in compliance with TEE security requirements.
• Designed a speculative dataflow architecture capable of processing multiple outstanding
counter requests simultaneously and integrating both counter update and counter
verification within the same set of hardware.
• Implemented HERMES on a Xilinx U200 FPGA board and evaluated it using a state-of-the-art
memory bandwidth benchmark; HERMES achieved a 70.7MB/s counter update bandwidth and a
156.3MB/s counter verification bandwidth, which are 5.3 and 7.9 times higher than the state of
the art, respectively.

Challenges

Despite the performance benefit of the HERMES dataflow architecture, there are three technical
challenges in designing an efficient dataflow architecture that exploits task-level parallelism while
strictly following the TEE's security requirements.
Challenge 1: The update of each level can only be executed after all levels are verified, consequently
each level is dependent on all levels. In a secure environment, all transactions on the
off-chip memory bus are untrusted and consequently require integrity verification. In BMT, only
[Figure 4.4 diagrams over a BMT with root R, nodes A1, B1, B2 and counters C1 to C4. (a) Ctr update at the second level: 1. Merge update, 2. Persist, 3. Send new hash. Cannot continue since B2 is not verified yet; cannot merge since C4 is not trusted. (b) Can safely merge update since B1, C1 are cached and trusted; cannot merge update since C4 is not cached even though B2 is cached. (c) Verify ctr: verification chain ends with a cached BMT node. Update ctr: update chain ends with root; A1 and R need to be updated even though B2 is cached.]

Figure 4.4: (a) Challenge #1: B2 is not verified yet, so it is not allowed to 1) persist B2 to NVM
after merging the update from the lower level or 2) send the hash calculated over the updated B2'
to the upper level; either would violate the security rules. C4 is not verified, thus the update passed from
the 3rd level cannot be safely merged. (b) Challenge #2: Even though B2 is cached and trusted,
A1 cannot merge the update since C4 is not cached. (c) Challenge #3: A counter update chain
always ends at the root level, while a verification chain terminates early when a node is cached.

the root node is stored in an on-chip non-volatile register, while all intermediate BMT nodes are updated
in NVM. Note that for simplicity, we temporarily assume there is no BMT cache; we will
discuss the caching behavior in Challenge 2. Thus, when operating the BMT, two security rules must
be strictly followed.

Security Rule 1 When reading a BMT node from off-chip NVM, the node must be verified before
being used (either to verify other BMT nodes or to be updated).

Security Rule 2 When persisting a BMT node to NVM, the node must be trusted.

For example, as shown in Figure 4.4 (a), when persisting a counter, the work at the second level consists
of three tasks: 1. Read B2 from NVM and merge the update sent from the third level; 2. Persist the
updated BMT node B2' back to NVM; 3. Compute a new hash value over B2' and send it
to the upper level. Ideally, after finishing the three tasks, the second
level can process the next input. However, when B2 is read from NVM, since B2 is not verified yet,
B2' is not allowed to persist into NVM. Also, there is no point in sending the hash value computed
over the unverified B2 to the upper level. Finally, since C4 is also unverified, an update calculated
from C4 is also untrusted and consequently should not be merged into B2. Thus, as indicated by the
blue arrows in Figure 4.4 (a), B2 and C4 must be verified by first using R to verify A1, then using A1
to verify B2, and finally using B2 to verify C4. Consequently, the update of B2 depends on all other
levels, making it challenging to design a dataflow architecture where each stage depends only on
previous stages. In addition, a BMT level cannot immediately process a new input and abandon the
result computed over the previous input, since the output will be used by all other levels. As such,
an efficient hardware architecture that pipelines BMT updates across levels without violating
the security rules is desired.
Challenge 2: When using BMT cache, BMT update behavior at each level is tightly correlated
with caching status of all BMT levels. We adopt an architecture similar to ARES, where each level
is associated with a separate BMT cache. As mentioned in Challenge #1, the logic of each level
(stage) consists of three tasks; however, the integration of a BMT cache increases the complexity
of each level's logic and causes the BMT node update logic to bifurcate according to the caching status
of the current-level node as well as the lower-level nodes. As shown in Figure 4.4 (b), the logic of a given
BMT level depends on the caching status of the current level and the lower levels. When the current
level and all the lower levels are already cached, no verification is required, since BMT caches
are included within the TCB such that all cached contents are trusted. As such, there is no need
to verify the current level. Moreover, since the update sent from the trusted immediate lower level
is also trusted, it can be safely merged (M) to update the current-level
BMT node. The updated BMT node is persisted into the cache (P), and a hash is computed and sent to
the upper level (S). When the current level is uncached, the current-level node is read from NVM and consequently
a verification of the current level (VC) is required ahead of other operations. When not all lower levels
are cached, the current level has to wait until all lower levels are verified (VL). The reasoning here
Table 4.1: Each level's update behavior depends on the caching status of the current level and all lower
levels.

Current Level    All Lower Levels    Behavior
Cached           Cached              M→P→S
Cached           Uncached            VL→M→P→S
Uncached         Cached              VC→M→P→S
Uncached         Uncached            VL→VC→M→P→S

M: Merge update from child level. P: Persist to cache. S: Compute hash and send update to parent level.
VC: Verify current level. VL: Verify all lower levels.

is that each counter update at the counter level results in an update at the immediate upper level,
which in turn generates an update by merging it and sends that update to the next higher level.
This process is repeated at each level, thus an update received at a certain level is essentially the
combined result of all lower levels. Consequently, when a level attempts to merge updates received
from the immediate lower level, not only the immediate lower level but all lower levels
must be verified first. As such, based on the caching status of the current level and all lower levels, there
are four cases when updating a certain level, as shown in Table 4.1.
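The four cases can be summarized as a small decision function. The sketch below follows the prose description (VL when any lower level is uncached, VC when the current level is uncached, then M, P, S); the function name is illustrative.

```python
def update_steps(current_cached, all_lower_cached):
    """Return the ordered step list for one BMT level during a counter update."""
    steps = []
    if not all_lower_cached:
        steps.append("VL")  # wait until all lower levels are verified
    if not current_cached:
        steps.append("VC")  # verify the current-level node read from NVM
    steps += ["M", "P", "S"]  # merge, persist to cache, send hash upwards
    return steps
```

For instance, `update_steps(False, False)` yields the longest sequence, VL, VC, M, P, S, while a fully cached path needs only M, P, S.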
Challenge 3: Counter update and verification exhibit two totally different datapaths. So far,
we have only discussed the counter update operation; however, when verifying a counter read from
off-chip NVM, the BMT operation is different. To keep the root persistence consistent with
counter persistence, a BMT update chain does not end until the root level is updated and persisted. In
contrast, when verifying a counter, the verification chain can terminate early when a BMT node
at a certain level is cached, since the cached node is trusted and can be safely used as a “root”. As
shown in Figure 4.4 (c), when updating a counter, the counter and all upper-level BMT nodes (C4, B2, A1, R)
need to be updated regardless of the caching status of each node. In contrast, when reading and verifying
a counter, only C4 is read from NVM and verified, while B2 is already cached and trusted, and
consequently the higher-level nodes A1 and R are not used. Thus, when reading and verifying a counter
from off-chip NVM, the verification chain ends with either a cached BMT node or the root node.
Therefore, when designing a dataflow BMT architecture, the difference between the two BMT operations needs
to be considered; otherwise the latency of one of the operations will be sacrificed. For example, to
combine counter verification and counter update within the same framework, a naive way is to
extend the counter verification chain such that it ends with the root node, the same as
a BMT update. However, since BMT nodes above the cached level do not need to be read and
verified, this simple combination sacrifices BMT verification latency in exchange for a uniform datapath.
As such, a uniform framework that combines both counter verification and counter update is
desired, one that can self-adapt to the two types of operations to simultaneously achieve
the shortest latency for each operation.
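The difference between the two chains can be sketched on the path of Figure 4.4 (c). This is an illustrative model only: the path is given as a leaf-to-root list of node names and the cached set is supplied by the caller.

```python
def update_chain(path):
    """An update touches every node on the leaf-to-root path."""
    return list(path)


def verification_chain(path, cached):
    """A verification stops at the first cached node, or at the root."""
    chain = []
    for node in path:
        chain.append(node)
        if node in cached or node == path[-1]:
            break  # a cached node (or the root) serves as the trusted anchor
    return chain


path = ["C4", "B2", "A1", "R"]  # leaf-to-root path from Figure 4.4 (c)
```

With B2 cached, the verification chain covers only C4 and its trusted parent B2, whereas the update chain still runs through A1 up to R.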

Methodology

Aiming at the three challenges, HERMES proposed three effective hardware solutions, such that
the proposed high-throughput dataflow architecture (1) strictly follows the security rules, (2) efficiently
leverages the BMT cache, and (3) combines both counter verification and update into a
uniform framework while adding only negligible overhead to the verification latency.
Solution 1: Speculative execution. Traditional BMT schemes designed for von Neumann architectures
heavily utilize speculative execution to reduce memory verification overhead. However,
naively porting the speculative execution concept from sequentially executing CPUs to spatial computing
devices like FPGAs adds significant complexity to the hardware design. As such, HERMES
proposed a simplified and hardware-efficient speculative execution scheme specific to the BMT
[Figure 4.5 diagrams. (a) Update ctr with a speculative buffer (SB) and Parallel Verification: 1. Read B2 from SB, 2. Push to PV, 3. Merge update, 4. Send new hash, 5. Buffer; the node is persisted to NVM and a rsp is sent if verification succeeds. (b) Update ctr: root level always hit, ctr level always miss, intermediate levels depend on cache and SB (SB hit, cache hit, miss). (c) Verify ctr: the counter level sends <VERI, 0, addr>; a cached/buffered node will not propagate upwards when verifying counters.]

Figure 4.5: (a) Solution #1: A speculative buffer (SB) and a parallel verification module (PV) are
added to verify all levels in parallel and to speculatively buffer unverified BMT nodes. (b) Solution
#2: PV adaptively skips parallel verification if all levels are either cached or pre-buffered in SB. (c)
Solution #3: Each level propagates the operation and the caching status of the current level to the upper level
such that a verification chain can terminate early when a node is cached. PV adaptively adjusts its
verification logic according to the operation and the caching status of the levels.

update operation as shown in Figure 4.5 (a). A separate speculative buffer (SB) is associated with
each BMT level, and a parallel verification module is designed to verify all BMT levels in parallel.
Compared to Figure 4.4 (a), three extra steps are added to each level. When a level receives an
update sent from the immediate lower level, the BMT node to update is first searched for within the
speculative buffer. If the node exists in SB, SB directly returns the BMT node to the logic; otherwise the node
is read from NVM. If the node is read from NVM, the returned node is pushed into the parallel
verification unit (PV). Meanwhile, the incoming update is merged into the BMT node and the merged
node is speculatively buffered in SB. A hash value calculated over the updated BMT node is sent
to the upper level. PV is connected with all BMT levels, the root level, and the counter level, using separate
channels. PV verifies all the unverified BMT nodes in parallel, since the root level is connected with
PV and the root node can always be used as a verified reference. PV instantiates either parallel hash
units to compute the hash values of all levels in parallel or a single pipelined hash unit that is reused
to compute the hash values of all levels in flight. Even though some nodes are already
cached and need not be verified, verifying an unnecessary level adds no performance penalty
because the verification is done in parallel. When PV completes one verification (all BMT levels
are verified in parallel), a response is issued to each SB to notify it that one temporarily buffered
updated BMT node can be persisted in NVM since the verification succeeded. Upon receiving a
response, SB pops a buffered BMT node and writes it back to NVM. Temporarily buffering
unverified updated nodes in SB does not threaten the security model, since an unverified updated
node is never directly persisted in NVM. Also, by searching for the
desired node in SB before reading from NVM, the BMT node used is always the most recent copy,
such that write-after-write (WAW) conflicts are avoided. SB is implemented as a circular queue
and each successful verification pops the first entry; therefore, when a verification fails under an
adversarial attack, SB is blocked by the first unverifiable entry and consequently hangs the memory
subsystem, which is the desired behavior for a secure memory subsystem. Taking Figure 4.5
(a) as an example, when B2 receives an incoming update from an unverified C4, B2 is first looked up
in SB and the buffered B2 is returned. The returned B2 is pushed into PV while a new B2'
merging the update is appended to SB. Meanwhile, a new hash value H(B2') is sent to A1. When
PV completes one parallel verification, the oldest entry in SB, B2, is popped and persisted back to
NVM.
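The buffering discipline described above can be modelled in a few lines of software. The sketch below is a behavioural illustration rather than the hardware design: NVM is stood in for by a dictionary, the class and method names are hypothetical, and a failed verification is modelled simply by never calling `commit_one`.

```python
from collections import deque


class SpeculativeBuffer:
    """FIFO of speculatively updated nodes for one BMT level."""

    def __init__(self, nvm):
        self.nvm = nvm        # dict: node index -> node bytes (stand-in for NVM)
        self.queue = deque()  # (index, node) entries in update order

    def read(self, idx):
        """Return (node, sb_hit); the newest buffered copy wins over NVM."""
        for i, node in reversed(self.queue):
            if i == idx:
                return node, True
        return self.nvm[idx], False

    def push(self, idx, node):
        """Buffer an updated but not yet verified node."""
        self.queue.append((idx, node))

    def commit_one(self):
        """On a PV success response, persist the oldest buffered node."""
        idx, node = self.queue.popleft()
        self.nvm[idx] = node


nvm = {2: b"B2-old"}
sb = SpeculativeBuffer(nvm)
```

Because entries leave the buffer only through `commit_one`, nothing unverified reaches NVM, and scanning from the newest entry ensures later updates build on the most recent copy.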
Solution 2: Cache-hit-aware parallel verification. As mentioned in Challenge 2, when caching
is integrated with BMT, the caching status (cache hit status) of the current level and lower levels decides
the behavior of each level. To accommodate this behavior difference, the verification logic of PV
depends on the caching status of the inputs from all the channels. When a level sends an unverified
node to PV, a flag signal indicating the caching status of the BMT node is sent along with the data. Due
to Solution 1, each level is associated with three levels of memory hierarchy: SB, cache, and NVM, with
SB the highest. When persisting a counter, the counter level does not need to be verified,
thus counter level inputs are ignored by PV during the counter update process. The root is stored
in an on-chip non-volatile register and is always trusted, thus the root level always issues a hit signal
to PV. Except for the root level and the counter level, verification of all internal BMT levels depends
on caching status. When a BMT node is either buffered in SB or cached, the hit flag is asserted;
otherwise the flag is deasserted. Note that a node buffered in SB can be safely marked as hit
due to the speculative execution in Solution 1. As mentioned in Table 4.1, during the counter
update, a verification is required only when either the current level is uncached or at least one of
the lower levels is uncached. Because PV verifies in parallel, verifying a single level
and verifying all levels make a negligible performance difference, so PV's logic
is designed as shown at Lines 6-15 in Algorithm 4. Essentially, during the counter update, when
all BMT levels hit, PV skips verification and directly returns a response to all levels, whereas when
at least one BMT level misses, PV verifies all levels in parallel before sending a response.
Leveraging the cache-hit-aware parallel verification, HERMES achieves the shortest latency for
each update chain regardless of the caching status at each level.
Algorithm 4 Parallel verification algorithm
 1: L: BMT height (excluding counter level and root level)
 2: M: verification mask for all levels (including counter and root)
 3: H: cache hit flag of all levels (including counter and root)
 4: while Counter channel is not empty do
 5:   if Counter update request then
 6:     D ← Read data from all channels
 7:     O ← Read offset from all channels
 8:     if H all hit then
 9:       return Success
10:     else
11:       for l ← L + 1 to 0 do
12:         M_l ← not H_l
13:       end for
14:       return VERIFICATION(D, O, M)
15:     end if
16:   else if Counter verification request then
17:     D_{L+1} ← Read counter
18:     M_{L+1} ← 1
19:     for l ← L to 0 do
20:       if H_l ≠ 1 then
21:         D_l ← Read data from channel l
22:         O_l ← Read offset from channel l
23:         M_l ← 1
24:       else
25:         M_l ← 0
26:       end if
27:     end for
28:     return VERIFICATION(D, O, M)
29:   end if
30: end while
31:
32: function VERIFICATION(D, O, M)
33:   for l ← L+1 to 1 parallel do
34:     if M_l and H(D_l) ≠ D_{l−1}(O_{l−1}) then
35:       return Error
36:     end if
37:   end for
38:   return Success
39: end function
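A software rendering of the counter update branch (lines 5-15) and the VERIFICATION function may help in reading the pseudocode. The sketch makes several assumptions that are not stated in the algorithm: the 64B->8B hash is modelled as a truncated SHA-1, each node is a 64-byte string of eight 8-byte hash slots, level 0 is the root, and the per-level loop runs sequentially here whereas the hardware evaluates it in parallel.

```python
import hashlib


def h8(node: bytes) -> bytes:
    """Assumed 64B -> 8B hash: SHA-1 truncated to 8 bytes."""
    return hashlib.sha1(node).digest()[:8]


def verification(data, offsets, mask):
    """Check each masked level l against slot offsets[l-1] of its parent."""
    for l in range(len(data) - 1, 0, -1):
        if not mask[l]:
            continue
        slot = offsets[l - 1]
        if h8(data[l]) != data[l - 1][8 * slot: 8 * slot + 8]:
            return False
    return True


def pv_update(data, offsets, hit):
    """Counter update branch: skip if everything hit, else verify missed levels."""
    if all(hit):
        return True
    mask = [not h for h in hit]
    return verification(data, offsets, mask)


# Tiny 3-level example: root (level 0), one BMT node (level 1), counter block (level 2).
ctr = b"\x01" * 64
node = bytes(24) + h8(ctr) + bytes(32)  # hash of ctr stored in slot 3
root = h8(node) + bytes(56)             # hash of node stored in slot 0
offsets = [0, 3, 0]                     # offsets[l]: slot in level l holding the child hash
```

Passing a modified counter block with the same parent data makes `pv_update` return False, unless every level reports a hit, in which case the check is skipped as in line 8.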

Solution 3: Operation-aware message passing. As discussed in Challenge 3, when integrated
with the BMT cache (and SB), there is a significant logic difference between verifying a counter and
updating a counter. In the dataflow architecture, a higher level is only dependent on the immediate
lower level, such that a message indicating the operation (counter update or verification) must be
propagated between levels. As shown in Figure 4.5 (c), HERMES defines the message format passed
between neighboring levels as a tuple <op, hit, addr>, where op indicates whether the current level is within a
counter update chain (UPD) or a verification chain (VERI), and hit indicates the cache hit status
of the lower level. The op is propagated from the counter level, which issues VERI when
reading a counter from NVM and UPD when persisting a counter to NVM. By passing
the hit status of the current level to the immediate upper level, the upper level can terminate the
verification chain early when a lower level is cached, since a cached level can safely be used as a “root” and
thus there is no need to verify upper levels. In contrast with updating a counter, where all levels push
nodes into PV channels, when a verification chain terminates early, levels above a cached level
do not push nodes into channels, such that PV needs to distinguish between updating a counter
and verifying a counter. Consequently, the parallel verification logic is extended from the logic
designed in Solution 2 to deal with the early termination of a verification chain. As shown in
Figure 4.5 (c), the counter level pushes a counter operation (VERI or UPD) flag into the counter channel
on each counter request. Each BMT level pushes its node to a PV channel only when an incoming
message passed from the immediate lower level indicates that the lower level is uncached (neither cached
nor buffered in SB). Note that whether the current level node is pushed to the channel depends on the
cache hit status of the immediate lower level instead of the current level. The reasoning here is that even
though the current level is cached, the level still needs to push its node into PV in order to verify the
immediate lower level if the lower level is uncached. The extension of PV's logic is shown at
lines 17-28 in Algorithm 4. PV first queries the counter channel to decide the current operation, and if
the current operation is to verify a counter, PV reads each input channel in order starting from
the leaf level. When an entry read from a channel at a certain level is marked cache hit or SB hit, PV
stops reading upper levels and starts verifying all the levels in parallel. A bit mask indicating whether
each level should be verified is updated when reading each channel and used to mask the result
of the parallel verification. Leveraging this mechanism, both counter update and verification are
integrated within a unified framework while achieving the minimal latency for both operations.
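The per-level decision can be sketched as a small message handler. This is a simplified behavioural model under assumptions: the names `Msg`, `mt_stage` and `walk` are hypothetical, the address is forwarded unchanged for brevity, and the counter level is taken to report a miss in its initial message, as in the <VERI, 0, addr> example.

```python
from typing import NamedTuple


class Msg(NamedTuple):
    op: str    # "UPD" or "VERI"
    hit: bool  # cache/SB hit status of the sending (lower) level
    addr: int


def mt_stage(msg, current_hit):
    """Return (push_to_pv, outgoing message or None) for one BMT level."""
    if msg.op == "VERI" and msg.hit:
        return False, None  # lower level is trusted: chain ends, nothing pushed
    return True, Msg(msg.op, current_hit, msg.addr)


def walk(op, hits, addr=0):
    """Run a request through levels listed bottom-up; return levels that pushed to PV."""
    msg = Msg(op, False, addr)
    pushed = []
    for level, hit in enumerate(hits):
        push, msg = mt_stage(msg, hit)
        if push:
            pushed.append(level)
        if msg is None:
            break
    return pushed
```

With the lowest BMT level cached, `walk("VERI", [True, False, False])` pushes only that level into PV, whereas an update with the same hit pattern pushes every level.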
Conclusion. Aiming at the three design challenges studied in Section 4, HERMES proposed (1) a
speculative buffer to speculatively pipeline counter updates between levels, improving throughput
without violating the threat model's requirements; (2) a parallel verification (PV) hardware module
and an adaptive algorithm that adapt to the different caching statuses of BMT levels while always
achieving the minimal latency; and (3) a uniform message format passed between levels to accommodate
the early termination of BMT upward traversal when verifying a counter, together with an extension to
the PV algorithm that adapts to both operations, such that both counter update and counter
verification can leverage the same dataflow architecture while achieving the shortest latency.

Architecture Design And Hardware Implementation

System Overview
As shown in Figure 4.6, HERMES consists of a single counter stage, multiple MT stages, and a single
root stage. Each stage is associated with one level in a BMT, and each level receives incoming
messages from the immediate lower level and generates messages to push to the upper level. Whenever a
level pushes a message to the upper level, the current level can immediately process the next message
without waiting for completion of the whole update chain or verification chain. The parallel verification
unit (PV) is connected with all stages through separate channels. Whenever a verification completes
and succeeds, PV sends a response signal to all stages asynchronously to signal that one speculative
execution can be committed. If verification fails, PV hangs such that subsequent verification
requests will not be processed anymore, which eventually hangs the whole BMT subsystem; this
is the behavior normally desired when using secure memory.

[Figure 4.6 block diagram. Usr Logic (Memory Pattern Generator) issues Usr Req and receives Usr Rsp. HERMES contains the Ctr Stage, MTL-1 Stage, MTL-2 Stage, ..., MT1 Stage, MT0 Stage and Root Stage, each connected to the Parallel Verification Unit (SHA1) through PV Req and PV Rsp channels. Ctr Stg Logic keeps an ID Table of <id,op,data> entries. Each MT Stg Logic has an SB, a BMT cache, a SHA1 engine and NVM access. Root Stg Logic has a Root SB. Messages between levels: <hit,op,idx,offset,upd>. Messages to PV: <hit,offset,data>.]

Figure 4.6: HERMES system overview. By decoupling stages using data buffers, stages work
independently and asynchronously. The system is elastic such that MT stages can be flexibly
extended when processing a Bonsai Merkle tree of arbitrary height.

The message format passed from a
lower level to a parent level is (curr_hit, op, next_idx, next_off, update). curr_hit and op indicate
the cache hit status of the child level and which operation is being executed (either verifying a counter
or updating a counter), such that the parent level selects its logic accordingly. next_idx and
next_off are calculated at the child level and indicate the index of the desired parent node at the parent
level and the offset of the desired 8-byte word within the parent node. Each MT stage mainly consists of an
MT stage logic, a SHA1 engine, a speculative buffer (SB), and a BMT cache, and each stage is
decoupled from other components through buffers. The root SB is different from the BMT SB,
since the root node is a single register with a pre-defined address and thus there is no need to store
metadata. The counter stage does not need an SB since all the counters must be immediately persisted into
NVM. However, the TEE requires that a counter response signaling integrity be sent to the upper logic
only when the counter update or counter verification completes; thus, to process multiple
outstanding counter requests in flight, an ID table is deployed in the counter stage with an entry format of
(id, op, data). Similar to SB, when a counter request is received, its metadata is registered in the ID
table, and when a response is asynchronously received from PV, the first entry registered in the table
is popped and returned to the upper logic. In a real secure memory subsystem, there are other
components ahead of the BMT, such as AES encryption and HMAC, even though the BMT has already been
shown to bottleneck system performance. However, to isolate the performance improvement of the
HERMES BMT, a highly flexible memory pattern generator was used to emulate the counter requests
generated in a real system, such that we can collect the peak BMT performance.

[Figure: circular speculative buffer holding entries <op, dirty, data, idx>, with rd_ptr and wr_ptr pointers and interfaces to PV, stage logic, and cache; the SB is scanned in reverse order on lookup.]

Figure 4.7: Speculative buffer. V: dirty, U: update, CL: clean, DT: dirty, D: data, I: index of the 64-byte BMT node at current level.

Under the strict persistence model, BMT updates must be persisted in the order of the corresponding counter updates, thus SB is implemented as a circular buffer, where rd_ptr points at the next entry to pop and wr_ptr points at the next entry to fill. As shown in Figure 4.7, each entry consists of four fields: a 1-bit operation flag, a 1-bit dirty flag, 64-byte data, and a K-bit index. To reduce storage overhead, HERMES saves the index of the BMT node at the current level instead of a 64-bit address. Due to the BMT structure, the maximum number of BMT nodes differs from level to level, and so does the index width needed by the SB at each level. For simplicity, we size the index width K for the leaf level, which contains the largest number of BMT nodes. Consequently, given an 8-ary BMT of L levels, the leaf level contains $8^{L-1}$ BMT nodes and the index width is set as $K = \log_2(8^{L-1}) = 3 \times (L - 1)$. For a 7-level BMT protecting a data range of 8GB, K = 18 and memory usage is reduced by 8% when using the index. The address is calculated later, when an entry is written back to the BMT cache. When stage logic searches for a BMT node within SB, the currently filled entries (those between rd_ptr and wr_ptr) are scanned in reverse order starting from location (wr_ptr − 1). The index of each entry is compared with the index being searched; on a match, a hit signal and the hit entry are returned to stage logic. The entries can be compared either all in parallel or one by one, which involves a performance trade-off. Comparing all entries in parallel generates a large hierarchical MUX, resulting in high hardware usage (a MUX area usage is almost the same as ) and a small timing slack, although the comparison takes a single clock cycle. Searching entries sequentially results in compact hardware but adds extra clock cycles to stage logic when a deep SB is used. Due to the high spatial locality of BMT node accesses (e.g. each leaf BMT node protects a 32KB data range), the same BMT node at each level is likely to be accessed repeatedly, so when SB is scanned in reverse the first entry is likely the one searched. Therefore, parallel comparison is unlikely to improve performance and a shallow SB with sequential comparison is sufficient, which is what HERMES adopts. On the write side, wr_ptr is incremented each time a speculatively used BMT node is pushed. On the read side, when PV signals that a verification has completed and succeeded, the entry pointed to by rd_ptr is popped and written back to the BMT cache.
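The buffer behavior described above can be sketched in software as follows. This is only an illustrative model, not the HERMES RTL: the class and function names (`SpeculativeBuffer`, `index_width_bits`, `entry_saving`) are hypothetical, and the saving estimate assumes an entry is exactly 2 flag bits + 512 data bits + index (or 64-bit address), which appears consistent with the 8% figure quoted in the text.

```python
def index_width_bits(levels, arity=8):
    """Bits needed to index any node of the leaf level of an arity-ary tree."""
    leaf_nodes = arity ** (levels - 1)
    return (leaf_nodes - 1).bit_length()


def entry_saving(levels, addr_bits=64, data_bits=512, flag_bits=2):
    """Fraction of SB entry storage saved by storing an index instead of an address."""
    with_addr = flag_bits + data_bits + addr_bits
    with_index = flag_bits + data_bits + index_width_bits(levels)
    return 1 - with_index / with_addr


class SpeculativeBuffer:
    """Circular buffer with in-order commit and newest-first lookup."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = [None] * depth
        self.rd_ptr = 0   # next entry to pop
        self.wr_ptr = 0   # next entry to fill
        self.count = 0

    def push(self, op, dirty, data, idx):
        if self.count == self.depth:
            raise BufferError("speculative buffer is full")
        self.entries[self.wr_ptr] = (op, dirty, data, idx)
        self.wr_ptr = (self.wr_ptr + 1) % self.depth
        self.count += 1

    def lookup(self, idx):
        """Sequential reverse scan starting at wr_ptr - 1; newest match wins."""
        for step in range(1, self.count + 1):
            entry = self.entries[(self.wr_ptr - step) % self.depth]
            if entry[3] == idx:
                return entry
        return None

    def commit(self):
        """Pop the oldest entry after PV reports a successful verification."""
        if self.count == 0:
            raise BufferError("speculative buffer is empty")
        entry = self.entries[self.rd_ptr]
        self.rd_ptr = (self.rd_ptr + 1) % self.depth
        self.count -= 1
        return entry
```

The reverse scan returns the most recent speculative copy of a node, while `commit` always releases the oldest entry, which preserves the persistence order required by the strict persistence model.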

Parallel Verification Unit
As described in Algorithm 4, PV adapts the verification process depending on the operation flag received on the counter channel and the hit flags received on the BMT channels. To verify all levels in parallel, either multiple SHA1 instances can be used or a single pipelined SHA1 module can be reused; our current implementation chooses the latter because it uses less hardware than full parallelization. SHA1 has been shown to be theoretically attackable even though it is still widely adopted, thus HERMES modularizes its hashing units so that other, more secure or more efficient, hash implementations can be plugged in.

Stage Logic
Logic of each stage is shown in Algorithm 5. As shown in Figure 4.6, stages are decoupled from each other by FIFOs, and the logic at each level is determined only by the status of the current level and a message passed from the child level. Among the three types of stages, i.e. counter stage, MT stage, and root stage, the MT stage is the most typical and the other two differ from it only slightly, thus we explain the MT stage logic in detail. When a request is received from the message buffer connected to the child level, the operation flag op and the cache hit flag hit are checked first. When op = 0 and hit = 1, i.e. the previous stage (child level) hit either in SB or in cache while a counter verification chain is being executed, the current stage has nothing to process and simply passes the cache hit status to the higher level. Note that when a desired BMT node is currently buffered in SB, we also mark this node as a cached node, since speculative execution allows it to be used safely. When op = 1, i.e. a counter update chain is being executed, the required BMT node bmt is read from SB if it hits there, otherwise from the BMT cache. The desired 8 bytes within the 64-byte cacheline are replaced with upd sent from the lower level, generating a new updated BMT node bmt′. A new hash value H = hash(bmt′) is computed over bmt′. H and other information are sent to the upper level logic, while bmt′ is appended to SB and marked as dirty since it holds a new value. A tuple containing cache hit information, pre-update BMT value, and word offset is pushed into the PV channel so that the node can be verified if it was retrieved from untrusted NVM. When op = 0 and hit = 0, i.e. the previous stage (child level) missed both in SB and in cache while a counter verification chain is being executed, the process is similar to the update process except that no hashing is needed. We refer readers to Algorithm 5 and Figure 4.6 for the differences in instructions and hardware structures among stages.

Algorithm 5 Logic at each stage
function COUNTER STAGE
    while 1 do
        if there is a counter request then
            Calculate index and offset of stored hash value at parent level
            if read counter then
                Read ctr from NVM
                Register (req_id, ctr, op = 0) to id table
                Push (op = 0, ctr) to PV
                Push (op = 0, hit = 0, idx, off, upd = 0) to upper level
            else
                Write ctr to NVM
                Register (req_id, ctr = 0, op = 1) to id table
                Push (op = 1, ctr) to PV
                Compute hash H ← hash(ctr)
                Push (op = 1, hit = 0, idx, off, upd = H) to upper level
            end if
        end if
    end while
end function
function MT STAGE
    while 1 do
        if there is an MT request from lower level then
            Read message from lower level (op, hit, idx, off, update)
            Calculate index and offset of stored hash value at parent level
            if op = 0 then
                if hit = 1 then
                    Push (op = 0, hit = 1, idx = 0, off = 0, upd = 0) to upper level
                    Done
                end if
                Read bmt from SB or BMT cache
                hit ← cache hit or SB hit
                Push (op = 0, hit = hit, idx, off, update = 0) to upper level
                Push (hit = 0, off, data = bmt) to PV
                if hit = 0 then
                    Register (op = 0, idx, dirty = 0, bmt) to SB
                end if
            else if op = 1 then
                Read bmt from SB or BMT cache
                hit ← cache hit or SB hit
                bmt′ = upd(bmt, update)
                Compute hash H ← hash(bmt′)
                Push (op = 1, hit = hit, idx, off, update = H) to upper level
                Register (op = 1, idx, dirty = 1, bmt′) to SB
                Push (op = 1, hit = hit, off, data = bmt′) to PV
            end if
        end if
    end while
end function
function ROOT STAGE
    while 1 do
        if there is an MT request from lower level then
            Read message from lower level (op, hit, idx, off, update)
            if op = 0 then
                if hit = 1 then
                    Done
                end if
                Read root from SB or root register
                Push root to PV
            else if op = 1 then
                Read root from SB or root register
                Push update to SB
                Push root to PV
            end if
        end if
    end while
end function
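As a rough software analogue of the MT stage in Algorithm 5, the sketch below handles one incoming message. Several details are assumptions made for illustration and are not taken from the HERMES source: the parent index and offset are derived as `idx // 8` and `idx % 8`, the 64-bit hash is a truncated SHA-1 digest, and the SB, cache, and NVM are plain Python containers. The helper names (`hash64`, `read_node`, `mt_stage_step`) are hypothetical.

```python
import hashlib

NODE_BYTES, WORD_BYTES, ARITY = 64, 8, 8


def hash64(block):
    """Assumed 64-bit hash: SHA-1 digest truncated to 8 bytes."""
    return hashlib.sha1(block).digest()[:WORD_BYTES]


def read_node(idx, sb, cache, nvm):
    """Return (node, hit). An SB hit counts as a cache hit, newest entry first."""
    for entry in reversed(sb):
        if entry["idx"] == idx:
            return entry["data"], 1
    if idx in cache:
        return cache[idx], 1
    return nvm.get(idx, bytes(NODE_BYTES)), 0


def mt_stage_step(msg, cache, sb, nvm):
    """Process one message; return (message to upper level, message to PV)."""
    op, child_hit, idx, off, update = msg
    parent_idx, parent_off = idx // ARITY, idx % ARITY  # assumed mapping
    if op == 0 and child_hit == 1:
        # Verification chain already ended below: only forward the hit status.
        return (0, 1, 0, 0, b""), None
    node, hit = read_node(idx, sb, cache, nvm)
    if op == 0:
        if hit == 0:
            sb.append({"op": 0, "idx": idx, "dirty": 0, "data": node})
        return (0, hit, parent_idx, parent_off, b""), (0, off, node)
    # op == 1: patch the 8-byte word, rehash, and buffer the node speculatively.
    start = off * WORD_BYTES
    new_node = node[:start] + update + node[start + WORD_BYTES:]
    digest = hash64(new_node)
    sb.append({"op": 1, "idx": idx, "dirty": 1, "data": new_node})
    return (1, hit, parent_idx, parent_off, digest), (1, hit, off, new_node)
```

Each call consumes exactly one message and produces at most one message per output channel, mirroring how a stage depends only on its own state and on the message received from the child level.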

Benchmark
Throughout this work, we focus only on BMT performance, including both latency and throughput, since BMT bottlenecks memory performance, and we ignore other components such as AES encryption, HMAC, and ECC. To isolate BMT performance, we used a memory pattern generator to generate counter requests following a repetitive sequential traversal (RST) pattern, which is commonly used in memory bandwidth measurement (Wang et al., 2020). We designed a parameterized benchmark module to generate counter requests of an arbitrary, configurable RST pattern. Similar to the Shuhai benchmark (Wang et al., 2020), the benchmark module generates multiple outstanding counter requests to stress-test the maximal throughput of BMT. The latency of each counter request is also measured and the average latency is reported. Configurable parameters of the benchmark include initial address A, number of transactions N rT, working size W, stride S, and burst size B. Unlike other works, HERMES fixes B to 64 bytes, since in a real secure memory subsystem BMT is connected to a counter cache and it is rare for more than one consecutive counter persistence or counter verification to be generated in a single burst. To warm up the BMT cache, the benchmark pre-runs a pattern once before collecting performance, to make sure the BMT cache is filled and steady.
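For illustration, an RST address stream can be generated as below. The text does not spell out the address rule, so the sketch assumes the i-th request targets `A + (i * S) mod W`, which is how the RST pattern is commonly defined; the function name `rst_addresses` is hypothetical.

```python
def rst_addresses(a, n, w, s):
    """Yield n request addresses that repeatedly sweep a working set of w bytes."""
    for i in range(n):
        # Assumed RST rule: wrap the running offset inside the working size.
        yield a + (i * s) % w


if __name__ == "__main__":
    for addr in rst_addresses(a=0x1000, n=6, w=256, s=64):
        print(hex(addr))
```

With a 256-byte working set and a 64-byte stride, the stream revisits the first address after four requests, which is what makes the traversal repetitive.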

Experiment And Result Analysis

Experiment Setup
HERMES was implemented and tested on a Xilinx U200 FPGA acceleration card equipped with four 16GB DDR4 memories. The SHA1 module was developed with Vivado HLS, while other components were synthesized using Vivado 2020.2. In the experiment, we compared against ARES as a baseline to verify effectiveness, and against an insecure system (INSECURE) to study the performance overhead. Similar to Shuhai (Wang et al., 2020), we benchmarked counter access performance with a widely used memory access pattern, Repetitive Sequential Traversal (RST), as mentioned in Section 4. In this work, we focused only on improving BMT performance, so BMT was tested separately. Each benchmark is run for ten iterations and the statistics collected in the first iteration are discarded: all BMT caches are empty at the first run and the performance is consequently very low, therefore the first iteration is used to warm up the BMT caches. The performance impact on a secure system depends not only on BMT but also on other components such as AES, counter cache, and HMAC. System-level optimization will be discussed in future work. We evaluated both HERMES and ARES with a 7-level BMT covering a data region of 8GB. The original ARES work studied only a 5-level BMT covering a 128MB data region, so we customized the open-sourced code. There is a separate BMT cache for each level, and we allocated cache according to Table 4.2, such that the total cache size (260KB) is almost the same as the size of the unified BMT cache used in other related works (256KB) (Zubair and Awad, 2019; Ye et al., 2018). ARES was adapted to follow the same cache allocation strategy. In both the HERMES and ARES designs, all hardware modules are connected using AXI interconnect, and because of the timing requirements of the complicated AXI interconnect topology, the clock frequency of both designs is limited to 100MHz. We believe this clock frequency can be improved further by replacing the AXI interconnect with RTL connections and by using more dedicated timing constraints.
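The metadata sizes listed in Table 4.2 can be cross-checked with a few lines of arithmetic. The sketch below assumes 64-byte counter blocks that each cover 64 data cachelines (4KB), an 8-ary tree with 7 levels of 64-byte nodes, and binary units (1MB = 2^20 bytes); the variable names are illustrative only.

```python
LINE_BYTES = 64            # cacheline, counter block, and BMT node size
LINES_PER_COUNTER = 64     # one counter block covers 64 data cachelines (4KB)
ARITY, LEVELS = 8, 7
DATA_BYTES = 8 * 2**30     # 8GB protected region

counter_blocks = DATA_BYTES // (LINE_BYTES * LINES_PER_COUNTER)
counter_bytes = counter_blocks * LINE_BYTES

# Nodes per tree level, from a single top node down to 8^(LEVELS-1) leaf nodes.
bmt_nodes = sum(ARITY**level for level in range(LEVELS))
bmt_bytes = bmt_nodes * LINE_BYTES

# Data range protected by one leaf node: 8 counter blocks of 4KB each.
leaf_coverage = ARITY * LINE_BYTES * LINES_PER_COUNTER

# Per-level cache sizes from Table 4.2, in bytes.
cache_bytes = [128 * 1024, 64 * 1024, 32 * 1024, 32 * 1024, 4 * 1024, 512, 128]

if __name__ == "__main__":
    print(counter_bytes / 2**20, "MB of counters")
    print(round(bmt_bytes / 2**20, 1), "MB of BMT nodes")
    print(leaf_coverage // 1024, "KB covered per leaf node")
    print(sum(cache_bytes) / 1024, "KB of BMT cache in total")
```

Under these assumptions the counters occupy 128MB, the tree nodes about 18.3MB, and the per-level caches add up to roughly 260KB.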

Resource Utilization
Resource utilization of the different components used in HERMES after placement and routing is shown in Figure 4.8 and Table 4.3. In ARES, the core BMT logic was written using Vivado HLS, and the tool consumed a large amount of BRAM resource when synthesizing AXI interfaces even though no memory buffer is used in the code. In contrast, the core logic of HERMES was developed at RTL level, such that only the speculative buffers consume BRAM resource while the AXI interfaces do not. Based on the reported resource utilization, we calculated the total resource utilization for each of the three implementations, ARES (ARES Core+Cache+DDR), HERMES (HERMES Core+Cache+DDR), and INSECURE (DDR), as shown in Figure 4.9. HERMES consumes 2x the LUTs and 1.7x the FFs of ARES, and 12.6x the LUTs, 12.2x the FFs, and 3.9x the BRAM of an insecure system. Against an insecure system, the resource utilization overhead is still somewhat excessive, and further techniques, such as combining several top levels given their high coverage, can be explored in the future. Compared with ARES, HERMES greatly improves throughput at the cost of only about 2x resource utilization, which we consider reasonable.

Table 4.2: Cache configuration and metadata setup used in the experiment.

Metadata
# BMT Levels    Data Protected    Counters    BMT Nodes
7               8GB               128MB       18.3MB

BMT Cache Configuration
Level-0    Level-1    Level-2    Level-3    Level-4    Level-5    Level-6    Total
128KB      64KB       32KB       32KB       4KB        512B       128B       260KB

Figure 4.8: Resource layout of HERMES. BMT caches are marked red, pipelined BMT controller is marked blue, benchmark module is marked yellow, and all other components are marked green.

Table 4.3: Resource Utilization of Different Hardware Components

Components        FF                 LUT                 BRAM
HERMES Core       211410 (8.9%)      195991 (16.6%)      30.5 (1.4%)
ARES Core         105233 (4.5%)      72116 (6.1%)        126 (5.8%)
BMT Cache         22726 (1.0%)       17519 (1.5%)        43.5 (2.0%)
Benchmark         1342 (0.1%)        1602 (0.1%)         0 (0.0%)
DDR Controller    20833 (0.9%)       18432 (1.6%)        25.5 (1.2%)

[Figure: bar chart of normalized resource utilization (LUT, FF, BRAM) for ARES, HERMES, and INSECURE; y-axis: Normalized Resource Utilization.]

Figure 4.9: Normalized total resource utilization of different implementations.
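The normalized totals can be reproduced from the per-component numbers of Table 4.3. The short sketch below sums the components of each implementation as described in the text (core + cache + DDR controller, with INSECURE being the DDR controller alone) and divides by the INSECURE total; the dictionary layout and function names are illustrative.

```python
UTIL = {
    "HERMES Core":    {"FF": 211410, "LUT": 195991, "BRAM": 30.5},
    "ARES Core":      {"FF": 105233, "LUT": 72116,  "BRAM": 126.0},
    "BMT Cache":      {"FF": 22726,  "LUT": 17519,  "BRAM": 43.5},
    "DDR Controller": {"FF": 20833,  "LUT": 18432,  "BRAM": 25.5},
}

SYSTEMS = {
    "ARES":     ["ARES Core", "BMT Cache", "DDR Controller"],
    "HERMES":   ["HERMES Core", "BMT Cache", "DDR Controller"],
    "INSECURE": ["DDR Controller"],
}


def total(system, resource):
    """Sum one resource type over all components of an implementation."""
    return sum(UTIL[part][resource] for part in SYSTEMS[system])


def normalized(system, resource):
    """Total utilization relative to the insecure (DDR-only) system."""
    return total(system, resource) / total("INSECURE", resource)


if __name__ == "__main__":
    for resource in ("LUT", "FF", "BRAM"):
        row = {name: round(normalized(name, resource), 1) for name in SYSTEMS}
        print(resource, row)
```

The same totals give the HERMES-to-ARES ratios quoted above: about 2.1x for LUTs and 1.7x for FFs.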

[Figure: four bar-chart panels, "Write: HERMES vs. ARES (Baseline)", "Read: HERMES vs. ARES (Baseline)", "Write: HERMES vs. INSECURE", and "Read: HERMES vs. INSECURE"; y-axis: BW (MB/s); x-axis: access stride from 2^6 to 2^24.]

Figure 4.10: Bandwidth comparison of HERMES, ARES (baseline), and insecure system at different access stride.
Counter Access Bandwidth
We compared the counter access bandwidth of the different implementations, as shown in Figure 4.10. Note that in this work we focus only on peak counter access bandwidth, even though in a real system counter access bandwidth also depends on the memory access pattern since BMT sits behind the processor last level cache (LLC). For instance, when running a graph application, memory accesses are irregular and random, so BMT may be triggered at every data access, whereas when running an application with good data locality, most memory accesses hit in the LLC and BMT is rarely triggered. For each implementation, we ran the benchmark with different counter access strides S. Different access strides cause different BMT cache hit rates at different levels, consequently affecting BMT throughput. For each test, we reported the highest bandwidth achieved among ten runs. Compared with ARES, HERMES improved counter write throughput by up to 5.3x and counter read throughput by up to 7.9x. The improvement mainly comes from efficiently exploited task-level parallelism. Because reading and verifying a counter can stop at the first cached on-path BMT node, without reading and verifying higher nodes, counter read achieved 2x the bandwidth of counter write. A larger counter access stride results in lower locality at the lower BMT levels, which then require fetching from DRAM at every operation, therefore bandwidth is reduced. Compared with INSECURE, HERMES decreases write bandwidth by 28.7 times and read bandwidth by 14.6 times, indicating that there is still room for further improvement. By dedicatedly optimizing the logic at each BMT level/stage, we believe the performance gap can be reduced further, which is one of our future research focuses. When S increases beyond $2^{18}$, INSECURE bandwidth unexpectedly drops by 2.5 times, which means the bandwidth of the DDR4 device itself drops. We suspect this results from the internal memory organization and the bank/row/column layout of the device.

Counter Access Latency
Since each BMT level is associated with a separate BMT cache, the caching status at each level affects the counter access latency. To distinguish the latency under different caching statuses, we generated a stream of memory requests with a pre-defined memory access pattern, adjusting the initial address A of the benchmark module to trigger different caching scenarios, as shown in Figure 4.11. In the figure, MTL indicates that the required BMT nodes at levels lower than L all miss in cache while nodes at levels equal to or higher than L are already cached. The worst scenario is that all BMT levels miss, which is labeled "All Miss" in the figure. As shown in the figure, the write latency of both ARES and HERMES is almost constant across caching scenarios, since in a counter write operation sequential hashing dominates latency and hashing is required at all levels regardless of caching status. However, due to the overhead of message passing between stages, HERMES has a slightly longer latency than ARES. In general, the write latency of HERMES is comparable to that of ARES. Surprisingly, HERMES has a much better read latency than ARES, improving counter read latency by up to 3.5 times. Since ARES splits the whole logic of one BMT verification operation into atomic operations and regroups them into multiple barrier-synchronized stages, performance decreases because some logic cannot be activated immediately. In contrast, HERMES splits the whole BMT into multiple levels, each executed independently and asynchronously, such that synchronization overhead is greatly reduced. Also, ARES was developed and synthesized with a High-Level Synthesis toolset, and we suspect the inefficiency of the compiler-synthesized hardware is another reason for the baseline's low performance. The experiment shows that the principle behind HERMES, which splits the BMT levels into stages, not only increases performance but also reduces the synchronization overhead suffered by the previous work.

[Figure: two bar-chart panels, "Write Latency" and "Read Latency", comparing ARES and HERMES; y-axis: Latency (Clk Cycles); x-axis: caching scenarios All Miss, MT0, MT1, MT2, MT3, MT4, MT5, MT6.]

Figure 4.11: Latency analysis of HERMES and ARES on different caching status. In the figure MTL indicates required BMT nodes at/above level L are already cached, while all the BMT nodes below level L are not cached yet (L is zero-based).

Speculative Buffer Performance

[Figure: "Speculative Buffer Hit Rate at Each Level"; y-axis: SB Hit Rate; x-axis: levels MT6 to MT0; one series per stride from 2^6 to 2^24.]

Figure 4.12: Speculative buffer hit rate at different access strides.

To evaluate the efficiency of the speculative buffers, we tracked the speculative buffer hit rate at each level for different counter access strides in counter write operations. As shown in Figure 4.12, as the counter access stride increases, the speculative buffer hit rate at each level decreases. At the largest access stride, only the speculative buffer at the first level hits while all the other speculative buffers miss. However, even in this extreme case, HERMES still improves the counter write bandwidth by 4 times over ARES. This comparison shows that the speculative buffer does not improve BMT bandwidth mainly by effectively enlarging each BMT cache. Instead, SB improves BMT throughput by upholding the security requirement while enabling aggressive, speculative data-level pipelining of BMT and the processing of multiple in-flight counter requests. In contrast, without speculative execution, ARES has to process counter requests sequentially, such that its performance is extremely low no matter how it is optimized.

BMT Cache Performance
[Figure: two panels, "Counter Write" and "Counter Read"; y-axis: Cache Hit Rate; x-axis: levels MT6 to MT3; series: HERMES and ARES at S = 2^6, 2^9, and 2^12.]

Figure 4.13: Cache hit rate of lower BMT levels at different access strides.

As shown in Figure 4.13, we tracked the cache hit rate at the lower BMT levels of both ARES and HERMES for different counter access strides. Here cache write misses, cache write hits, and cache read hits are all counted as cache hits, since in the current implementation the BMT cache is accessed at cacheline granularity and a write-miss cacheline is put directly into the cache without being read from memory. Across all counter access strides, HERMES achieves almost the same cache hit rate as ARES, with a minor gap. The gap is mainly caused by the fact that some BMT accesses are served by SB, even though caching is not the most important goal of SB. Combined with Figure 4.12, this demonstrates two benefits of SB, caching and securely enabling speculative execution, of which the latter is the main source of the performance improvement over the ARES baseline.

Latency Overhead of BMT on Real Applications
So far we have discussed the peak performance of BMT. However, in most real applications BMT is activated less frequently, since the BMT subsystem sits behind the last level cache (LLC) and the AES counter cache, as indicated in Figure 3.8. When a data block misses in all caches and is read from memory, the associated AES counter is first looked up in the counter cache, and only when the counter cache misses is the counter read from memory and verified by BMT. When a data block is evicted from the LLC and written back to memory, the associated AES counter is incremented and consequently BMT is updated. Each counter cacheline contains 64 minor counters, each associated with a 64B data cacheline, thus each counter cacheline is associated with a 4KB page. As such, in most use cases with high data locality, the peak bandwidth of BMT does not need to be reached. In this section, we set up a mathematical model to project performance on real applications. We assume that a direct-mapped last level cache of $M_d$ bytes and a direct-mapped counter cache of $M_c$ bytes are used and already warmed up. More specifically, we assume the host sequentially reads/writes 4-byte data in consecutive clock cycles. When sequentially reading 4B data with $N$ requests, given that each counter cacheline covers 4KB, i.e. 1K data requests, every $T_w = \frac{M_c \times 1K}{64}$ requests a new counter is fetched from NVM to fill the counter cache. Consequently, to fetch $N$ requests, BMT is activated $\frac{N}{T_w}$ times, each with a counter read latency of $L_{read}$. Compared to an insecure system, BMT increases the total execution time by $\frac{N \times L_{read}}{T_w}$ clock cycles, i.e. the average latency of data accesses is increased by $\frac{L_{read} \times 64}{M_c \times 1K}$ clock cycles. When sequentially writing 4B data with $N$ requests, whenever a data block is evicted from the last level cache, the evicted block is written to NVM and the counter cache is consequently updated, whether or not the counter is cached. Here we assume all counters miss, for simplicity of the estimation. As such, every $T_r = \frac{M_d \times 1K}{2}$ requests a data block is evicted and a counter is updated. Consequently, to issue $N$ data write requests, BMT is activated $\frac{N}{T_r}$ times, each with a counter write latency of $L_{write}$. Compared to an insecure system, BMT increases the total execution time by $\frac{N \times L_{write}}{T_r}$ clock cycles, i.e. the average latency of a data write operation is increased by $\frac{L_{write} \times 64}{M_d \times 2}$ clock cycles. As shown in Figure 4.11, $L_{read}$ and $L_{write}$ are both set to the longest latencies, 319 and 500 clock cycles respectively, and we set $M_d$ and $M_c$ to 512KB and 256KB respectively, as previous works (Zubair and Awad, 2019; Ye et al., 2018) configured in their simulation settings. After calculation, the average latency overheads caused by BMT for both data read and data write operations are close to zero. This projected performance estimate indicates that even though BMT increases counter access latency, BMT activation is so rare that the average data access latency is not significantly affected when the program has high data locality. For complicated and cache-unfriendly applications such as graph analysis, the impact of BMT on data access latency will be more pronounced, and we leave the system-level discussion and optimization to our future work.
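The read-side projection can be evaluated numerically as a sanity check. The sketch below plugs the stated values ($L_{read}$ = 319 cycles, $M_c$ = 256KB) into the activation period $T_w$ defined above; it assumes 1K = 1024 and KB = 1024 bytes, and the function names are hypothetical. The write side can be evaluated in the same way from $T_r$ and $L_{write}$.

```python
def read_activation_period(mc_bytes):
    """T_w: number of 4B read requests between two BMT activations."""
    return mc_bytes * 1024 / 64


def avg_read_overhead(l_read_cycles, mc_bytes):
    """Average extra clock cycles per data read caused by BMT verification."""
    return l_read_cycles / read_activation_period(mc_bytes)


if __name__ == "__main__":
    mc = 256 * 1024  # 256KB counter cache
    print(read_activation_period(mc), "requests per BMT activation")
    print(avg_read_overhead(319, mc), "extra cycles per read on average")
```

Under these assumptions BMT is activated once every 4,194,304 requests, so the amortized cost per read is on the order of 10^-4 clock cycles or less, which is in line with the "close to zero" conclusion.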

Conclusion

In this work, we introduced an efficient hardware architecture for the Bonsai Merkle Tree. The previous hardware implementation, proposed in ARES, processes each BMT request sequentially, which greatly limits BMT bandwidth. Pipelining BMT operations across levels is very challenging: to strictly follow the security requirement of the TCB, each BMT level potentially depends on all other levels, which forbids straightforward level-based pipelining. HERMES proposes a dataflow architecture and a speculative execution scheme such that multiple outstanding BMT operations can be processed simultaneously without violating security rules. Compared with ARES, HERMES improves BMT throughput by up to 7.9 times and reduces latency by up to 3.5 times, at the cost of about 2 times the resource utilization.

APPENDIX : PERFORMANCE MODELING OF ARES

In this section, we develop an analytical performance model for various metrics of ARES and its baseline, such as memory I/O latency, parallel counter recovery time, and parallel Merkle tree rebuilding speed. This modelling framework can be used not only to predict the performance of ARES on a real NVM device, but also to supply design insights for future improvements of ARES systems with next-generation NVM devices.

Memory Latency

Let the latency of a hash unit be $T_{hash}$ and the memory read and write bandwidths be $BW_{memrd}$ and $BW_{memwr}$, respectively. If a Merkle tree (MT) block is cached within the Merkle tree cache, the latency of the cache access is $T_{mt\_hit}$. When reading a counter block that misses in the counter cache, a verification consisting of three steps is required. First, the Merkle tree node at each level on the branch is read from the associated Merkle tree cache. Here we assume the lowest hit level to be $L$ and all the higher levels hit in the cache, hence to verify a counter block in a Merkle tree of height $H$, $H - L$ Merkle tree blocks must be returned directly from memory, which consumes:

$$T_{mtrd} = T_{memrd} + \max\left(\frac{B \times (H - L)}{BW_{memrd}},\ T_{mt\_hit}\right) \approx T_{memrd} + \frac{B \times (H - L)}{BW_{memrd}}. \tag{.1}$$

After receiving all the upper-level MT nodes, all the MT levels are hashed in a pipeline sharing the same hash unit. The counter and all the MT nodes must be hashed, so $H + 1$ hashing operations are required. The hash unit is fully pipelined such that a new input can be processed every clock cycle, and consequently the time for the parallel verification is $T^{read}_{mtverify} = T_{hash} + H \times 1 \approx T_{hash}$. After the pipelined verification, all the MT blocks returned directly from memory need to be written back into the cache. Without considering evictions, the write time of this cache can be written as $T_{mtwr} = T_{mt\_hit}$. If the above three processes do not overlap with each other, the total time of reading and verifying a counter value from its cache is:

$$T_{verify\_ctr} = T_{mtrd} + T^{read}_{mtverify} + T_{mtwr} \approx T_{mtrd} + T^{read}_{mtverify}. \tag{.2}$$

When a counter is updated, all the upper-level Merkle tree nodes must be updated as well. If the node to update is not cached, it needs to be verified before being updated and put into the cache. If all the upper-level nodes are cached, there is no need to verify them, so their parallel verification can be avoided completely. Similar to counter verification, given a Merkle tree of height $H$ whose lowest hit level is $L$, the total time for retrieving and verifying all the upper-level nodes, $T^{write}_{mtverify}$, is:

$$T^{write}_{mtverify} = \max\left(\frac{B \times (H - L)}{BW_{memrd}},\ T_{mt\_hit}\right) + T_{hash} + (H - 1) \approx \frac{B \times (H - L)}{BW_{memrd}} + T_{hash}. \tag{.3}$$

After all the upper-level nodes are verified, the nodes are updated sequentially from the bottom to the top by calculating the hash value of each child block. After being updated, each level is written to the cache regardless of write hit or write miss. Therefore, the total time to update a single counter, $T_{update\_ctr}$, is

$$T_{update\_ctr} = T_{hash} + H \times \max(T_{mt\_hit},\ T_{hash}) + T^{write}_{mtverify} \approx (H + 1) \times T_{hash} + T^{write}_{mtverify}. \tag{.4}$$

In contrast, since the baseline Merkle tree scheme is recursive and sequential, the parent level block is accessed to verify the child block only when the child level misses in the cache. All the required Merkle tree blocks are read sequentially, therefore $T'_{mtrd} = T_{memrd} \times (H - L + 1)$. The child level can start hashing the current block only when the upper level has been verified and returned, thus the total time is linear in the number of uncached levels, i.e., $T'_{mtverify} = T_{hash} \times (H - L)$. Similarly to ARES, after all its levels are verified, the baseline system writes all the uncached levels into the cache with total latency $T'_{mtwrite} = T_{mt\_hit} \times (H - L)$. Therefore, the total time to read a counter and verify it with an unoptimized Merkle tree scheme is

$$T'_{verify\_ctr} = T'_{mtrd} + T'_{mtverify} + T'_{mtwrite} \approx T'_{mtverify} + T'_{mtrd}. \tag{.5}$$

When a ciphertext is read from the lower-level memory and decrypted, both data and counter are read from memory in parallel. Since HMAC operates on the ciphertext, the HMAC calculation and the AES decryption can be executed in parallel, thus depending on whether the counter hits in the counter cache, the total latency is:

$$T_{read} = \begin{cases} \max(T_{aes},\ T_{hmac}), & \text{if a cache hit,} \\ \max(T_{aes},\ T_{hmac}) + T_{verify\_ctr}, & \text{otherwise.} \end{cases} \tag{.6}$$

Similarly, the total latency of the baseline implementation to read a counter is:

$$T'_{read} = \begin{cases} \max(T_{aes},\ T_{hmac}), & \text{if a cache hit,} \\ \max(T_{aes},\ T_{hmac}) + T'_{verify\_ctr}, & \text{otherwise.} \end{cases} \tag{.7}$$

Putting it all together, the latency decrease of ARES over the non-optimized recursive Merkle tree scheme is:

$$S_{read} = \frac{T_{read}}{T'_{read}} \approx \frac{T_{hmac}}{T_{hmac} + (H - L) \times T_{hash}}. \tag{.8}$$
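To make the difference between the pipelined and the recursive scheme concrete, the sketch below evaluates Equations (.1), (.2), and (.5) side by side. All numeric parameters in the usage example are hypothetical placeholders expressed in clock cycles (with bandwidth in bytes per cycle), not measurements from ARES, and the function names are illustrative.

```python
def ares_verify_ctr(h, l, t_memrd, t_hash, t_mt_hit, block_bytes, bw_rd):
    """Counter verification time of ARES, following Eqs. (.1) and (.2)."""
    t_mtrd = t_memrd + max(block_bytes * (h - l) / bw_rd, t_mt_hit)
    t_mtverify = t_hash + h * 1   # pipelined hash unit, one new input per cycle
    t_mtwr = t_mt_hit             # write-back of fetched nodes, no evictions
    return t_mtrd + t_mtverify + t_mtwr


def baseline_verify_ctr(h, l, t_memrd, t_hash, t_mt_hit):
    """Counter verification time of the recursive baseline, following Eq. (.5)."""
    t_mtrd = t_memrd * (h - l + 1)    # blocks are read one after another
    t_mtverify = t_hash * (h - l)     # hashing is serialized across levels
    t_mtwrite = t_mt_hit * (h - l)
    return t_mtrd + t_mtverify + t_mtwrite


if __name__ == "__main__":
    # Hypothetical values: 7-level tree, nothing cached, 64B blocks.
    params = dict(h=7, l=0, t_memrd=50, t_hash=80, t_mt_hit=2)
    print(ares_verify_ctr(block_bytes=64, bw_rd=64, **params))
    print(baseline_verify_ctr(**params))
```

With these placeholder values the baseline cost grows with the number of uncached levels in both its memory and hashing terms, while the ARES cost pays the memory latency and the hash latency roughly once.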

Parallel Counter Recovery
Each counter block contains 64 minor counters and a major counter. Currently we assume that no minor counter overflows, so that only minor counters need to be recovered. Each 7-bit minor counter $ctr$ is associated with a 64-byte data block and a 64-bit ECC. For each pair of data block and ECC, all the minor counters within the range $[ctr, ctr + N)$ are used to decrypt both the ciphertext and the ECC to find the correct minor counter, so the time for recovering a single minor counter is $T_{smc} = N \times (T_{aes} + T_{memrd})$. For each counter block, the recovery of the 64 minor counters is executed sequentially, so the total time $T_{sc}$ to read a single counter block, recover it, and write the counter block back to memory is:

$$T_{sc} = T_{memrd} + T_{smc} \times 64 + T_{memwr}. \tag{.9}$$
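The trial-decryption search over the range [ctr, ctr + N) can be sketched in Python as follows. This is only a toy model under stated assumptions: a SHA-256 based pad stands in for the AES counter-mode pad, a truncated hash stands in for the 64-bit ECC, minor-counter wraparound is ignored, and every function name is hypothetical rather than taken from the ARES implementation.

```python
import hashlib


def pad(key, addr, counter, length):
    """Toy one-time pad derived from (key, address, counter); stand-in for AES."""
    out = b""
    i = 0
    while len(out) < length:
        seed = key + addr.to_bytes(8, "little") + counter.to_bytes(8, "little")
        out += hashlib.sha256(seed + i.to_bytes(4, "little")).digest()
        i += 1
    return out[:length]


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def toy_ecc(data):
    """64-bit check value; a stand-in for the real ECC code."""
    return hashlib.sha256(data).digest()[:8]


def encrypt_block(key, addr, counter, plaintext):
    """Encrypt the data block and its ECC with the same counter."""
    p = pad(key, addr, counter, len(plaintext) + 8)
    return xor(plaintext, p[:len(plaintext)]), xor(toy_ecc(plaintext), p[len(plaintext):])


def recover_minor_counter(key, addr, stale_ctr, n, ciphertext, ecc_ct):
    """Try each candidate in [stale_ctr, stale_ctr + n) until the ECC matches."""
    for cand in range(stale_ctr, stale_ctr + n):
        p = pad(key, addr, cand, len(ciphertext) + 8)
        plaintext = xor(ciphertext, p[:len(ciphertext)])
        if toy_ecc(plaintext) == xor(ecc_ct, p[len(ciphertext):]):
            return cand
    return None  # the true counter lies outside the searched window


if __name__ == "__main__":
    key, addr = b"k" * 16, 0x40
    ct, ecc_ct = encrypt_block(key, addr, 5, bytes(range(64)))
    print(recover_minor_counter(key, addr, stale_ctr=3, n=4, ciphertext=ct, ecc_ct=ecc_ct))
```

Each candidate costs one decryption, which is why the per-counter recovery time grows linearly with N.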

ARES adopts multiple counter recovery modules to recover $P_{ctr}$ counter blocks simultaneously, so the total latency $T_{ctr}$ to recover all the counter blocks is:

$$T_{ctr} = \frac{M}{P_{ctr}} T_{sc} \tag{.10}$$

Here M denotes the total number of counter blocks to recover.
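The counter-recovery timing model can be evaluated with a few lines of Python. The function below is a sketch with hypothetical parameter names and illustrative numbers; it assumes that M is divisible by the number of recovery modules, as the closed-form expression implicitly does.

```python
def counter_recovery_time(n_candidates, t_aes, t_memrd, t_memwr,
                          num_blocks, p_ctr, minors_per_block=64):
    """Total counter recovery time following the model above.

    n_candidates is N, num_blocks is M and p_ctr is the number of
    parallel recovery modules; all times share one unit.
    """
    t_smc = n_candidates * (t_aes + t_memrd)             # one minor counter
    t_sc = t_memrd + t_smc * minors_per_block + t_memwr  # one counter block
    return num_blocks / p_ctr * t_sc                     # all blocks, in parallel


if __name__ == "__main__":
    # Hypothetical values: N = 4, 40-cycle AES, 100-cycle memory read and write.
    print(counter_recovery_time(4, 40, 100, 100, num_blocks=1024, p_ctr=8))
```

The model makes the trade-off visible: doubling the number of recovery modules halves the total time, while the search window N scales the dominant per-block term.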

Parallel Merkle Tree Rebuilding
After all counters are recovered, a Merkle tree is rebuilt from the bottom level to the top. For each 64-byte Merkle tree node, 8 child nodes are first read from memory, and for each child node the 64-bit hash value is calculated and filled into the parent node. After all 8 child nodes have been traversed, the parent node is complete and is written to memory. Therefore, to recover a single Merkle Tree node, the latency $T_{smt}$ involves reading 8 child nodes, hashing the child nodes in a pipeline with a single hash unit, and writing the parent node back to memory:

$$T_{smt} = T_{memrd} + (T_{hash} + (8 - 1) \times 1) + T_{memwr}. \tag{.11}$$
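The bottom-up rebuild can be sketched functionally in Python as below. This is a software illustration only, not the hardware pipeline: SHA-256 truncated to 64 bits is an assumed stand-in for the hash unit, the number of leaves is assumed to be a power of 8, and all names are hypothetical.

```python
import hashlib

ARITY = 8  # each 64-byte parent holds eight 64-bit child hashes


def hash64(block):
    """64-bit digest of a node; truncated SHA-256 is an assumption."""
    return hashlib.sha256(block).digest()[:8]


def rebuild_parent(children):
    """Concatenate the 64-bit hashes of 8 children into one 64-byte parent."""
    if len(children) != ARITY:
        raise ValueError("a parent node needs exactly 8 children")
    return b"".join(hash64(child) for child in children)


def rebuild_tree(leaves):
    """Rebuild every level from the leaves up; returns levels, root level last."""
    level = list(leaves)
    levels = [level]
    while len(level) > 1:
        level = [rebuild_parent(level[i:i + ARITY])
                 for i in range(0, len(level), ARITY)]
        levels.append(level)
    return levels


if __name__ == "__main__":
    # 64 hypothetical 64-byte counter blocks give levels of 64, 8 and 1 nodes.
    blocks = [bytes([i]) * 64 for i in range(64)]
    print([len(level) for level in rebuild_tree(blocks)])
```

Nodes within one level do not depend on each other, which is what allows sub-trees to be rebuilt by independent instances.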

Similar to the parallel counter recovery, ARES instantiates $P_{mt}$ instances to recover sub-trees in parallel. Thus, given a Merkle tree of height $H$, the total time $T_{mt}$ for recovering all the Merkle tree nodes is:
$$T_{mt} = \frac{1}{P_{mt}} \sum_{l=0}^{H-1} 8^{l}\, T_{smt} \tag{.12}$$
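For a quick numerical check of the rebuild-time model, the sum over levels can be computed directly. The Python sketch below uses hypothetical names and sample cycle counts; it treats the pipeline fill term as one cycle per additional child, as in the single-node latency.

```python
def merkle_rebuild_time(t_memrd, t_hash, t_memwr, height, p_mt, arity=8):
    """Total Merkle tree rebuild time for a tree of the given height.

    p_mt is the number of parallel rebuild instances; all times share one unit.
    """
    # Single-node latency: read, pipelined hashing of the children, write back.
    t_smt = t_memrd + (t_hash + (arity - 1) * 1) + t_memwr
    # Level l contributes arity**l nodes, for l = 0 .. height - 1.
    nodes = sum(arity ** level for level in range(height))
    return nodes * t_smt / p_mt


if __name__ == "__main__":
    # Hypothetical values: 100-cycle memory accesses, 80-cycle hash, H = 4.
    print(merkle_rebuild_time(100, 80, 100, height=4, p_mt=4))
```

Since the node count is a geometric series, the lowest level dominates the total and the benefit of parallel instances is largest there.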

When recovering a single Merkle tree node, a new 64-byte child node is required on every clock cycle; thus, when parallelized by a factor of $P_{mt}$, $P_{mt}$ 64-byte blocks are required per clock cycle. Bounded by the memory bandwidth, the parallelism $P_{mt}$ must satisfy:

$$P_{mt} \times f_{clk} \times 64\,\text{B} \le BW_{memrd}. \tag{.13}$$
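The bandwidth bound translates into a maximum usable parallelism, which the following Python sketch computes. The function name and the bandwidth and clock figures are hypothetical and serve only to illustrate the inequality.

```python
def max_merkle_parallelism(bw_memrd_bytes_per_s, f_clk_hz, block_bytes=64):
    """Largest integer parallelism with p * f_clk * block_bytes <= read bandwidth."""
    return bw_memrd_bytes_per_s // (f_clk_hz * block_bytes)


if __name__ == "__main__":
    # Hypothetical platform: 19.2 GB/s read bandwidth and a 100 MHz clock.
    print(max_merkle_parallelism(19_200_000_000, 100_000_000))
```

Instantiating more rebuild units than this bound would leave them stalled on memory reads rather than shortening the rebuild.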

