Emerging non-volatile main memory (NVMM) is rapidly being integrated into computer systems. However, NVMM is vulnerable to potential data remanence and replay attacks.
I. INTRODUCTION
Non-volatile main memory (NVMM) is coming online, offering non-volatility, good scaling potential, high density, low idle power, and byte addressability. A recent NVMM example is Intel Optane DC Persistent Memory, providing a capacity of 3TB per socket [22]. Due to non-volatility, data may remain in main memory for a very long time even without power, exposing data to potential attackers [8]. Consequently, NVMM requires memory encryption and integrity protection to match the security of DRAM (we refer to such a system as secure NVMM), or to provide a secure enclave environment. Furthermore, NVMM is expected to store persistent data that must provide crash recoverability, a property where a system can always recover to a consistent memory state after a crash. Crash recoverability offers multiple benefits, such as allowing persistent data to be kept in in-memory data structures instead of in files, and serving as a fault tolerance technique that reduces checkpointing frequency [1], [14], [23], [24], [46]. Finally, some applications have emerged that need to run in a secure enclave and yet require persistency and crash recovery, such as a shadow file system [19].
Crash recovery of data with NVMM is achieved through defining and using memory persistency models. However, traditional memory persistency models do not automatically extend to secure NVMM, which incurs two new requirements: (1) the correct plaintext value of data must be recovered, and (2) data recovery must not trigger integrity verification failure. To meet these requirements, the central question is what items must persist together, and what persist ordering constraints are needed to guarantee the above crash recovery requirements. Prior studies have not answered this question fully. Liu et al. pointed out that counters, data, and message authentication codes (MACs) must persist atomically [33], but ignored the Merkle Tree that provides the integrity protection required to avoid effortless cryptanalysis. Awad et al. pointed out that the Merkle Tree must also be persisted leaf-to-root [4], but did not answer what ordering requirements are needed for correct crash recovery, or how they are related to persistency models.
The focus of this work is to comprehensively analyze the persist requirements and persist ordering requirements for correct crash recovery of secure NVMM. Getting this analysis right is important. Not only does it affect correctness (i.e., whether the above crash recovery requirements are met), but it also affects performance overheads (i.e., accurate quantification of the actual performance overheads) and the reasoning about what performance optimizations are possible. For example, one property missed by prior work is that leaf-to-root updates of Bonsai Merkle Trees (BMT) must follow persist order, otherwise crash recovery may trigger integrity verification failure at system recovery. Obeying this ordering constraint, we found that the overhead of crash recoverable strict persistency (SP) is about a 20× slowdown, which is more than one order of magnitude higher than the previously reported slowdown.
In this paper, we analyze and derive invariants that are needed to ensure correct crash recovery (i.e., the correct plaintext value is recovered and no integrity verification failure is triggered). Then, to reduce the performance overheads, we propose performance optimizations, which we refer to as persist-level parallelism, or PLP, that comply with the invariants for strict and epoch persistency (EP) models. For SP, we found that pipelining BMT updates is an effective PLP optimization, which brings down the performance overheads from 7.2× to 2.1×, compared to a secure processor model with write-back caches but not supporting any persistency model. We then analyze EP, where persist ordering within an epoch is relaxed but enforced across epochs. Under EP, two more PLP optimizations are enabled besides pipelining: out-of-order BMT update and BMT update coalescing. These two optimizations reduce overheads to 20.2%.
To summarize, the contributions of this paper are:
• To our knowledge, this is the first work that fully analyzes crash recovery correctness for secure NVMM, and formulates crash recovery invariants required under different persistency models.
• For strict persistency, we propose a new optimization for pipelining BMT updates.
• For epoch persistency, we propose two new optimizations: out-of-order BMT updates and BMT update coalescing.
• We point out that, due to incomplete reasoning of crash recovery, prior studies did not provide correct crash recovery and substantially underestimated its performance overheads.
• We present an evaluation showing that our proposed PLP optimizations above significantly reduce the performance overhead of secure NVMM.
The remainder of the paper is organized as follows. Section II presents the background and related work. Section III dives into the motivation for our work. Section IV details four BMT update systems, including the baseline used for evaluation and the three models proposed. Section V discusses our hardware architecture, Section VI evaluates our proposed update mechanisms, and Section VII concludes this work.
II. BACKGROUND AND RELATED WORK
A. Threat Model
We assume an adversary who has physical access to the memory system (NVMM and system bus), e.g. through ownership, theft, acquisition after system disposal, etc. Similar to the incidence of recovering sensitive data from improperly disposed used hard drives [42] , [54] , data remanence in NVMM extends such vulnerabilities to data in memory [8] . In addition, NVMMs are potentially vulnerable to replay attacks [2] and cold boot attacks [20] , [38] , which allow malicious entities access to the systems. Similar to prior work [3] , [4] , [30] , [31] , [45] , we assume that the adversary cannot read the content of on-chip resources such as registers and caches, hence the processor chip forms the trust boundary where trusted computing base (TCB) may be located. All off-chip devices, including main memory and memory bus, are considered vulnerable to both passive (snooping) and active (tampering) attacks. These assumptions are essential to secure processor architecture [9] , [15] , [48] , [50] , [53] , [56] , [57] .
B. Memory Encryption
The goal of memory encryption is to conceal the plaintext of data written to the off-chip main memory [29] , [32] , [49] , [63] .
Counter mode encryption [3] , [37] , [43] , [56] is commonly used for this purpose. It works by encrypting a counter to generate a pseudo one time pad (OTP) which is XORed with the plaintext (or ciphertext) to get ciphertext (or plaintext). To be secure, pads cannot be reused, and hence the counter must be incremented after each write back (for temporal uniqueness) and concatenated with address to form a seed (for spatial uniqueness). Counters may be monolithic (as in Intel SGX [12] , [18] ) or split (as in Yan et al. [43] , [56] ). Split counter co-locates a per-page major counter and many per-block minor counters on a single cache block, and each cache block is represented by the concatenation of a major and a minor counter. Due to its much lower memory overhead (1.56% vs. 12.5% with monolithic counter [56] ), counter cache performance increases and the overall decryption overhead decreases. Hence, we assume the use of a split counter organization for the rest of the paper.
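The pad generation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the encryption engine of any real processor: SHA-256 over key and seed stands in for the block cipher (Python's standard library has no AES), the 7-bit minor counter width is assumed for illustration, and the names `split_counter`, `one_time_pad`, and `xor_with_pad` are hypothetical.

```python
import hashlib

BLOCK_BYTES = 64  # cache-block size assumed for this sketch


def split_counter(major: int, minor: int, minor_bits: int = 7) -> int:
    """Concatenate a per-page major and a per-block minor counter.

    The 7-bit minor width is an assumption made for illustration.
    """
    return (major << minor_bits) | minor


def one_time_pad(key: bytes, address: int, counter: int) -> bytes:
    """Derive a pad from the seed (address || counter).

    SHA-256 is only a stand-in for the block cipher a real engine would use.
    """
    seed = address.to_bytes(8, "little") + counter.to_bytes(8, "little")
    pad = b""
    chunk = 0
    while len(pad) < BLOCK_BYTES:
        pad += hashlib.sha256(key + seed + bytes([chunk])).digest()
        chunk += 1
    return pad[:BLOCK_BYTES]


def xor_with_pad(key: bytes, address: int, counter: int, block: bytes) -> bytes:
    """Encrypt or decrypt: XOR with the pad is its own inverse."""
    pad = one_time_pad(key, address, counter)
    return bytes(b ^ p for b, p in zip(block, pad))


key = b"k" * 16
plaintext = bytes(range(BLOCK_BYTES))
counter = split_counter(major=3, minor=5)
ciphertext = xor_with_pad(key, 0x1000, counter, plaintext)
recovered = xor_with_pad(key, 0x1000, counter, ciphertext)
stale = xor_with_pad(key, 0x1000, counter - 1, ciphertext)  # wrong counter
```

Decrypting with the matching address and counter returns the plaintext, while a stale counter yields garbage, which is why the counter has to be recoverable along with the data.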
C. Memory Integrity Verification
Memory encrypted using counter mode encryption is vulnerable to a counter replay attack which allows the attacker to break the encryption [56], hence memory integrity verification is needed not only to protect data integrity, but also to protect encryption from trivial cryptanalysis [40], [61]. Data fetched from off-chip memory must be decrypted and its integrity verified when it arrives on chip. Early memory integrity protection relied on a Merkle Tree covering the entire memory [16], with the root of the tree always kept securely on chip. For counter mode encryption, Rogers et al. proposed the Bonsai Merkle Tree (BMT) [43], which employs stateful MACs to protect data, leaving a much smaller and shallower tree covering only counters. A stateful MAC uses data, address, and counter as input to the MAC calculation; any modification to any MAC input or to the MAC itself becomes detectable. Since it is sufficient to have one input component with freshness protection, the BMT only needs to cover counters. Intel SGX adopted this observation to design a similar stateful MAC approach to construct a counter tree that combines counters and MACs [18]. Saileshwar et al. [45] and Taassori et al. [52] discussed optimizations that further reduce the BMT size.
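A stateful MAC can be illustrated with a short sketch. It assumes HMAC-SHA256 truncated to 8 bytes as the MAC function, which is a choice made here for illustration only; `stateful_mac` and `verify` are hypothetical names. The point is that replaying an old ciphertext together with its old MAC is detected once the counter is fresh.

```python
import hashlib
import hmac


def stateful_mac(key: bytes, ciphertext: bytes, address: int, counter: int) -> bytes:
    """MAC over (data, address, counter); HMAC-SHA256 is an assumed stand-in."""
    message = ciphertext + address.to_bytes(8, "little") + counter.to_bytes(8, "little")
    return hmac.new(key, message, hashlib.sha256).digest()[:8]


def verify(key: bytes, ciphertext: bytes, address: int, counter: int, mac: bytes) -> bool:
    """Recompute the MAC from the fetched items and compare in constant time."""
    return hmac.compare_digest(stateful_mac(key, ciphertext, address, counter), mac)


key = b"integrity-key"
old_ct, new_ct = b"old ciphertext", b"new ciphertext"
old_mac = stateful_mac(key, old_ct, 0x1000, counter=4)
new_mac = stateful_mac(key, new_ct, 0x1000, counter=5)

fresh_ok = verify(key, new_ct, 0x1000, 5, new_mac)       # untampered block
replay_ok = verify(key, old_ct, 0x1000, 5, old_mac)      # old data + old MAC, fresh counter
relocated_ok = verify(key, new_ct, 0x2000, 5, new_mac)   # block moved to another address
```

Only the counter needs freshness protection from the tree: as long as the counter fed into verification is the current one, stale data and stale MACs fail the check.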
D. Intel SGX
Secure enclaves, e.g. Intel SGX, were designed to provide secure execution environments for application software. By combining memory encryption and integrity protection with attestation and key sealing/management, SGX protects an application against compromised system software (OS and hypervisor). The SGX Memory Encryption Engine (MEE) uses monolithic 56-bit counters instead of the more space-efficient split counters. An integrity "counter" tree is used for integrity verification. A counter tree node co-locates counters with a MAC; the MAC accepts as input the counters it is co-located with and the parent counter for the block. This results in a highly interleaved but parallelizable integrity tree that covers the entire enclave memory [12], [18].
E. Memory Persistency
Memory persistency is defined to allow the reasoning of crash recovery for persistent data, an important benefit offered by non-volatile main memory (NVMM) [6], [7], [11], [13], [25], [27], [34], [39], [41], [55]. Specifically, it defines the visible ordering of loads and stores seen by a crash recovery observer [35], [39]. A persistency model defines when a store persists (i.e. becomes durable) with respect to other stores of the same thread, and is often coupled with a memory consistency model to ensure visibility to other threads. The most conservative model, strict persistency (SP), requires that persists follow the sequential program order of stores [39]. While providing simple reasoning, SP does not allow any overlapping or reordering of persists, limiting optimization opportunities in the system and incurring high performance overheads. More relaxed persistency models include epoch persistency (EP) and buffered epoch persistency (BEP) [39], as well as lazy persistency [1]. With EP (or BEP), programmers define regions of code that form epochs [17], [26]. Persists within an epoch can be reordered and overlapped, but persists across epochs are strictly ordered using persist barriers, which enforce that persists in an older epoch must complete prior to the execution (or completion) of any persist from a younger epoch. On top of a persistency model, crash recovery often requires the programmer to define atomic durable code regions [10], [13], [36], [44], [47], [60].
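The difference between SP and EP can be made concrete with a small checker. This is a simplified sketch: it models persist completion as a single total order (real persists within an epoch may overlap in time), and the function names are hypothetical.

```python
def valid_under_sp(program_order, persist_order):
    """Strict persistency: persists must complete exactly in program order."""
    return list(persist_order) == list(program_order)


def valid_under_ep(epoch_of, persist_order):
    """Epoch persistency: free reordering inside an epoch, strict order across epochs.

    epoch_of maps each store to its epoch number (older epochs have smaller numbers).
    """
    epochs = [epoch_of[store] for store in persist_order]
    return all(older <= younger for older, younger in zip(epochs, epochs[1:]))


program_order = ["a", "b", "c", "d"]
epoch_of = {"a": 0, "b": 0, "c": 1, "d": 1}   # a persist barrier sits between b and c

within_epoch_swap = ["b", "a", "d", "c"]      # reordered only inside epochs
cross_epoch_swap = ["a", "c", "b", "d"]       # c overtakes b across the barrier
```

Here `within_epoch_swap` is rejected by SP but accepted by EP, while `cross_epoch_swap` is rejected by both.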
F. Secure NVMM for Crash Recovery
Data remanence vulnerability exists with DRAM as data may persist for weeks under very low temperature [20], [38]. The vulnerability is much worse with NVM since data is retained for years, hence self-encrypting memory has been proposed [8], [59], [63]. However, NVM will likely host persistent data that requires supporting crash recovery, which then requires integrating memory encryption and integrity verification with memory persistency. This has been explored only recently. Swami et al. [51] proposed co-locating data, counters, and MAC, to make it easier to atomically persist them together. Liu et al. [33] proposed a similar approach, plus an alternative approach of using the memory controller (MC) as a gathering point for atomic persistence. While necessary, these studies ignored the Merkle Tree integrity verification required for correct crash recovery. Awad et al. [4] looked at persisting data, counters, and BMT, but ignored the ordering requirements needed for correct crash recovery, and the persistency models that are relevant for the ordering. Zuo et al. [64] proposed coalescing counters for persisting counter cache data, but did not acknowledge additional requirements for counter integrity verification.
III. CORRECTNESS OF CRASH RECOVERY
In this section, we formulate the invariants that need to be ensured in order to support crash recovery for secure NVMM. The system we assume is one with volatile on-chip caches, with the persistent domain including the NVMM and the write pending queue (WPQ) inside the MC. Our analysis focuses on a system with counter-mode memory encryption along with MAC and BMT memory integrity verification. Counters, MACs, and BMT nodes are cacheable and can be lost with the loss of power, except the BMT root which is always stored persistently on chip. We will discuss Intel SGX MEE later in the paper.
Suppose that plaintext $P$ at address $A$ is encrypted using counter $\gamma$ and private key $K$ to yield ciphertext $C$, i.e., $C = E_K(P, A, \gamma)$, and necessarily the decryption follows $P = D_K(C, A, \gamma)$. Suppose also that $M$ represents a message authentication code for $C$, i.e., $M = \mathrm{MAC}_K(C, A, \gamma)$. Finally, suppose that the BMT covers all counters and has a root $R$. We define the BMT update path as follows:
Definition 1: BMT update path is the path of nodes from a leaf node (i.e., one encryption page) to the root of BMT.
Figure 1 shows an example with two persists that generate updates to the BMT. Update $\delta_1$ is represented by nodes shown in grey while update $\delta_2$ is shown with stripes. Note that all update paths necessarily intersect at the root but the intersection can also happen earlier.
Definition 2: Common Ancestors of two persists are nodes in the BMT tree that appear in the BMT update paths of both persists. The Least Common Ancestor (LCA) is a common ancestor that is at the lowest-to-leaf level compared to all other common ancestor nodes.
In the example in Figure 1, the common ancestor consists of only the BMT root, hence the BMT root is also the LCA. However, if another persist causes an update at node X46, then this update and $\delta_2$ share X22 and X1 as common ancestors, with X22 being the LCA.
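Definitions 1 and 2 can be checked with a short sketch. It assumes a 4-level binary tree whose nodes are labeled X<level><index> with the root written X1, which is inferred from the node names used in the text (X41, X48, X46, X22, X1) rather than taken from the figure itself; the function names are hypothetical.

```python
def label(level, index):
    """Node name in the assumed labeling: root is X1, others X<level><index>."""
    return "X1" if level == 1 else f"X{level}{index}"


def update_path(leaf_index, levels=4, arity=2):
    """BMT update path from a leaf (1-based index) up to the root."""
    path, index = [], leaf_index
    for level in range(levels, 0, -1):
        path.append(label(level, index))
        index = (index + arity - 1) // arity   # 1-based index of the parent
    return path


def common_ancestors(path_a, path_b):
    """Nodes present in both update paths, listed from leaf side to root."""
    shared = set(path_a) & set(path_b)
    return [node for node in path_a if node in shared]


def lca(path_a, path_b):
    """The common ancestor closest to the leaves."""
    return common_ancestors(path_a, path_b)[0]


delta1 = update_path(1)   # ['X41', 'X31', 'X21', 'X1']
delta2 = update_path(8)   # ['X48', 'X34', 'X22', 'X1']
other = update_path(6)    # ['X46', 'X33', 'X22', 'X1']
```

With these paths, `lca(delta1, delta2)` is the root X1, while the update at X46 shares X22 and X1 with $\delta_2$, giving X22 as their LCA, matching the example in the text.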
We also define a memory tuple as a collection of items that are needed to crash recover a datum:
Definition 3: Secure memory transforms an on-chip plaintext data P at block address A to a memory tuple of (C, γ, M, R) when data is persisted to main memory, and vice versa when persisted data is read from main memory.
The memory tuple represents the totality of transformation of a block when it is written back (out of the last level cache or LLC) to off-chip memory, and we claim that each tuple item must be available in order to recover data correctly, and failure to persist any item(s) in the tuple results in a crash recovery problem:
Invariant 1: Crash Recovery Tuple Invariant. In a secure memory with counter-mode encryption and MAC/BMT verification, in order to recover a datum P that was persisted in memory, its entire memory tuple (C, γ, M, R) must have been persisted as well.
To illustrate this, suppose that a plaintext value $P_o$ is changed to a new value $P_n$. The memory tuple for the block then must change from
$$(C_o, \gamma_o, M_o, R_o) \quad \text{to} \quad (C_n, \gamma_n, M_n, R_n).$$
If some tuple item was not persisted, for example $M_n$, then post-crash the tuple $(C_n, \gamma_n, M_o, R_n)$ is recovered. In this case, the correct plaintext is recovered but MAC verification fails because the old MAC ($M_o$) fetched from memory does not match $\mathrm{MAC}_K(C_n, A, \gamma_n)$. If instead $\gamma_n$ was not persisted, the correct plaintext is not recovered, since $P_n \neq D_K(C_n, A, \gamma_o)$. Not only that, since $\gamma_o$ is input to MAC and BMT verification, both verification mechanisms fail as well. Table I lists the outcomes of not persisting one or more items of the memory tuple.
Note that the crash tuple invariant (Invariant 1) specifies the necessary and sufficient condition for recovering data post crash. It does not specify exactly "when" tuple items must be persisted with respect to the data persist; this depends on the crash recovery expectation of the program and the persistency model being assumed.
So far we have discussed the crash recovery correctness for a single data persist. To support crash recovery, programmers must reason about not just a single persist, but multiple persists and the relative ordering between them. In this case, we assume that if there is a possibility that the crash recovery observer reads the persistent memory state between two persists, then the two persists must be ordered. Now suppose that there are two ordered persistent stores (persists) $\alpha_1$ and $\alpha_2$ to different blocks. The memory tuples of these blocks may modify the same counter block and the same MAC block, and they definitely modify the same BMT root. If the persist order of memory tuples is not followed, recoverability is problematic. For example, suppose that $\alpha_1 \to \alpha_2$ but $R_2 \to R_1$, which means that the BMT root is updated by the second persist before it is updated by the first persist. If a crash occurs prior to either of them or after both of them, recoverability is not jeopardized. But at other points, recovery can fail. For example, suppose that a crash occurs after $\alpha_1$ and $R_2$ persist but before $\alpha_2$ and $R_1$ persist. Post crash, BMT verification failure occurs due to the root not reflecting the persist of $\alpha_1$. In other words:
Invariant 2: Persist Order Invariant. Suppose that $\alpha_1$ happens before $\alpha_2$ in program order. If the crash recovery observer may read out the persistent state between $\alpha_1$ and $\alpha_2$, then $\alpha_2$ must follow $\alpha_1$ in persist order, i.e. $\alpha_1 \to \alpha_2$. If $\alpha_1 \to \alpha_2$ in persist order, then for correct crash recovery, the following must hold:
$$C_1 \to C_2, \quad \gamma_1 \to \gamma_2, \quad M_1 \to M_2, \quad R_1 \to R_2$$
in persist order, i.e. the persist order of the respective memory tuple items must follow the order of the data persists.
Note that the persist order depends on the persistency model that is assumed. For SP, every persist is ordered with respect to others. Hence, Invariant 2 applies to each pair of persists. However, for an EP model, persists are ordered, and hence Invariant 2 applies, only if they are from different epochs. Persists from the same epoch are unordered, which gives rise to optimization opportunities that we will discuss in Section IV.
The key consequence of Invariant 2 is that persist ordering imposes a very high cost that scales with the size of the BMT. Upon eviction of a block from the LLC, the data, its counter, and MAC are sent to the MC, but there they must wait until the BMT update from the leaf reaches the root before the persist can be considered successful. For example, a full 8-ary BMT constructed for an 8TB NVMM system would have a tree height of 12, meaning that for an atomic writeback of security metadata, the change to leaf nodes must traverse the 12 levels of the BMT to persist the BMT root, prior to the next persist. Assuming a hash latency of 80 processor cycles [30], this adds up to 960 processor cycles for one memory update!
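The arithmetic behind this figure can be written out as a worked equation. The symbols $L$ (number of BMT levels), $t_{\mathrm{hash}}$ (hash latency), and $N$ (number of back-to-back ordered persists) are introduced here for illustration, and the expression for $N$ persists assumes no overlap between consecutive leaf-to-root updates and no BMT cache misses:

$$T_{\mathrm{persist}} = L \cdot t_{\mathrm{hash}} = 12 \times 80 = 960 \text{ cycles}, \qquad T_{N} = N \cdot L \cdot t_{\mathrm{hash}} = 960\,N \text{ cycles}.$$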
IV. STREAMLINING BMT UPDATES
In this section, we explore how BMT update performance due to persists can be improved. The performance optimization techniques that are possible depend on (1) not violating the invariants discussed in the previous section, and (2) the persistency model that is assumed. We collectively refer to the key methods as persist-level parallelism (PLP): pipelining, out-of-order updates, and coalescing.
A. Strict Persistency
1) Baseline Atomic Persist Mechanism: Following Invariant 1, for each memory update, we need to ensure that all memory tuple components also persist. Due to the write-back cache, the eviction order of dirty blocks may be different from the program order. Therefore, with SP, one way to satisfy the invariant is to atomically persist the tuple generated by each store, which results in write-through cache behavior. To achieve this, we devise a 2-step persist (2SP) mechanism. Similar to [33], 2SP relies on the WPQ of the MC as the persist gathering point. 2SP consists of two steps: the first step involves holding and locking persist memory tuple components in the WPQ (while flagged as incomplete), while the second step flags the completion of the persist and releases tuple components to memory. A persist is marked completed when the WPQ receives its updated counter, MAC, and acknowledgement that the BMT root has been updated. Once completed, the blocks are allowed to drain from the WPQ to the NVMM. On power failure, any blocks flagged as incomplete are considered not persisted and are invalidated. Since the persistence of the counter and MAC is straightforward and not expensive, we will focus the rest of the discussion on the expensive BMT update.
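The two steps of 2SP can be sketched as a small software model. This is an illustrative sketch only, not the hardware design: the class and method names are hypothetical, NVMM is modeled as a dictionary, and dropping every entry that remains after the head-of-queue drain on power failure is a conservative assumption made here to keep persist order intact.

```python
from dataclasses import dataclass, field

# Items that must arrive before a held persist is flagged complete.
REQUIRED = {"counter", "mac", "root_ack"}


@dataclass
class WPQEntry:
    address: int
    ciphertext: bytes
    received: set = field(default_factory=set)

    @property
    def complete(self):
        return REQUIRED <= self.received


class TwoStepWPQ:
    def __init__(self):
        self.entries = []   # held in persist order, oldest first
        self.nvmm = {}      # address -> ciphertext for drained blocks

    def hold(self, address, ciphertext):
        """Step 1: hold and lock the block in the WPQ, flagged incomplete."""
        entry = WPQEntry(address, ciphertext)
        self.entries.append(entry)
        return entry

    def deliver(self, entry, item):
        """Step 2 fires once counter, MAC, and BMT root ack have all arrived."""
        entry.received.add(item)
        self.drain()

    def drain(self):
        """Release completed entries from the head of the queue to NVMM."""
        while self.entries and self.entries[0].complete:
            done = self.entries.pop(0)
            self.nvmm[done.address] = done.ciphertext

    def power_failure(self):
        """Incomplete entries are invalidated (assumed: anything behind them too)."""
        self.drain()
        self.entries.clear()


wpq = TwoStepWPQ()
first = wpq.hold(0x40, b"C1")
wpq.deliver(first, "counter")
wpq.deliver(first, "mac")
held_before_ack = dict(wpq.nvmm)      # still empty: root not yet updated
wpq.deliver(first, "root_ack")        # persist completes and drains
second = wpq.hold(0x80, b"C2")
wpq.deliver(second, "counter")
wpq.power_failure()                   # second persist is discarded
```

The block at 0x40 reaches NVMM only after the BMT root acknowledgement, and the block at 0x80 never does, so a post-crash reader sees either a whole memory tuple or none of it.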
To illustrate the mechanism, suppose that two persists are initiated, as shown in Figure 1. Figure 2 shows the sequence of persists of memory tuples due to the two persists, in the baseline persist mechanism. For persist $\delta_1$, ciphertext $C_1$, counter $\gamma_1$, and MAC $M_1$ are persisted. The new value of counter $\gamma_1$ is needed for the BMT update path starting from BMT leaf X41, which in turn is needed to update BMT node X31, and so on, until BMT root X1 is updated. When ciphertext $C_1$, counter $\gamma_1$, and MAC $M_1$ are completed and the BMT root is updated, $\delta_1$ is considered completed, after which persist $\delta_2$ can commence. It is clear that even though intermediate nodes in the BMT update path do not need to persist (only the leaves and root need to persist), the critical path is due to their sequential updates.
2) PLP Mechanism 1: Pipelining BMT Updates: While the baseline persist mechanism described in Section IV-A1 is correct, it suffers from high overheads. Each node in the BMT update path must wait until the previous node has been calculated. In order to improve this situation, recall that the Persist Order Invariant (Invariant 2) only requires that the BMT root update follows the persist order. This means that it is possible to update BMT nodes out of order, as long as the root is still updated in persist order. This is illustrated in Figure 3(a), where the update paths of persist $\delta_1$ and persist $\delta_2$ are updated out of order but updates to the BMT root are kept in persist order. While out-of-order non-root updates are best for performance, it is difficult to avoid write-after-write (WAW) hazards if two persists' BMT update paths intersect at more than just the BMT root. To avoid WAW hazards without much complexity, we design a more restrictive version of the optimization, namely pipelined BMT update. With a pipelined update, a younger persist is allowed to update a certain level of the BMT only when an older persist has completed its update of the BMT node at the same level.
This is illustrated in Figure 3 (b). The pipelined update optimization ensures that if two persists have common ancestor nodes, they will still be updated in persist order.
Note that as the memory grows bigger, the BMT will have more levels and hence more pipeline stages. Thus, one attractive feature of pipelined BMT updates is that with larger memories, the degree of PLP increases and pipelined BMT updates become even more effective versus non-pipelined updates.
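An idealized cycle count makes the effect of pipeline depth visible. The model below is an assumption-laden sketch rather than a measured result: it assumes every BMT node hits in the BMT cache, each level takes exactly one hash latency, and a new persist is always available to enter the pipeline; the function names are hypothetical.

```python
def sequential_cycles(num_persists, levels, hash_latency):
    """Baseline: each persist walks all levels before the next one starts."""
    return num_persists * levels * hash_latency


def pipelined_cycles(num_persists, levels, hash_latency):
    """In-order pipeline: after the first persist fills the pipeline,
    one persist reaches the root every hash latency."""
    return (levels + num_persists - 1) * hash_latency


def ideal_speedup(num_persists, levels, hash_latency=80):
    return (sequential_cycles(num_persists, levels, hash_latency)
            / pipelined_cycles(num_persists, levels, hash_latency))


single = pipelined_cycles(1, 12, 80)        # 960, same as the baseline
baseline_100 = sequential_cycles(100, 12, 80)   # 96000
pipelined_100 = pipelined_cycles(100, 12, 80)   # 8880
```

In this idealized model the speedup for a long stream of persists approaches the number of levels, so a deeper tree offers more overlap; real gains are smaller once BMT cache misses and limited persist streams are taken into account.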
B. Epoch Persistency
With EP, two persists in the same epoch do not have persist ordering constraints; persists only need to be ordered across separate epochs. This fact allows the write-back cache to reduce the write traffic and also gives us opportunities to optimize BMT updates. We make a stronger assumption on EP compared to that in the literature: Nalli et al. [36] assert that 75% of epochs update one 64B cache line, whereas we assume a minimum of one store per epoch. Specifically, we assume that crash recovery does not depend on the transient persistent state within an epoch while an epoch is executing. Instead, crash recovery depends only on the persistent state at an epoch boundary. This assumption requires that any actions performed by an epoch that were not completely persisted prior to a crash must be re-executable. This assumption is reasonable, because epochs are usually components of a durable transaction, and durable transactions can be re-executed if they fail.
1) PLP Mechanism 2: Out-of-Order BMT Updates: Invariant 2 applies to two persists that are ordered, i.e. in EP, they belong to two different epochs. It does not specify how to treat two persists that are not ordered, such as those belonging to the same epoch. The question then arises whether two unordered persists can be performed out of order (OOO), and if so, to what extent and whether there are any constraints that need to be observed.
Before discussing them further, let us first discuss the potential benefit of OOO. OOO BMT updates have a much better performance potential than (in-order) pipelining for two reasons. First, they can hide the BMT cache miss latency, as illustrated in Figure 4. Figure 4(a) shows a case where persist $\delta_1$ is attempting to update the BMT, but suffers a cache miss on BMT node X41. This introduces bubbles in the in-order BMT update pipeline, and persist $\delta_2$ is consequently delayed: it cannot update X48 until X41 is updated. Figure 4(b) illustrates that with OOO, both updates can occur in parallel, with $\delta_2$ not being delayed by the cache miss that $\delta_1$ must wait for. Therefore, OOO can achieve a higher degree of PLP compared to in-order pipelining. Second, OOO BMT updates enable us to use pipelined MAC units to improve the throughput. The in-order BMT update pipeline has the same number of stages as there are levels in the BMT, and there is at most one update at each level. Therefore, the throughput of pipelined BMT is limited to one BMT update per X cycles, where X is the MAC latency. In contrast, with OOO, a BMT update can start at every cycle, thereby increasing the throughput to one BMT update per cycle.
Regarding correctness of OOO execution of persists from the same epoch, a concern arises that there may be a write-after-write (WAW) hazard in the case where two persists have their BMT update paths intersecting at not just the BMT root. The hierarchical nature of the BMT dictates that if two BMT update paths intersect, the intersection representing common ancestors manifests as a common suffix in the paths, starting from the least common ancestor (LCA) node, and then continuing to the LCA's parent, grandparent, etc. until the BMT root. Does updating common ancestor nodes out of order trigger a WAW hazard? We assert that it does not.
In order to prove it, we note that different blocks will cause different counters to be updated. Let us denote the old counter values as $\gamma_{1o}$ and $\gamma_{2o}$ and the new values as $\gamma_{1n}$ and $\gamma_{2n}$. The counters correspond to either one BMT leaf node (if the counters are co-located in a block) or two BMT leaf nodes (if the counters are not co-located in a block). In the former case, the leaf node is the LCA, while in the latter the LCA is further up the tree. Suppose that persist $\delta_1$ updates the LCA before $\delta_2$. Then, at the end of the LCA update for both persists, the LCA value is $\mathrm{MAC}_K(\gamma_{1n}, \gamma_{2n}, \ldots)$. If instead $\delta_2$ updates the LCA before $\delta_1$, the LCA value is also $\mathrm{MAC}_K(\gamma_{1n}, \gamma_{2n}, \ldots)$, which is unchanged. Therefore, the final LCA value is the same, and hence the BMT root is also the same. The intermediate LCA value differs depending on whether $\delta_1$ or $\delta_2$ updates the LCA first. However, in EP, the crash recovery observer does not expect a particular persist order for two persists in the same epoch. Furthermore, Invariant 2 applies only when the crash recovery observer may read the persistent state between two persists, and under our EP assumption the observer does not read the transient state between two persists of the same epoch. For the latter case (counters not co-located in a block), $\delta_1$ and $\delta_2$ update different parts of the LCA, hence the same proof holds.
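The order-independence argument can be demonstrated with a few lines of code. The sketch assumes a node value computed as a truncated HMAC-SHA256 over the concatenation of its eight child slots; the key, slot width, and function names are illustrative assumptions, not the paper's MAC function.

```python
import hashlib
import hmac

KEY = b"demo-key"   # illustrative key only


def node_mac(children):
    """Node value: MAC over the concatenation of all child slots."""
    return hmac.new(KEY, b"".join(children), hashlib.sha256).digest()[:8]


def apply_updates(slots, updates):
    """Apply (slot, new_value) updates one by one; return each intermediate node MAC."""
    slots = list(slots)
    trace = []
    for slot, value in updates:
        slots[slot] = value
        trace.append(node_mac(slots))
    return trace


old_slots = [bytes([i]) * 8 for i in range(8)]   # eight 8-byte child slots
update_1 = (0, b"\xaa" * 8)                      # persist 1 changes slot 0
update_2 = (7, b"\xbb" * 8)                      # persist 2 changes slot 7

trace_12 = apply_updates(old_slots, [update_1, update_2])
trace_21 = apply_updates(old_slots, [update_2, update_1])
```

The final node value is identical for both orders, while the intermediate values differ; the difference is harmless only because no observer reads the state between two persists of the same epoch.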
The epoch boundary, however, places constraints on the degree of PLP, as it acts as a point of ordering; all persists in the previous epoch must complete before any persist in a new epoch can complete. Thus, the higher the number of persists in an epoch, the higher its potential PLP.
To handle OOO, the 2SP only needs minor modifications. When blocks belonging to persists from the same epoch are written back from the LLC, they are no longer locked in the WPQ. They are allowed to drain to persistent memory as they come. However, the WPQ retains enough state to monitor if the memory tuples of persists of the same epoch have all arrived at the WPQ or not. When they have all arrived, they are marked completed and the epoch is considered complete. On the other hand, blocks from the next future epoch are locked in the WPQ and marked incomplete, until the previous epoch has completed.
2) PLP Mechanism 3: BMT Update Coalescing: Further analysis of BMT updates within an EP model exposes a notable scenario that enables our final optimization. BMT updates within an epoch are likely to involve a substantial number of common ancestor nodes, due to spatial locality. While OOO allows updates to the BMT to be overlapped and performed out of order, there are still many updates to BMT nodes that occur. These updates can be considered superfluous considering that the same node may be updated multiple times by persists from the same epoch. In our final optimization, we seek to remove superfluous BMT updates by coalescing them. Figure 5 illustrates the update order of OOO persists with coalescing. Without coalescing, each persist incurs updates of four BMT nodes, causing a total of 12 updates. With coalescing, the updates of persists $\delta_1$ and $\delta_2$ are coalesced at their LCA (node X31), while the update of $\delta_3$ is coalesced at the LCA at node X21. As a result, there are only seven updates to the BMT, which in this example corresponds to a 42% reduction in BMT updates. Fewer updates to the BMT reduce the occupancy of the memory integrity verification engine, and hence reduce the latency and improve the throughput of the engine. Furthermore, an equally important benefit of coalescing is the reduced number of writes. Without coalescing, the BMT root is updated three times; with coalescing, it is updated only once.
Coalescing's effectiveness increases with spatial locality. Spatial locality results in nearby blocks being updated. In the best (and also frequent) case, blocks belonging to the same encryption page (a 4KB region) are updated within the epoch. They result in a single counter block being updated multiple times. Without coalescing, each such update generates BMT updates from leaf to root, while with coalescing, there is only one leaf-to-root update, thereby resulting in a substantial saving. 
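The update counts in the coalescing example can be reproduced by treating coalescing as taking the union of the update paths within an epoch. The sketch assumes a 4-level binary BMT and assumes that the three persists touch the first three leaves, which is consistent with LCAs at X31 and X21 but is not read off the figure; the function names are hypothetical.

```python
def update_path(leaf_index, levels=4, arity=2):
    """Nodes (level, index) touched from a leaf (1-based) up to the root (level 1)."""
    path, index = [], leaf_index
    for level in range(levels, 0, -1):
        path.append((level, index))
        index = (index + arity - 1) // arity
    return path


def count_updates(leaves, levels=4, arity=2):
    """Return (updates without coalescing, updates with coalescing) for one epoch."""
    paths = [update_path(leaf, levels, arity) for leaf in leaves]
    uncoalesced = sum(len(path) for path in paths)
    coalesced = len({node for path in paths for node in path})
    return uncoalesced, coalesced


three_neighbors = count_updates([1, 2, 3])   # (12, 7)
same_page = count_updates([5, 5, 5])         # (12, 4)
reduction = 1 - three_neighbors[1] / three_neighbors[0]
```

Three neighboring leaves give 12 updates without coalescing and 7 with it, a reduction of about 42%. When all three persists fall in the same encryption page, a single leaf-to-root walk of 4 updates suffices.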
V. ARCHITECTURE DESIGN
In this section, we propose an architecture design to enable the PLP optimizations. As a baseline architecture, we assume a discrete counter cache [56], BMT cache (mtcache) [4], [58], MAC cache [62], and persist-gathering WPQ [33]. These structures suffice if an unoptimized SP model is adhered to. To support our optimizations, additional structures are introduced, specifically schedulers, to retain the persist ordering. These schedulers contain information that enforces BMT update order by allowing or preventing writes. Each optimization has its own set of conditions for allowing or preventing writes, which we analyze next.
A. Strict Persistency Model: In-order Pipelined BMT Updates
To support our first PLP technique, in-order pipelined BMT updates for SP, we introduce a new structure called the persist tracking table (PTT) that enforces persist ordering in an SP model.
The PTT interacts with a scheduler that also interacts with the BMT cache and the MC/WPQ. Each entry in the PTT has multiple fields (Figure 6). The field Lvl indicates the level of the BMT that the persist is currently updating, and is used to enforce in-order pipelining by staggering persists on different BMT levels. Figure 6 shows an example of the PTT with four persist entries. $\delta_1$ is updating level 1 (node X1), while $\delta_2$ is updating level 2 (node X21), etc. The valid bit V is set when the entry is created and cleared when the persist has updated the BMT root. The ready bit R is set when updating the current BMT node has been completed, and cleared when the update moves on to the next node in the BMT update path. The PTT is managed as a circular buffer using a head and a tail pointer. The persist flag P is set when the BMT root has been updated and the entry can be removed: if the head pointer points to this entry (indicating that this entry is the oldest) and the P bit is set, then the BMT update is considered completed, and both the PTT entry and the WPQ entry can be deallocated. The WPQptr field points to the corresponding persist entry in the WPQ. The PendingNode field indicates the ID/label of the node currently being updated.
In the figure, δ 1 has finished updating the BMT root hence V = 0 and P = 1. δ 2 and δ4 have updated their current nodes shown in the PendingNode fields, i.e., X21 for δ 2 and X47 for δ4, hence R = 1. δ 3 's R bit is not set yet, either because the BMT node is not yet available for update (e.g. not found in the BMT cache/being fetched from memory), or the update has not completed (e.g., MAC is still being calculated).
The role of the scheduler is to decide when a persist can proceed to updating the next BMT level. To illustrate the working of the scheduler, suppose a new persist request is encountered. An entry is created in the WPQ to hold the data, counter, and MAC to persist. Concurrently, a new PTT entry is also created (Step 1), initialized to point to the corresponding WPQ entry, with the PendingNode labeled with the appropriate leaf BMT node (i.e., MAC of counter block). The valid bit is set, while the ready and persist bits are reset. In Step 2, the BMT cache is looked up for the PendingNode. If found (BMT cache hit), a new MAC is calculated and the node updated. If not found (BMT cache miss), the node is fetched from memory, and the update commences after the node arrives from memory and is verified for integrity. Once the BMT node at the current level is updated, the R bit is set.

Fig. 6: Example of in-order pipelined update mechanism with Persist Tracking Table (PTT) for SP.
For the scheduler to allow persist entries to move on to the next BMT levels, it waits until the R bits of these entries are set (Step 3), indicating completion of updates to the current BMT levels. Once the bits are set, the scheduler wakes up the entries to move on to the next BMT levels. The PendingNode is input into the Next Node Logic to yield the ID for the next node to update (Step 4). When the oldest entry (δ 1) finishes updating the BMT root, the entry's P bit is set and the WPQ is notified of BMT root update completion (Step 5). Afterward, the entry occupied by δ 1 can be released, the head pointer updated, and execution continues. At the WPQ, if the BMT root update completion notification is received and the other tuple items (data, counter, and MAC) are completed, the tuple items are marked as persisted and become releasable to memory.
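The five steps above can be mimicked in software to see how the staggering arises. The following Python sketch is only an illustration under simplifying assumptions that are not part of the design: every node update takes exactly one step (no BMT cache misses), one persist is admitted per step, all leaves sit at the same depth, and nodes use breadth-first labels with the root at 0. The names `PTTEntry`, `next_node`, `run`, and the value of `ARITY` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass

ARITY = 8  # assumed BMT arity (hypothetical value for this sketch)


@dataclass
class PTTEntry:
    wpq_ptr: int             # WPQptr: index of the matching WPQ entry
    pending_node: int        # PendingNode: label of the node being updated (root = 0)
    valid: bool = True       # V: set on allocation, cleared after the root update
    ready: bool = False      # R: update of the current node has completed
    persisted: bool = False  # P: root updated, entry may be released


def next_node(label: int) -> int:
    """Stand-in for the Next Node Logic: parent label under breadth-first labels."""
    return (label - 1) // ARITY


def run(leaves):
    """Admit one persist per step; return (steps taken, WPQ pointers in release order)."""
    ptt = deque()                        # head = oldest entry, tail = newest
    incoming = deque(enumerate(leaves))
    released = []
    steps = 0
    while incoming or ptt:
        if incoming:                     # Step 1: allocate a PTT entry at the tail
            wpq_ptr, leaf = incoming.popleft()
            ptt.append(PTTEntry(wpq_ptr, leaf))
        for entry in ptt:                # Step 2: update PendingNode (assumed one step)
            entry.ready = True
        if all(e.ready for e in ptt):    # Step 3: wait until every R bit is set
            for entry in ptt:
                if entry.pending_node == 0:       # Step 5: the root has been updated
                    entry.persisted, entry.valid = True, False
                else:                             # Step 4: move on to the next level
                    entry.pending_node = next_node(entry.pending_node)
                    entry.ready = False
        while ptt and ptt[0].persisted:  # release strictly in order from the head
            released.append(ptt.popleft().wpq_ptr)
        steps += 1
    return steps, released
```

With three persists whose update paths contain four nodes each, for example `run([73, 80, 81])`, the sketch finishes in 6 steps, whereas updating the three paths one after another would take 12 node updates back to back.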
B. Epoch Persistency Model: Out-Of-Order Pipelined BMT Updates
The previous PTT architecture cannot manage BMT updates under the EP model with OOO updates of BMT nodes, because it enforces in-order pipelined updates. What is unique to EP is that there are two persist ordering policies: ordering is enforced across epochs but not within an epoch. Thus, we split the PTT design into two tables: an epoch tracking table (ETT) to track epochs, while relegating the PTT to only track persists. Furthermore, coalescing makes the PTT more sophisticated, as it must be able to calculate and track coalescing points of multiple persists. Figure 7 shows the ETT/PTT split design and also the format of the PTT entries that enable OOO updates and coalescing.
An ETT is a circular buffer maintaining the order of active epochs. An ETT entry has the following fields: EID (epoch

Figure 7 illustrates the tables with an example. There are a total of five persists, with the first three persists from Epoch1, while the fourth and fifth persists are from Epoch2 and Epoch3, respectively. For example, the entry for Epoch1 at the ETT has Start = 0 and End = 2 to indicate that PTT indices 0..2 contain information of the persists of Epoch1. δ 1, δ 2, and δ 3 are within the same epoch, and hence they perform OOO updates on the BMT root. In the example, δ 3 has updated BMT root X1 (hence in the PTT, P = 1 and V = 0), while δ 1 is working on updating BMT root X1 (hence in the PTT, P = 0 and V = 1). Since δ 3 has persisted, its respective entry can be released from the WPQ, assuming all components of the security tuple have been received. δ 2, on the other hand, has not reached BMT level 1 but has finished updating BMT node X21 (hence in the PTT, R = 1). Since Epoch1 is still working on a BMT level 2 node and it is the lowest level that any persist of Epoch1 is working on, in the ETT, Epoch1's Lvl = 2. Epoch2 and Epoch3, consisting of one persist each, are updating different nodes (X33 and X47, respectively) at different BMT levels (level 3 and 4, respectively).
The figure illustrates that we exploit two types of parallelism: epoch-level as well as persist-level parallelism. Within an epoch, we allow updates to occur OOO. Across epochs, we pipeline updates to the BMT in epoch order, using the ETT to track and enforce correctness. The ETT mechanism for pipelining works similarly to the PTT mechanism for pipelining for SP, but with several modifications. First, the ready bit of an epoch is set only when all its persists' ready bits are also set. The Lvl of an epoch is determined as the maximum of the Lvl fields of all the persists of the epoch. With this, the ETT can ensure that each BMT level can only be updated by persists of a single epoch, which avoids cross-epoch WAW hazards. When all persists of an epoch are completed within the level(s) that are recorded, the epoch's R bit is set. When all epochs' R bits are set, the epoch-level scheduler is invoked to advance the epochs to the next levels. If an epoch is at level 1 and it is completed, the entry can then be deallocated.
Scheduling at the PTT is also modified. In SP, persists update the BMT in a pipelined lockstep fashion. With EP, the persist's EID is used to check which level the persist is authorized to update. In the example in the figure, δ 5 cannot advance to level 3 because Epoch3 is only authorized to update level 4 of the BMT. Apart from the epoch-level restriction, each persist can advance to the next level independently of other persists. Hence, assuming the level is authorized, the persist-level scheduler allows a persist to advance to the next level whenever R = 1 for the persist.
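As a rough illustration of the two scheduling levels, the Python sketch below derives an epoch's Lvl and R fields from its persists and applies one possible authorization check. The check itself is our reading of the text rather than a specification: we assume a persist may move one level toward the root only if that target level is numerically greater than the Lvl of the next older epoch, so that no level is shared by two epochs. The function names are hypothetical.

```python
def epoch_lvl(persist_lvls):
    """ETT Lvl field: the largest Lvl value among the epoch's persists."""
    return max(persist_lvls)


def epoch_ready(persist_ready_bits):
    """ETT R bit: set only when every persist of the epoch has R = 1."""
    return all(persist_ready_bits)


def may_advance(persist_lvl, ready, older_epoch_lvl=None):
    """Assumed check: may a persist move from persist_lvl to persist_lvl - 1?

    older_epoch_lvl is the Lvl of the next older active epoch, or None if
    the persist belongs to the oldest epoch.
    """
    target = persist_lvl - 1
    if not ready or target < 1:   # R must be set; level 1 is the root
        return False
    # The target level must not be occupied by an older epoch.
    return older_epoch_lvl is None or target > older_epoch_lvl
```

Applied to the Figure 7 example, `epoch_lvl([1, 2, 1])` gives 2 for Epoch1, and `may_advance(4, True, older_epoch_lvl=3)` is false for δ 5 because Epoch2 still occupies level 3.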
C. Epoch Persistency Model: Coalescing BMT Updates
To coalesce updates within an epoch, we first need to find the common ancestors. We adopt a BMT node labeling scheme based on previous work [16]. A unique label is assigned to each BMT node, starting from 0 for the BMT root. To find the parent of a BMT node, we subtract one from the label of the current node and divide by the arity of the BMT, rounding down, to get the label of its parent. We repeat this process until we reach label 0 to get a list of all its ancestors. The least common ancestor (LCA) of two leaf nodes can be found from the longest prefix match between the two ancestor lists.
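The labeling arithmetic can be written down directly. The Python sketch below is a minimal illustration, assuming the ancestor lists are ordered root first so that the longest common prefix ends at the LCA; the function names and the arity used in the example are hypothetical.

```python
def parent(label: int, arity: int) -> int:
    """Label of the parent node; the root has label 0 and no parent."""
    if label <= 0:
        raise ValueError("the root has no parent")
    return (label - 1) // arity


def ancestors(label: int, arity: int) -> list:
    """Ancestor labels ordered root first, ending at the immediate parent."""
    path = []
    while label != 0:
        label = parent(label, arity)
        path.append(label)
    path.reverse()
    return path


def lca(a: int, b: int, arity: int) -> int:
    """Last label of the longest common prefix of the two ancestor lists."""
    common = 0  # the root is an ancestor of every other node
    for x, y in zip(ancestors(a, arity), ancestors(b, arity)):
        if x != y:
            break
        common = x
    return common
```

For an arity of 8, nodes 73 and 80 are both children of node 9, so `lca(73, 80, 8)` is 9, while node 81 hangs under node 10 and `lca(73, 81, 8)` is 1.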
Next, we need to decide where to coalesce and how to determine which persists are coalesced together. Two persists from the same epoch are likely to share many common BMT nodes, and coalescing can occur at any such node. However, the closer to the leaves the common ancestor node is, the more effective coalescing becomes, as more updates are eliminated. Therefore, an important principle for update coalescing is to coalesce at the LCA whenever possible. The optimal coalescing occurs when the minimum number of updates is achieved. It requires each persist to be compared to every other persist in an epoch, and each pair that has the lowest LCA to be combined. Then, each combined pair is compared against every other BMT node or pair, and recombined, etc. However, this iterative approach is too costly for a hardware implementation. Instead, we opt for paired coalescing, in which we always coalesce the new persist with the previous one if the previous one has not been coalesced with other persists.
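Paired coalescing can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the hardware mechanism: nodes use breadth-first labels with the root at 0, the leading persist of a pair updates only the nodes strictly below the LCA, and the trailing persist updates its whole path including the LCA and the root. The function names are hypothetical.

```python
from itertools import takewhile


def path_to_root(label, arity):
    """Labels on the update path, leaf first and root (label 0) last."""
    path = [label]
    while label != 0:
        label = (label - 1) // arity
        path.append(label)
    return path


def paired_coalescing(leaves, arity):
    """Return the list of node updates for each persist, in arrival order."""
    plans = []
    pending = None  # path of a leading persist still waiting for a partner
    for leaf in leaves:
        path = path_to_root(leaf, arity)
        if pending is None:
            pending = path
            continue
        shared = set(path)
        # The leading persist stops below the first node it shares with the
        # trailing persist (the LCA) and delegates the rest of the path.
        plans.append(list(takewhile(lambda n: n not in shared, pending)))
        plans.append(path)
        pending = None
    if pending is not None:  # an unpaired persist updates its full path
        plans.append(pending)
    return plans
```

With an arity of 8 and leaves 73, 80, and 81, the first two persists are paired at node 9, so the leading persist updates only node 73 and the three persists together perform 9 node updates instead of 12.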
D. Streamlining Counter Tree Updates in Intel SGX
Intel SGX utilizes a "counter tree" to verify memory integrity. Similar to BMT, the counter tree does not cover data because it assumes stateful MAC that protects against spoofing and splicing. The counter tree protects both the integrity and freshness of counters. However, unlike BMT, a counter tree requires the parent counter value to compute the MAC of child counters. As a result, to enable crash recovery, the parent counter value needs to be available and correct in order to compute the correct MAC value. On a store that persists, the tree's entire path from leaf to root nodes must also be persisted, instead of just the tree root.
Therefore, two changes are needed for crash recovery correctness. First, Invariant 1 redefines a memory tuple as consisting of data ciphertext, counter, MAC, and all nodes of the counter tree from leaf to root along the update path. Consequently, Invariant 2 expands to include all nodes in the counter tree update path from leaf to root, in contrast to BMT which only requires the tree root to provide crash recovery. This leads to higher costs than BMT. For example, the number of updates that must persist for one store would scale by the height of the counter tree. Although the optimizations described for BMT can be adapted to SGX, we focus only on BMT due to the extra cost incurred by the counter tree.
VI. EVALUATION
We use the cycle-accurate simulator Gem5 [5] to model the architecture design described in Section V. The configuration of the simulated system is presented in Table II . 
A. Methodology
Similar to previous work [30], [58], [62], we utilize speculative execution for encryption/decryption mechanisms. Discrete BMT, MAC, and counter caches are implemented for all schemes discussed, with the configurations in Table II. To enforce persist ordering, we implemented write-through caches to persist each store to the MC. For pipelined BMT updates, we maintain a PTT with 64 entries. To support OOO BMT updates and coalesced BMT updates, we use a 2-entry ETT (i.e., only allow two concurrent epochs while enforcing the order between them) and a PTT with 64 entries shared by the two epochs. An sfence operation is also emulated to prevent stores from a younger epoch from being persisted to memory before the stores in the older epoch have been persisted. For our coalescing update model, we adopt a simple LCA search mechanism where two adjacent updates to the BMT can be coalesced each time, with the leading store stopping at the LCA and delegating the root update to the trailing store.
We use 15 representative benchmarks from SPEC2006 [21]. As the x86 ISA has a limited number of general purpose registers, it results in significant spills and refills on the stack. Considering that persistent data structures mostly reside in the heap or the static/global region, we propose to not protect the stack region by default. The results where we evaluate full memory protection are labeled with 'full'.
B. Evaluation

1) Strict Persistency:
In this experiment, we analyze the following schemes:
• sp full: Atomic SP for entire memory
• pipeline full: pipelined write update for entire memory
• sp: Atomic SP for non-stack memory
• pipeline: Pipelined SP for non-stack memory

Figure 8 shows the execution time of these schemes normalized to the secure WB scheme. We can make two observations. First, SP incurs very high performance overhead, an average of 7.2×/30.7× for non-stack/full memory protection. The key reason is the high cost of persists. In Table III, we present the number of persists in different schemes. Take the benchmark gamess as an example: it has 52 (non-stack) stores per kilo instructions. As each memory update needs to traverse the BMT from leaf to root and the MAC computation at each BMT level takes 40 cycles, it takes a total of 360 cycles to persist the BMT root in the 9-level BMT. As a result, the BMT update latency dominates the performance for this benchmark and we can estimate its performance in IPC (instructions per cycle) as 1000/(360*52) = 0.053, which is very close to the actual IPC, 0.054, of the SP scheme for gamess. Considering that the IPC of the secure WB model for gamess is 2.45, the slowdown is 45.3× as shown in Figure 8. If we choose to protect the entire memory (101 stores per kilo instructions), the slowdown would further increase to 88.9×. Some benchmarks such as leslie3d and bwaves also have high numbers of persists but show lower slowdowns than gamess. The reason is that the secure WB model for these benchmarks has low IPC due to their high numbers of dirty-block evictions from LLC.
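The back-of-the-envelope estimate for gamess can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the text: it assumes that serialized BMT updates account for all execution time, and the function names are hypothetical.

```python
def estimated_ipc(stores_per_kilo_instr, bmt_levels=9, mac_cycles=40):
    """IPC estimate when BMT root updates are serialized and dominate runtime."""
    cycles_per_persist = bmt_levels * mac_cycles  # 9 levels * 40 cycles = 360
    return 1000 / (cycles_per_persist * stores_per_kilo_instr)


def slowdown(baseline_ipc, scheme_ipc):
    """Slowdown of a scheme relative to a baseline, given both IPC values."""
    return baseline_ipc / scheme_ipc


# gamess, non-stack protection: 52 persisting stores per kilo instructions
print(round(estimated_ipc(52), 3))
# slowdown of SP against secure WB, using the measured IPC values from the text
print(round(slowdown(2.45, 0.054), 1))
```

The first value reproduces the 0.053 estimate; the second uses the measured IPCs (2.45 and 0.054) and lands at roughly 45×, in line with the slowdown reported for gamess.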
Second, by overlapping the MAC computation latency, our proposed pipeline model reduces the performance overhead of SP from 7.2×/30.7× to 2.1×/6.9× for non-stack/full-memory protection. To better understand the impact of MAC latency, in the next experiment, we vary the MAC latency across 0, 20, 40, and 80 cycles. We also simulate idealistic meta-data caches (MDC) to study the impact of these caches. The results are shown in Figure 9. From the figure, we can see that the main performance bottleneck for SP is indeed the MAC computation latency. Even a 20-cycle MAC computation latency leads to a slowdown of 3.2× on average. The MDC has negligible impact in comparison.

Our results show that, when the EP scheme is used to protect non-stack memory, the O3 and coalescing models reduce the performance slowdown to 20.7% and 20.2%, respectively, compared to the secure WB model. The performance improvements mainly come from OOO BMT updates, which enable aggressive overlapping of MAC latency. Furthermore, a large epoch also reduces the number of stores that need to be persisted if they update the same cache line. Such reduction is reported in Table III. On the other hand, if stack memory needs to be protected, the frequent updates lead to higher performance overhead, 2.42× and 2.35× for the O3 and coalescing models, respectively. The results in Figure 10 also show that coalescing has limited impact on performance. The reason is that, in order to coalesce updates, the older update has to wait for the younger one to reach the LCA. Therefore, the saving from coalescing is mainly in the number of updates to the BMT. Our experiments show that our coalescing scheme reduces the BMT updates by 26.1% on average.
Another interesting observation from Figure 10 is that our optimized epoch persistency model can achieve slightly better or equal performance compared to the secure WB model for some benchmarks like milc. The reason is that in the secure WB model, the evicted dirty blocks from LLC perform BMT updates sequentially rather than in the OOO pipelined manner of our optimized model.
3) Epoch size: Figure 11 shows the performance results of varying the epoch size for our optimized epoch persistency model. Besides determining the size of the PTT in our design, the epoch size has interesting performance implications. On one hand, large epochs enable higher reductions in the number of persists, i.e., making better use of WB caches, as shown in Figure 12. On the other hand, large epochs lead to bursty memory updates at the end of each epoch. In contrast, small epochs smooth the write traffic and benefit from eager writeback [28] at the cost of higher numbers of persists. In the extreme case, when the epoch size is 1, our epoch persistency model is essentially the same as the SP model. This performance tradeoff is evident in our results shown in Figure 11. For small epoch sizes less than 16, multiple benchmarks show high performance overhead due to the high number of persists. For the large epoch size of 256, benchmarks such as gamess, milc, and zeusmp exhibit worse performance than with the epoch size of 128. Based on such performance trends, we choose the epoch size of 32 due to its good performance at relatively low hardware cost.
4) Metadata Cache Size:
In this experiment, we vary the metadata cache size from 32KB to 256KB. The metadata caches include a counter cache, a MAC cache, and a BMT cache. Our results show that our persistency models are not sensitive to the metadata cache capacity and there is up to 2% performance difference when we change the cache sizes.
VII. CONCLUSIONS
Memory integrity verification and encryption are essential for implementing secure computing systems. Atomically persisting memory integrity tree roots is responsible for the majority of the overhead incurred by updating security metadata. In this work, we presented three optimizations for atomically persisting NVM Bonsai Merkle Tree roots. With a strict persistency model, our proposed pipelined update mechanisms showed a 3.4× performance improvement compared to sequential updates. Within an epoch persistency model, our out-of-order root update and update coalescing mechanisms showed additional performance improvements of 2.8× over sequential updates. These optimizations significantly reduce the time required to update integrity tree roots and pave the way to make secure NVMM practical.
