OS-based page sharing is a commonly used optimization in modern systems to reduce memory footprint. Unfortunately, such sharing can cause Flush+Reload cache attacks, whereby a spy periodically flushes a cache line of shared data (using the clflush instruction) and reloads it to infer the access patterns of a victim application. Current proposals to mitigate Flush+Reload attacks are impractical as they either disable page sharing, or require application rewrite, or require OS support, or incur ISA changes. Ideally, we want to tolerate attacks without requiring any OS or ISA support and while incurring negligible performance and storage overheads.
INTRODUCTION
Caches alleviate the long latency of main memories by providing data with low latency. Unfortunately, the timing difference between a cache hit and a cache miss can be used as a side channel by an adversary to infer the access pattern and obtain unauthorized information from the system. Such cache attacks have been demonstrated to leak sensitive information like encryption keys [1] and user browsing activity in the cloud [2]. More recent attacks like Spectre [3] and Meltdown [4] use cache-based side channels to convert sensitive data accessed illegally into discernible information. Given their widespread impact, there is a pressing need for efficient mitigation of such cache side channels. At the same time, it is also important to preserve the benefits of the performance optimizations that are often the cause of these side channels.
OS-based page sharing [5, 6] is a commonly used optimization in modern systems to avoid redundant copies of pages across applications (such as for library code) or for supporting multiple copies of the same data pages (by deduplicating data pages). Such sharing allows different programs accessing the same library code to get routed to the same OS page. While OS-based page sharing is useful for memory capacity, such sharing leads to cache side channels between processes, even though the pages are shared only in read-only mode.
In this paper, we study the cross-core Flush+Reload attack [7], a highly effective attack that uses the OS-shared pages between the attacker and the victim. In this attack, the attacker periodically evicts a cache line that is shared with the victim from the cache using the clflush instruction, waits for a predefined interval, and then uses a timing check to infer whether the victim application accessed that line in the interim.
Figure 1 illustrates the Flush+Reload attack for a given line, Line-X, of shared data. At time t0, the state of Line-X is shown (V denoting the valid bit, the tag Tag-X and the data Data-X). Line-X is shared between the spy and the victim. At time t1, the spy program (running on Core-Spy) flushes this line using a clflush instruction. This instruction invokes a Flush-Caused Invalidation (FCI) that invalidates the given line in the cache. The spy program then waits for some time. At a later time t2, the victim program (running on Core-Victim) may access Line-X. This access will miss in the cache, and the cache controller will retrieve the data for Line-X (Data-X) from memory and install it in the cache. At time t3, Core-Spy accesses Line-X and measures the time for the access to determine if Core-Victim accessed the line during the waiting time. The spy can use the patterns of hits and misses to infer secret information of the victim.
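The t0 to t3 sequence described above can be sketched as a toy model of a single shared line. This is only an illustration: the class `SharedLine`, the function `flush_reload_round`, and both latency constants are hypothetical names and values chosen for the sketch, not measurements or code from this paper.

```python
# Toy model of one shared cache line; latencies are illustrative assumptions.
HIT_CYCLES = 40     # assumed L3 hit latency
MISS_CYCLES = 200   # assumed memory access latency


class SharedLine:
    def __init__(self):
        self.valid = True            # t0: Line-X is resident in the cache

    def flush(self):
        self.valid = False           # t1: clflush triggers an FCI

    def access(self):
        """Return the latency of an access, installing the line on a miss."""
        if self.valid:
            return HIT_CYCLES
        self.valid = True            # miss: fetch from memory and install
        return MISS_CYCLES


def flush_reload_round(victim_accesses):
    """Return True if the spy infers that the victim touched the line."""
    line = SharedLine()
    line.flush()                     # t1: spy flushes Line-X
    if victim_accesses:
        line.access()                # t2: victim reloads Line-X
    spy_latency = line.access()      # t3: spy times its own reload
    threshold = (HIT_CYCLES + MISS_CYCLES) // 2
    return spy_latency < threshold   # a fast reload implies a victim access


observed = [flush_reload_round(v) for v in (True, False, True, True)]
print(observed)  # [True, False, True, True]
```

In this unprotected model the spy's inference matches the victim's behavior in every round, which is the leak the rest of the paper sets out to close.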
Prior solutions [8, 9, 10, 11, 12, 13] that are highly effective at guarding the cache against conflict-based attacks (such as Prime+Probe [14]) become ineffective for Flush-based attacks, as the attacker can precisely invalidate a given line from the cache, without relying on any conflicts. Current proposals to mitigate Flush-based cache attacks primarily rely on OS-based or software-based techniques. For example, cache attacks on shared data can be avoided by disabling sharing of critical pages, by dynamically replicating shared pages [15], by rewriting the application using transactional memory semantics [16], or by using performance monitoring counters to detect deviation in the behavior of the application [17, 18]. Ideally, we want to protect against Flush-based attacks while retaining page sharing, without requiring a rewrite of victim applications, and without relying on profile information for the performance monitoring counters.
Figure 1 (caption, partially recovered): The victim program on Core-Victim accesses Line-X at t2. At t3, the spy accesses Line-X and uses a timing test to check for a cache hit, and infers that the victim accessed Line-X. Our solution marks lines invalidated due to Flush-Caused Invalidation (FCI) as "Zombie" lines and monitors hits and misses to Zombie lines to tolerate attacks.
Architectural solutions for mitigating Flush-based cache attacks fall into two categories: Restriction and Duplication. SHARP [19] proposes to restrict the use of clflush in user-mode on read-only pages. However, such a solution requires changing the ISA definition of clflush and is not backwards compatible. NewCache [20] proposes to access the cache line with the Process-ID and the line-address. This creates duplicate copies of the line in the NewCache if two different processes concurrently access the same line, thereby preventing the flush of one process from evicting the line of another. Unfortunately, such line duplication is incompatible with inter-process communication. DAWG [21] also proposes to create a replica of the shared line for each Domain-ID, and the communicating threads (or processes) must be grouped within the same Domain. However, DAWG requires OS support for cache management and for grouping applications into security domains.
The goal of this paper is to develop a hardware-based solution that does not require any OS support, does not require ISA changes, does not restrict inter-process communication, and still mitigates the attack while incurring negligible overheads. Our solution leverages the hardware events that are inherent in such attacks. This paper makes the key observation that when a cache line is invalidated due to Flush-Caused Invalidation (FCI), the tag and data of the invalidated line are still resident in the cache and can be used for detecting Flush-based attacks. For example, in Figure 1 at time t1, the FCI resets the valid bit of Line-X while still leaving Tag-X and Data-X in the cache line. We call such lines, which are invalidated due to an FCI but still contain meaningful Tag and Data information, Zombie lines.
Our solution uses Zombie lines to detect Flush-based attacks and has four parts. (1) Mark the Zombies: To enable monitoring of Zombie lines, we extend the tag entry of the cache line to include a bit (Z-bit), to identify it as a Zombie line (at time t2, Line-X would have Z=1). (2) Protect the Zombies: Conventional replacement algorithms are designed to preferentially pick invalid lines on a cache miss. To protect the Zombie lines, we modify the replacement policy to treat an invalid Zombie line similar to a valid line, and evict the line only when it would have been naturally evicted if it had not received an FCI. (3) Deterministic Victim Selection on Zombie Miss: On a cache miss, if there is a tag hit on the invalid Zombie line, we know that the line was recently invalidated due to an FCI. We detect such cases, and modify the replacement algorithm to always pick the invalid-zombie line as the victim on line install in such scenarios. We retain Z=1 if the data obtained from memory was identical to the data resident in the line, as this denotes a case where the flush of the line was done unnecessarily. (4) Mitigate on Zombie Hit: A hit to a valid zombie line invokes mitigating actions that tolerate the attack by avoiding leakage of timing information.
We propose Zombie-Based Mitigation (ZBM), a simple hardware solution to tolerate cross-core Flush+Reload attacks. ZBM simply treats Zombie hits as cache misses that incur the latency of a memory access, by inserting a dummy memory access, waiting for it to complete, and only then returning the data to the requestor. This eliminates any timing channel for the attacker, as both cache hits and misses incur the same latency once the line is marked as a zombie line.
We analyze the robustness of ZBM using three spy programs (similar to prior works [16, 19]): (1) Attacking AES T-tables, (2) Attacking the Square-and-Multiply algorithm of RSA, and (3) Monitoring a victim's function usage with Function Watcher. We demonstrate that ZBM successfully mitigates these attacks by closing the timing channel of cache line flushes.
ZBM only requires 1-bit per cache line (Z-bit) in the shared L3 cache (private L1/L2 caches are unchanged) and causes no slowdown for applications without flush and reload of identical contents. For niche applications like persistent memory, non-coherent I/O, etc. where frequent flushes are possible, we avoid slowdown by extending our design to ZBMx that also tracks the flush causing core-id (3 bits per cache line) in addition to the Z-bit per line. With ZBM and ZBMx, we are able to avoid slowdown for all typical non-attack scenarios.
Overall this paper makes the following contributions:
1. To the best of our knowledge, this is the first paper to use the inherent state of the cache to detect Flush-based cache attacks. We mark the flushed lines as zombies and check for reloading of identical content on install.
2. We propose a simple hardware mitigation of cross-core Flush+Reload attack by servicing hits to zombie lines as misses. Our solution mitigates the attack and incurs no slowdown for typical benign applications.
3. We show our solutions (ZBM and ZBMx) can be implemented with negligible storage (1-4 bits per cache line), and without any changes to the OS, software, or ISA.
BACKGROUND AND MOTIVATION
Modern computing systems rely on sharing resources such as the last-level cache and main memory across processes to improve efficiency. While such sharing is useful for performance and reducing cost, the sharing of resources can create side channels. We discuss the background on OS-based page sharing, the settings in which such sharing could lead to a cross-core attack, and typical forms of cache attack.
OS-Based Page Sharing
OS-based page sharing reduces memory footprint by removing redundant copies of identical pages. Such sharing is essential for sharing of the text segment of executable files between processes and for using shared libraries. Furthermore, memory deduplication is a popular technique to explicitly identify memory pages containing identical contents in unrelated pages and coalesce these pages into a single unit. Memory deduplication has been implemented in a variety of systems [5, 6] , including VMWare and PowerVM hypervisors, and in Linux and Windows. While read operations are permitted on deduplicated pages, a write operation results in a copy-on-write exception (to replicate the page and map it into the process). While page sharing is useful for effectively utilizing memory capacity and for enabling shared libraries, it can lead to side channels which can be exploited by an attacker, even though the pages are shared in read-only mode.
Attack Model
In this paper, we focus on cross-core cache attacks, where the victim and the spy are executing on separate cores of a multi-core processor. This is a safe assumption in a cloud computing environment, where non-trusting applications are not concurrently scheduled on the same core. Figure 2 shows our system configuration, which contains a multi-core processor with private L1 and L2 caches, and a shared L3 cache. The L3 cache is inclusive and evictions from the L3 cache cause evictions from L1 and L2 (if the line is present). As the L3 cache is the point at which resources are shared, the adversary tries to orchestrate evictions in the L3 cache and monitor the hits in the L3 cache to observe the behavior of the victim program. Such cache attacks have been used to infer secrets such as the keys for AES [1].
2.3 Example: RSA Square-And-Multiply
Figure 2 shows the code for the square-and-multiply algorithm used in the RSA implementation of GnuPG version 1.4.13 that is vulnerable to cache attacks (recent versions have moved towards secure implementations). The algorithm computes $b^e \bmod m$, i.e. "b raised to the power of e, modulo m". The algorithm iterates from the most-significant bit of e (the secret key) to the least-significant bit, always performing a square operation (top arrow) and performing the multiply operation (bottom arrow) only if the bit is "1". By observing the access pattern for lines containing the square (sqr) and multiply (mul) functions, the spy can infer the bits of the secret (e): sqr followed by sqr is a 0, whereas sqr followed by mul is a 1. The lines containing the instructions for the sqr and the mul functions are called probe addresses.
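As a minimal sketch of why the sqr/mul access pattern reveals the exponent, the following models left-to-right square-and-multiply and records which routine runs. It is a simplification and not the GnuPG code: the function names `square_and_multiply` and `recover_exponent` are hypothetical, and the trace stands in for what a spy would observe at the probe addresses under ideal, noise-free conditions.

```python
def square_and_multiply(b, e, m):
    """Left-to-right binary exponentiation; returns (b**e % m, op trace)."""
    result = 1
    trace = []
    for bit in bin(e)[2:]:                 # most-significant bit first
        result = (result * result) % m     # always square
        trace.append("sqr")
        if bit == "1":
            result = (result * b) % m      # multiply only for a 1 bit
            trace.append("mul")
    return result, trace


def recover_exponent(trace):
    """Rebuild e from the trace: sqr then mul is a 1, sqr alone is a 0."""
    bits = []
    for i, op in enumerate(trace):
        if op == "sqr":
            followed_by_mul = i + 1 < len(trace) and trace[i + 1] == "mul"
            bits.append("1" if followed_by_mul else "0")
    return int("".join(bits), 2)


secret = 0b101101
value, trace = square_and_multiply(7, secret, 1009)
print(value == pow(7, secret, 1009))       # True
print(recover_exponent(trace) == secret)   # True
```

The trace alone is enough to reconstruct the exponent, without ever seeing the data values being computed.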
The spy can infer the access pattern of the victim by causing an eviction of the probe address, waiting, and then testing if the probe address was accessed by the victim (by checking if a hit is obtained for the probe address). Depending on how the evictions are performed, cache attacks can be classified as either (a) conflict-based attacks or (b) flush-based attacks.
Conflict-Based Attack and Mitigation
In conflict-based attacks [14] , the attacker fills a cache set with its own lines and causes a conflict miss on one of the victim lines. The attacker uses a timing test to see if any of the installed lines encounter a miss -if so, the attacker can infer that the victim accessed the particular set. Fortunately, conflict-based cache attacks can be efficiently mitigated by using cache-space preservation [8, 9, 21, 22] or by randomizing the location of the line in the cache [8, 12, 13] . Without loss of generality, in our study, we assume the cache is protected against conflict-based attacks using randomization [13] and we focus on only flush-based cache attacks.
Flush-Based Cache Attack
Cache attacks do not always have to use load/store instructions to cause cache evictions. They can use an instruction called Cache Line Flush (clflush), which explicitly invalidates the cache line from all levels of the processor cache hierarchy, including the instruction and data caches [23]. The clflush instruction is conventionally provided by the system to support non-coherent IO, whereby the IO device can write directly to memory and use clflush instructions to ensure that the process can read the up-to-date copy from memory instead of the stale copy from the cache [23].
Flush-based attacks target accesses to the memory locations that are shared between the attacker and the victim. In a Flush+Reload attack, the spy invalidates the shared line using the clflush instruction, waits, and then checks the timing of a later access to that line -if the access incurs shorter latency, then the attacker can infer that the victim application has accessed the given line. While Flush-based attacks are restricted to only shared pages, they are more powerful than conflict-based attacks, because the spy can learn about the exact line being used instead of just the particular cache set.
While users can ensure that data pages containing sensitive data are not explicitly shared with an untrusted application, shared libraries can end up being implicitly shared by the OS. Thus, using library code (like RSA or AES for encryption and decryption) can allow an attacker to monitor the access pattern to different functions and cryptographic tables within those functions, to infer secret information such as cryptographic keys. For example, for the square-and-multiply routine shown in Figure 2, the adversary can flush the lines corresponding to the square and multiply functions and learn the access pattern of a victim. We want to develop efficient solutions for tolerating cross-core Flush+Reload attacks.
Prior Solutions for Flush+Reload Attack
Current proposals to mitigate Flush+Reload attacks primarily rely on OS-based or software-based techniques. For example, cache attacks on shared data can be avoided by extending the OS to disable sharing of critical pages. Zhou et al. [15] proposed a scheme to dynamically replicate shared pages if multiple processes are concurrently accessing such pages, thus giving up on the capacity benefits of page sharing. Gruss et al. [16] propose to rewrite safety-critical software using transactional memory semantics, so that a transaction whose shared locations are concurrently accessed by other processes aborts, avoiding leakage of timing information to concurrently running applications. Prior studies [17, 18] have also suggested using hardware performance counters to observe deviation in the behavior of applications to detect attacks, assuming profile information of the applications (during an attack-free scenario) is available. Unfortunately, such profile information may not be available for all applications that use shared pages.
Architectural solutions for mitigating flush-based attacks fall in two categories. First, redefining the usage of clflush. For example, SHARP [19] proposes to restrict clflush in user-mode from flushing read-only pages. Such a solution requires changing the ISA definition of clflush and is not backwards compatible. Second, creating in-cache duplicates of the shared line, so a flush triggered by one process cannot dislodge the line brought in by another process. For example, both NewCache [20] and DAWG [21] use in-cache line duplication to mitigate flush-based attack. Unfortunately, such a solution of line duplication is either incompatible with inter-process communication (NewCache) or may require careful placement by the OS to designate the communicating processes within the same security domain (DAWG).
Goal of Our Paper
The goal of this paper is to develop a practical solution to mitigate cross-core Flush+Reload attacks. For a solution to be useful, it is important that it not only provides strong protection against attacks but also has (1) Negligible performance overheads when the system is not under attack, (2) Negligible hardware overhead, (3) No restriction on capacity benefits of page sharing, (4) No requirement of rewriting the software, (5) No changes to the OS or the ISA, and (6) No limitation on inter-process communication. Our paper develops a practical solution to tolerate attacks by leveraging the hardware properties that are inherent in flush-based attacks. We describe our solution next.
MITIGATING ATTACKS VIA ZOMBIES
Our objective is to differentiate an "attack" from a benign use-case of clflush like non-coherent IO. We observe that for the intended use of clflush, the data value resident in memory is expected to change between the flush and subsequent read to the line. For an attack however, a given line is repeatedly flushed, and the data read from memory is unchanged, compared to the data present in the line when it was invalidated due to the flush. If we have the line address and data of the lines that were invalidated recently due to a flush, we can use it to detect Flush+Reload attacks.
We leverage the key insight that when a cache line gets invalidated due to a Flush-Caused Invalidation (FCI), the valid bit of the cache line is reset; but the Tag and Data of the invalidated line continue to be present in the cache in the same location. In fact, we can use them for comparison with the Tag and Data of lines installed on subsequent cache misses, to efficiently detect and tolerate Flush+Reload attack.
Determining a Zombie Line
We call a line invalidated due to Flush-Caused Invalidation a Zombie line, as it still retains its Tag and Data in the cache.
(Note: Invalidations due to other reasons such as coherence, system restart, etc. are not deemed Zombies). To track the Zombie status, we add a Z-bit (Zombie bit) to the tag entry of each line. The Z-bit for a line is set to 1, only when it is invalidated due to an FCI. Once set, the Z-bit is reset to 0, only when either the Tag or the Data for the line are updated.
Basic Operations for Zombie Scheme
Our Zombie-based design contains four steps: (1) Mark the Zombie, (2) Protect the Zombie, (3) Actions on Zombie-Miss, and (4) Actions on Zombie-Hit. We explain these steps with an example, as shown in Figure 3, which is identical to Figure 1, except that Line-X now also contains the Z-bit.
At time t0, the line is resident in the cache with tag as Tag-X and data as Data-X. The valid bit (V) is 1 and the line is not a zombie (Z=0). At time t1, the line receives a Flush-Caused Invalidation (FCI) from the spy.
Step-1. Mark the Zombie on FCI: We assume the cache is equipped with signals to identify an invalidation due to clflush. On receiving an FCI, the valid bit for the line is reset (V=0) and the line is marked as a zombie line (Z=1). If the line was dirty, the contents are written back to memory (the zombie status is independent of whether the invalidated line was clean or dirty). After the FCI, the line continues to retain Tag-X and Data-X.
Step-2. Protect the Zombie Until Natural Eviction: Replacement policies preferentially pick invalid lines as replacement victims on a cache miss. So, an invalid zombie line can be quickly dislodged if there is another miss to the same set. To avoid this, we ensure a zombie line resides in the cache until it would have been naturally evicted, in the absence of an FCI. We achieve this by modifying the replacement algorithm to not pick an invalid zombie line as a victim, unless it would have been picked anyway based on its recency/reuse status. For example, for the LRU replacement policy, the invalid zombie line is not victimized until it becomes the LRU line. All through the residency of the invalid-zombie line in the cache, the line continues to get replacement status updates similar to valid lines (so, for LRU replacement, the invalid zombie line would traverse all the way from MRU to LRU, and only then get evicted). Protecting invalid zombie lines does not slow down the system, as flushes are typically performed only for a few (tens of) lines, whereas our L3 cache with randomized indexing [13] has tens of thousands of locations to which new line accesses and installs may potentially map.
Figure 3: Detecting and Mitigating Flush+Reload attack using Zombie. A line that is invalidated due to clflush is marked as a zombie (Z=1). A cache miss that has a matching tag for a zombie line is termed a Zombie-Miss: the incoming line is installed in the Zombie line and Z=1 is maintained if the incoming data matches the resident data. A cache hit for a line with Z=1 is termed a Zombie-Hit. Our solution mitigates the attack by servicing the Zombie-Hit with the latency of a miss.
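The victim-selection rule of Step-2 can be sketched for an LRU-managed set as follows. This is an illustrative software model under stated assumptions, not the hardware design: the `Way` record and `pick_victim` function are hypothetical names, and the set is represented as a list ordered from MRU to LRU.

```python
from dataclasses import dataclass


@dataclass
class Way:
    tag: str
    valid: bool = True
    zombie: bool = False


def pick_victim(ways_mru_to_lru):
    """Return the index of the way to replace in an LRU-ordered set.

    An ordinary invalid line (V=0, Z=0) is still the preferred victim.
    An invalid zombie line (V=0, Z=1) is treated like a valid line, so it
    is chosen only once it reaches the LRU position.
    """
    for i, way in enumerate(ways_mru_to_lru):
        if not way.valid and not way.zombie:
            return i
    return len(ways_mru_to_lru) - 1


cache_set = [Way("A"), Way("X", valid=False, zombie=True), Way("B"), Way("C")]
print(pick_victim(cache_set))  # 3: line C is evicted, the zombie X survives
```

A conventional policy would have picked the invalid line X at index 1; here X stays resident until it ages to the LRU position.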
Step-3. Act on Zombie-Miss (tag hit on invalid-zombie): Conventional cache designs skip the tag match for invalid lines while probing a set on a cache access. Instead, we perform the tag match even for invalid zombie lines, to identify a cache miss where an invalid zombie line with a matching tag was present in the cache set. We deem such a miss to be a Zombie-Miss. In the event of such a Zombie-Miss, data is retrieved from memory similar to a normal cache miss. In addition, the incoming cache line is deterministically placed in the cache way where the Zombie line was located, before the valid bit is set. If the data retrieved from memory is different from the data originally in the Zombie line, it indicates a legitimate use of clflush (e.g. an asynchronous IO modified the memory contents between the flush and the access), so we reset the zombie status (Z=0) of the line. However, if the data retrieved from memory is identical to the data originally in the Zombie line, then the Zombie status is retained (Z=1), as the flush was unnecessary and the line could be under attack. This scenario is shown at time t2 in Figure 3.
Step-4. Act on Zombie-Hit (tag hit on valid-zombie): On a cache hit, our design checks the Z-bit of the line in parallel with the tag-match. If a cache hit occurs on a valid-zombie line, we deem it a Zombie-Hit. In a conventional non-secure design, an attacker could use such hits on previously flushed lines, and the corresponding short access time, to infer a victim access to the line. Our design can detect such Zombie-Hits and invoke mitigative actions to close the timing leak, as shown at time t3 in Figure 3.
Table 1 summarizes the cache operations using the zombie-bit in our design. Note that we restrict the zombie-bit and associated operations only to the L3 cache, as it is the shared cache in our system, and the one exploited in the cross-core attack.
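The four steps can be put together in a small state-machine sketch for a single cache way. It is a simplified software model written for illustration: the names `Line`, `on_fci`, and `on_access` are hypothetical, replacement (Step-2) is left out, and memory contents are passed in as an argument rather than modeled.

```python
from dataclasses import dataclass


@dataclass
class Line:
    tag: int
    data: bytes
    valid: bool = True
    zombie: bool = False


def on_fci(line):
    """Step-1: a flush-caused invalidation clears V and sets Z."""
    line.valid = False
    line.zombie = True        # tag and data stay in place


def on_access(line, tag, memory_data):
    """Classify an access to this way; memory_data is memory's copy of tag."""
    if line.tag != tag:
        return "miss"                         # other tag: normal replacement
    if line.valid:
        return "zombie-hit" if line.zombie else "hit"     # Step-4
    if line.zombie:
        # Step-3: install into the same way; keep Z only if data is unchanged.
        line.zombie = (memory_data == line.data)
        line.data = memory_data
        line.valid = True
        return "zombie-miss"
    return "miss"                             # ordinary invalid line


attacked = Line(tag=0x1, data=b"Data-X")
on_fci(attacked)                                     # t1: spy flushes
print(on_access(attacked, 0x1, b"Data-X"))           # zombie-miss (t2)
print(on_access(attacked, 0x1, b"Data-X"))           # zombie-hit  (t3)
```

If memory had changed between the flush and the reload, as in the non-coherent IO case, the first access would still be a Zombie-Miss but would clear Z, and the next access would be an ordinary hit.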
Zombie-Based Mitigation (ZBM)
Given that Zombie-Misses and Zombie-Hits are intrinsic to the Flush+Reload attack, tracking their episodes can help detect an attack. For example, hardware counters can be maintained that increment on Zombie-Misses and Zombie-Hits, and if the count exceeds a certain threshold, a potential attack can be flagged to the OS to activate mitigating actions. However, we ideally want a solution that can mitigate the attack transparently in hardware, without requiring any OS support or threshold-based decisions.
Our insight is that the timing difference between a Zombie-Hit and a Zombie-Miss is vital for a successful Flush+Reload attack, so eliminating it can completely mitigate the attack. This is because the attack uses timing to guess the outcome of the spy-access following the flush to leak information. A fast access (Zombie-Hit) implies the victim accessed the line between the flush and spy-access, whereas a slow access (Zombie-Miss) implies the victim did not access the line. To eliminate the timing difference between Zombie-Hits and Zombie-Misses and prevent any information leakage, we propose a simple hardware-based scheme, Zombie-Based Mitigation (ZBM), that delays Zombie-Hits and makes them incur the same latency as a Zombie-Miss.
On Zombie-Hits, ZBM adds delay by triggering an extraneous memory access for the same line, and returning the cached data to the processor only after this dummy request completes (the data obtained from the memory is ignored and not installed in the cache). As this exactly emulates the operations (memory access) on a Zombie-Miss, it incurs a similar long latency. Thus, during the Flush+Reload attack, the spy-access in the reload phase leaks no information: the access incurs the high latency of a cache miss and memory access, regardless of whether the line was accessed by the victim (Zombie-Hit) or not (Zombie-Miss). Note that a Zombie-Hit is recorded as a Cache-Miss by the cache performance counters, so no information is leaked even through program statistics. Figure 4 shows the time-line for Zombie-Miss/Hit in our design, illustrating the lack of a timing difference. The episode of flushing a line, retrieving it with identical content from memory and then accessing it again from the L3 cache and not the L1/L2 cache, and doing so repeatedly, does not occur in typical applications. So, the higher latency (and additional energy overhead) due to the extra memory accesses on Zombie-Hits does not impact benign applications.
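The effect of ZBM on the spy's reload can be illustrated with a small latency model. The cycle counts and the function `reload_latency` are assumptions made for this sketch; the model also treats the Zombie-Hit and Zombie-Miss latencies as exactly equal, whereas the text only claims they are similar.

```python
HIT_CYCLES = 40      # assumed L3 hit latency
MISS_CYCLES = 200    # assumed latency of a miss serviced from memory


def reload_latency(victim_accessed, zbm_enabled):
    """Latency seen by the spy's reload after it flushed the line.

    After the flush the line is an invalid zombie. If the victim reloaded
    identical data in the interim, it is now a valid zombie (V=1, Z=1).
    """
    if not victim_accessed:
        return MISS_CYCLES       # Zombie-Miss: real memory access
    if zbm_enabled:
        return MISS_CYCLES       # Zombie-Hit: wait for the dummy memory access
    return HIT_CYCLES            # baseline: fast hit leaks the victim access


baseline = {v: reload_latency(v, zbm_enabled=False) for v in (True, False)}
with_zbm = {v: reload_latency(v, zbm_enabled=True) for v in (True, False)}
print(baseline)   # {True: 40, False: 200}
print(with_zbm)   # {True: 200, False: 200}
```

In the baseline the two cases are distinguishable by a simple threshold, while with ZBM the spy observes the same latency in both.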
SECURITY ANALYSIS OF ZBM
We analyze the effectiveness of ZBM at defending against Flush+Reload attack using three representative attacks (adapted from prior works [16, 19] ): (1) Attacking the AES T-tables, (2) Snooping a victim's control flow during secret-dependent function execution, and (3) Attacking the Square-And-Multiply algorithm of RSA. We evaluate the attack on a baseline system without any mitigation and on a system with ZBM. Our system contains 8 cores sharing a 16MB L3 cache (system configuration is shown in Table 2 ). Despite using a noise-free setting that allows high attack fidelity, we show that ZBM can successfully mitigate all of these attacks.
Attacking AES T-Table Implementation
Commonly used crypto libraries like OpenSSL and GnuPG implement AES with T-tables, which have been shown to be vulnerable to cache attacks that leak the secret key [1, 16, 25]. In such an implementation, in each of the 10 AES rounds, a total of 16 accesses are made to T-tables (lookup tables) spread over 16 cachelines. The table indices accessed in each round depend on the input plaintext p and a secret key k, with the first-round indices being $x_i = p_i \oplus k_i$, where i is the byte number in the plaintext or the key.
We perform a chosen-plaintext attack on the first round. In such an attack, the spy supplies a series of plaintext blocks to be encrypted by the victim, keeping byte $p_0$ constant and randomly varying the other $p_i$ for $i \neq 0$. This ensures that a fixed index $x_0$ always receives a hit in the first round, and has a higher number of hits on average compared to other entries in the T-table. By identifying which T-table cacheline received the maximum hits using a Flush+Reload attack over a large number of encryptions, the spy can decipher 4 bits out of the 8 bits of the index $x_0$ (those that determine the cacheline number) and thus extract 4 out of 8 bits of $k_0$ (as $p_0$ is known). By sequentially changing which plaintext byte is constant, the spy can recover 64 out of 128 bits of the secret key. In the baseline, as $p_0$ (and $x_0 = p_0 \oplus 0$) increases from 0 to 255, the cacheline corresponding to $x_0$ (the table index with the maximum hits) discernibly changes from 0 to 15 as seen in Figure 5, with each cacheline storing 16 entries. Compared to the average of 1368 hits/cacheline, the maximum observed is 1892 hits/cacheline. With ZBM, the spy does not see any pattern because ZBM always provides data with the latency of a memory access on both Zombie-Hits and Zombie-Misses (see footnote 2).
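The arithmetic behind "4 out of 8 bits" can be checked with a short sketch. It assumes the standard first-round relation in which the table index is the XOR of the plaintext byte and the key byte, and uses the 16 entries per cacheline stated in the text; the function names and the example key byte are hypothetical.

```python
ENTRIES_PER_LINE = 16    # from the text: each cacheline stores 16 entries


def first_round_line(p0, k0):
    """Cacheline of the T-table touched by byte 0 in the first round."""
    x0 = p0 ^ k0                     # assumed relation: index = plaintext XOR key
    return x0 // ENTRIES_PER_LINE    # upper 4 bits select one of 16 lines


def recover_upper_nibble(p0, hot_line):
    """Upper 4 bits of k0, given the known p0 and the most-hit cacheline."""
    return hot_line ^ (p0 >> 4)


k0 = 0xA7                            # example secret key byte
recovered = {recover_upper_nibble(p0, first_round_line(p0, k0))
             for p0 in range(256)}
print(recovered)                     # {10}: every p0 yields the nibble 0xA
print(16 * 4)                        # 64 key bits over all 16 key bytes
```

The lower 4 bits of each key byte only select an entry within a cacheline, so they stay hidden from a line-granularity attack, which is why this first-round attack stops at 64 of the 128 key bits.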
Function Watcher -Leak Execution Path
Just as a spy can snoop the data-access pattern of the victim, it can also snoop a victim's code-access patterns to leak information. Recently, such attacks were demonstrated on PDF processing applications [2] where a spy snoops the functions executed by the program to identify the rendered PDF. On these lines, we model a Function Watcher attack similar to prior work [16] , where a victim executes a call to one out of four functions based on a secret value (each function has 5000+ instructions with varying control-flow). The spy monitors the addresses corresponding to the entry points of these functions by repeatedly executing a cross-core Flush+Reload attack. On each reload following a flush, the spy observes which address received a cache-hit (lowest latency), to infer the function being executed by the victim.
Footnote 2: In a few rare cases, we see a single hit for the spy across all the 10,000 block encryptions. This happens when a flush from the spy occurs before the victim had a chance to access the table even a single time. So on the flush, the Z-bit is not set, as the cache does not even contain the line. A subsequent victim access followed by the spy access causes the cache hit for the spy. In the AES attack, all sixteen lines of the tables are accessed thousands of times, with varying probabilities. Knowing that one line was accessed once by the victim at the start is not meaningful information for the attacker.
Figure 6 (caption, partially recovered): Here, a SQR-Hit followed by a MUL-Hit leaks bit-value "1" in the key, whereas two SQR-Hits one after another denote a "0".
The heat-map in Figure 7 shows the percentage of attempts where the spy correctly (diagonal-entries) or incorrectly (nondiagonal entries) infers the function executed by the victim, over 10K function calls. In the baseline (Figure 7(a) ), the spy correctly infers the function ID of the victim (the secret) in most of the attempts (91% -96%).
In ZBM (Figure 7(b) ), the attacker successfully infers the correct function ID in only 23% -27% of the attempts, which is as likely as other incorrect function IDs (19% -31%). Thus, the attack with ZBM is no better than a random guess. This is because ZBM ensures that on a spy-reload, both Zombie-Hits (function entry-point accessed by victim) and Zombie-Misses (unaccessed addresses) have similar high latency, thus leaking no information to the spy. 
RSA Square-And-Multiply Algorithm
To evaluate the effectiveness of ZBM, we analyze an attack on the Square-And-Multiply algorithm in RSA described in Section 2.3, similar to prior work [19]. In this attack, the spy monitors the victim's accesses to the entry points of the square function (SQR) and the multiply function (MUL) to extract the secret RSA key. Figure 6 shows the time at which spy-hits for SQR and MUL are encountered, in a time window of 10K-90K cycles, for both the baseline (left) and ZBM (right).
In our analysis, we use an RSA key of 3072 bits where every 8th bit is a 1 (e.g., the key is 0000000100000001...). For the attack in the baseline, a hit to SQR followed by a hit to MUL indicates a "1" in the key, whereas a hit to SQR followed by another hit to SQR denotes a "0". So we expect the spy to see eight hits to SQR followed by one hit to MUL. In Figure 6(a), the spy can observe this exact pattern after filtering noise (e.g., a spurious double hit for MUL at time 73K cycles is ignored, knowing the behavior of RSA). With ZBM, the spy always gets high latency on a reload for MUL and SQR, so the spy does not get any hits, as shown in Figure 6(b). Thus, ZBM successfully mitigates the attack.
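The decoding rule used by the spy in the baseline can be illustrated with a minimal sketch. The function name `decode_key_bits` and the noise-handling choices (skipping a MUL that does not follow a SQR, and reading a trailing SQR as "0") are assumptions made for illustration, not part of the evaluated attack code.

```python
def decode_key_bits(hits):
    """Decode key bits from an ordered list of spy hits ("SQR" or "MUL").

    Rule from the text: SQR followed by MUL -> 1, SQR followed by SQR -> 0.
    Assumption: a MUL that does not directly follow a SQR is noise and is skipped.
    Assumption: a trailing SQR with no following hit is read as 0.
    """
    bits = []
    i = 0
    while i < len(hits):
        if hits[i] != "SQR":
            # Spurious MUL (e.g. a double hit), ignored as noise.
            i += 1
            continue
        if i + 1 < len(hits) and hits[i + 1] == "MUL":
            bits.append(1)
            i += 2
        else:
            bits.append(0)
            i += 1
    return bits


# One period of the key pattern: eight SQR hits followed by one MUL hit.
trace = ["SQR"] * 8 + ["MUL"]
bits = decode_key_bits(trace)
```

Under these assumptions, one period of the trace decodes to seven zeros followed by a one, which matches the key pattern with every 8th bit set.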
PERFORMANCE EVALUATION

Configuration
For our performance evaluations, we use a Pin-based x86 simulator. Table 2 shows the system configuration used in our study. Our system contains 8 cores, each of which has a private instruction and data cache and a unified 256KB L2 cache. The L3 cache is 16MB and is shared between all the cores. All caches use a line size of 64 bytes. We measure the aggregate performance of our 8-core system using the weighted speedup metric. We evaluate performance under two scenarios: (1) when the system is not under attack, executing only benign applications (common case), and (2) when the system is under attack (uncommon case).
For benign applications, we use a representative slice of 250 million instructions of each of the 29 benchmarks in the SPEC2006 suite. The benign applications do not have a "Flush+Reload" pattern. We create a benign workload by randomly pairing 8 benign applications, and have 100 such mixed workloads. For attacks, we use 3 pairs of victim-spy (AES, Function Watcher and RSA) and combine each victim-spy pair with six benign applications to create an 8-core workload.
Impact on Aggregate System Performance
ZBM inflicts higher latency for a Zombie-Hit. For such a scenario to occur, the line must be flushed, reloaded with identical content soon (before the line gets evicted from the cache), and then accessed again while the line is in the L3 cache. Such access patterns are extremely unlikely to occur in normal (benign) applications; therefore, the impact of ZBM on system performance for benign workloads is expected to be negligible. Figure 8 shows the slowdown caused by ZBM for 103 workloads (100 benign + 3 under attack). To calculate slowdown, we compute the ratio of the weighted speedup of the baseline to the weighted speedup of ZBM. For all 100 workloads that do not have an attack, ZBM has no slowdown, whereas there is a slowdown (up to 2.2%) for the three attack workloads containing the victim-spy pairs.
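The slowdown metric can be sketched as follows. The text does not spell out the weighted speedup formula, so the sketch assumes the commonly used definition (the sum over cores of shared-mode IPC divided by stand-alone IPC). The function names and the IPC values are hypothetical.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Assumed definition: sum over cores of IPC_shared / IPC_alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def slowdown(ws_baseline, ws_zbm):
    """Ratio of baseline weighted speedup to ZBM weighted speedup (1.0 = no slowdown)."""
    return ws_baseline / ws_zbm


# Hypothetical 2-core example: ZBM slows only the second application.
ipc_alone = [2.0, 1.0]
ws_base = weighted_speedup([1.6, 0.8], ipc_alone)
ws_zbm = weighted_speedup([1.6, 0.7], ipc_alone)
ratio = slowdown(ws_base, ws_zbm)
```

A ratio of exactly 1.0 corresponds to the "no slowdown" case reported for the benign workloads.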
Slowdown for Applications Under Attack
We analyze the performance of the individual applications for the three attack workloads that experience a slowdown. Figure 9 shows the slowdown for each of the eight applications in the three workloads. Each workload contains a spy, a victim, and six benign bystanders. For AES, ZBM causes 16.4% slowdown for the victim and 5.3% for the spy. For Function Watcher, the slowdown is 1.2% for the victim and 2.1% for the spy. For RSA, the slowdown is 2.6% for the victim and 10.9% for the spy. Note that these slowdowns occur only under attack (when marginal slowdown is acceptable to protect against the attack). The bystander applications (marked as Benign 1-6 in Figure 9), which are neither a victim nor a spy, do not see a performance impact.
Storage and Logic Overheads
ZBM provides strong protection against attack while causing no performance overhead when the system is not under attack. To implement ZBM, the only storage overhead incurred is the per-line Z-bit. Therefore, to implement ZBM on our 16MB LLC, we would need a total storage overhead of 32 kilobytes, which is less than 0.2% of the LLC capacity.
ZBM requires that the incoming data from memory be compared to the data resident in the cache, if the Z-bit associated with the victim line is set. To implement a 512-bit comparator, we need 512 XNOR gates and 511 AND gates, for a total of approximately 1K gates. This logic overhead is negligibly small (for reference, computing ECC-1 for a 64-byte line incurs an overhead of approximately 3500 gates).
ZBM-X FOR FLUSHING APPLICATIONS
Our performance evaluations focus on workloads that are attacker-victim pairs (demonstrating the security of ZBM). The typical benign workloads we evaluate do not contain patterns of Flush+Reload of identical data, and hence have no slowdown with ZBM. However, there could be other niche applications which use flushes frequently. For example, lines evicted with clflush in expectation of a non-coherent IO, where no IO occurs, could be read subsequently as Zombie-Hits with ZBM. Other unintended usage of clflush can occur in persistent memory applications, which use it to evict and write dirty cache lines to memory, even though more suitable instructions like clwb (cache line writeback) exist.
Even with frequent flushes, ZBM's slowdown is expected to be minimal, as several conditions need to be true at the same time to suffer slowdown: (a) the application has to flush lines, (b) it then has to reload identical data, and (c) the data has to be larger than the private L2-cache (256 KB), so that subsequent loads to it are all serviced from the L3-cache. This is uncommon, as subsequent loads are likely to be serviced, in most scenarios, from the L1 or L2 cache without overheads (the Z-bit is restricted to the L3 cache). Nonetheless, to understand the slowdown of ZBM for flushing applications, we develop an analytical model. Then, we design a simple extension of ZBM to avoid slowdown even in the uncommon scenario of frequent flushes.
Analytical Model -L3Lat and Slowdown
Slowdown with ZBM is due to Zombie-Hits being serviced with the latency of an L3-cache miss instead of an L3-cache hit, to prevent leaking information. So, we first develop a model to estimate the average latency for an L3 access (L3Lat) with ZBM and then model its impact on execution time.
Modelling L3 Access Latency (L3Lat)
Let α be the miss-rate of the baseline L3-cache. Let t_c be the number of cycles for an L3-cache hit and t_m be the number of cycles to service a memory access. Then L3Lat for the baseline system (L3Lat_base) is given by Equation 1.
Let F be the fraction of memory accesses that occurred due to a flush of the cache line (the line was present in the cache but got invalidated due to clflush). Let R be the probability that the reloaded content is identical to the content of the line at the time of the flush. Thus, F · R represents the probability of a flush and reload of identical content. If all memory accesses were flush-reloads of identical contents (i.e., F · R = 1), then we expect the L3 miss-rate in ZBM to increase to 100%.
In general, for any given value of F and R, the miss-rate of ZBM (α_ZBM) is given by Equation 2.
Using Equations 1 and 2, the L3Lat for the ZBM system (L3Lat_ZBM) is given by Equation 3.
Dividing Equation 3 by Equation 1, we get the normalized value (L3Lat_norm), as shown in Equation 4 and in Figure 10.
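The equations referenced in this subsection are not reproduced in this text. One reconstruction that is consistent with the definitions above is sketched below. It assumes that an L3 miss pays the L3 lookup latency before the memory latency, and that the ZBM miss-rate must equal α when F · R = 0 and reach 100% when F · R = 1. The exact form used in the original equations may differ.

```latex
\begin{align}
\mathit{L3Lat}_{base} &= t_c + \alpha \cdot t_m \tag{1}\\
\alpha_{ZBM} &= \alpha + (1-\alpha)\cdot F \cdot R \tag{2}\\
\mathit{L3Lat}_{ZBM} &= t_c + \alpha_{ZBM}\cdot t_m
  = t_c + \bigl(\alpha + (1-\alpha)\cdot F \cdot R\bigr)\cdot t_m \tag{3}\\
\mathit{L3Lat}_{norm} &= \frac{\mathit{L3Lat}_{ZBM}}{\mathit{L3Lat}_{base}}
  = 1 + \frac{(1-\alpha)\cdot F \cdot R \cdot t_m}{t_c + \alpha \cdot t_m} \tag{4}
\end{align}
```

In this form, the normalized latency is the ZBM latency divided by the baseline latency, so it equals 1 when no accesses are flush-reloads of identical data.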
Modelling Overall System Slowdown
Execution time of the baseline system (T_base) can be split into two components, the time required with a perfect L2-cache (T_perfL2) and the time spent accessing L3 and memory (T_L3mem), as shown in Equation 5.
For ZBM, T_L3mem increases in proportion to L3Lat. Therefore, the execution time of the application with ZBM (T_ZBM) can be written as Equation 6.
The slowdown of ZBM can be obtained by dividing Equation 6 by Equation 5, as shown in Equation 7.
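The three equations referenced above are not reproduced in this text. The following reconstruction follows directly from the description, with the normalized L3 latency (the ratio of ZBM to baseline L3 access latency) scaling only the L3/memory component of execution time.

```latex
\begin{align}
T_{base} &= T_{perfL2} + T_{L3mem} \tag{5}\\
T_{ZBM} &= T_{perfL2} + T_{L3mem}\cdot \mathit{L3Lat}_{norm} \tag{6}\\
\mathit{Slowdown} &= \frac{T_{ZBM}}{T_{base}}
  = \frac{T_{perfL2} + T_{L3mem}\cdot \mathit{L3Lat}_{norm}}{T_{perfL2} + T_{L3mem}} \tag{7}
\end{align}
```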
Impact on L3Lat and Slowdown
For our system, we use t_c = 24 cycles and t_m = 145 cycles. Without loss of generality, we assume an application that spends half of the execution time in L3/memory accesses (T_base = 2 · T_L3mem) and has a baseline L3 miss-rate (α) of 50%. Using these values with Equations 4 and 7, we plot L3Lat and Slowdown in Figure 10 for different values of F and R. Our model shows that an application reloading identical data after a flush for <10% of its accesses will have negligible slowdown with ZBM. Moreover, even pathological scenarios, where all memory accesses are reloads of unchanged data after a flush, have a slowdown of less than 40%.
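The model can be evaluated numerically with a short sketch. The latency form used here (an L3 miss costs t_c + t_m, and the ZBM miss-rate is α + (1 − α) · F · R) is an assumption consistent with the definitions in the model, not a formula quoted from it. The function and parameter names are illustrative.

```python
def l3lat_norm(alpha, fr, t_c=24.0, t_m=145.0):
    """Normalized L3 access latency of ZBM relative to the baseline.

    alpha: baseline L3 miss-rate; fr: the product F * R.
    Assumed form: average latency = t_c + miss_rate * t_m.
    """
    base = t_c + alpha * t_m
    alpha_zbm = alpha + (1.0 - alpha) * fr  # zombie-hits are serviced as misses
    zbm = t_c + alpha_zbm * t_m
    return zbm / base


def zbm_slowdown(alpha, fr, l3mem_fraction=0.5, t_c=24.0, t_m=145.0):
    """Execution-time ratio T_ZBM / T_base.

    l3mem_fraction: share of baseline time spent in L3/memory accesses
    (0.5 corresponds to T_base = 2 * T_L3mem).
    """
    norm = l3lat_norm(alpha, fr, t_c, t_m)
    return (1.0 - l3mem_fraction) + l3mem_fraction * norm


# Sweep F*R from 0 (no flush-reloads of identical data) to 1 (pathological case).
sweep = {fr / 10: zbm_slowdown(0.5, fr / 10) for fr in range(11)}
worst_case = zbm_slowdown(0.5, 1.0)
```

Under these assumptions, the pathological case F · R = 1 yields a ratio of roughly 1.38, which agrees with the stated bound of less than 40% slowdown.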
ZBMx (an extension) to Avoid Slowdown
It is expected that only an uncommon class of workloads incurs slowdown with ZBM: those with frequent flushes to the majority of their working set, reloads of identical data from memory to cache, and a high hit-rate in L3 for the reloaded contents. To avoid slowdown even in such uncommon scenarios, we extend ZBM to allow cache-hits on a Zombie-line once the line has had a reload from the flush-causing core, which nullifies the effect of the flush. We call this design extension ZBMx.
ZBMx is implemented by tracking the Flush-Causing Core-ID (FCID), which is set along with the Z-bit, when the line is invalidated due to a clflush. When a subsequent access occurs from the same core (coreID matches the line's FCID), then the Z-bit of the line is reset. In this design, subsequent accesses after a Flush and Reload (by the same core) get the line with cache-hit latency, without any slowdown.
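The per-line state transitions described above can be illustrated with a toy software model. This is a sketch under stated assumptions, not the hardware design. The class `Line`, the functions `flush` and `access`, and the latency labels are hypothetical names. The model also assumes that the Z-bit is cleared when the reloaded data differs from the resident data, and that an access by the flush-causing core is itself serviced with miss latency before the Z-bit is reset.

```python
from dataclasses import dataclass
from typing import Optional

HIT, MISS = "hit-latency", "miss-latency"


@dataclass
class Line:
    data: int
    valid: bool = True
    z_bit: bool = False
    fcid: Optional[int] = None  # Flush-Causing Core-ID


def flush(line: Line, core: int) -> None:
    """clflush: invalidate the line but keep tag/data; mark it as a zombie."""
    line.valid = False
    line.z_bit = True
    line.fcid = core


def access(line: Line, core: int, mem_data: int, zbmx: bool = True) -> str:
    """Return the latency class observed by `core` for one L3 access."""
    if not line.valid:
        # Reload from memory (zombie-miss if the Z-bit is set).
        if line.z_bit and mem_data != line.data:
            # Assumption: changed content means the line is no longer a zombie.
            line.z_bit, line.fcid = False, None
        line.data, line.valid = mem_data, True
        latency = MISS
    elif line.z_bit:
        latency = MISS  # zombie-hit, serviced with miss latency
    else:
        latency = HIT
    if zbmx and line.z_bit and core == line.fcid:
        # ZBMx: an access by the flush-causing core resets the Z-bit.
        line.z_bit, line.fcid = False, None
    return latency


# Spy (core 1) flushes, victim (core 0) reloads, spy reloads: spy sees miss latency.
shared = Line(data=7)
flush(shared, core=1)
victim_latency = access(shared, core=0, mem_data=7)
spy_latency = access(shared, core=1, mem_data=7)
```

In this model, the spy observes miss latency whether or not the victim touched the line, while a benign core that flushes and reloads its own line gets hit latency on later accesses.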
ZBMx incurs a storage overhead of 4 bits per line (Z-bit + 3 bits of FCID for an 8-core system), which is still less than 1% of the area of the L3-cache. Thus, system designers can implement ZBMx with minimal storage overhead to avoid slowdown even for applications with frequent flushes, such as non-coherent IO, DMA, or persistent memory applications, where ZBM might be a potential concern.
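The storage figures for ZBM (1 bit per line) and ZBMx (4 bits per line) can be checked with simple arithmetic for the 16MB L3 cache with 64-byte lines. The function name is illustrative, and the overhead is expressed relative to data capacity rather than silicon area.

```python
def metadata_overhead(cache_bytes, line_bytes, bits_per_line):
    """Return (overhead in bytes, overhead as a fraction of cache data capacity)."""
    lines = cache_bytes // line_bytes
    overhead_bytes = lines * bits_per_line / 8
    return overhead_bytes, overhead_bytes / cache_bytes


L3_BYTES = 16 * 1024 * 1024
LINE_BYTES = 64

zbm_bytes, zbm_frac = metadata_overhead(L3_BYTES, LINE_BYTES, 1)    # Z-bit only
zbmx_bytes, zbmx_frac = metadata_overhead(L3_BYTES, LINE_BYTES, 4)  # Z-bit + 3-bit FCID
```

This gives 32 KB (about 0.2% of capacity) for ZBM and 128 KB (about 0.8% of capacity) for ZBMx.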
SECURITY DISCUSSION
Here, we discuss how ZBM tolerates other attack variants.
Mitigating Flush+Flush Attack
The Flush+Flush attack [24] exploits the variation in latency of the clflush instruction, depending on whether or not the line is present in the cache, to infer the accesses of a victim. Such attacks can be prevented by always serving a flush with a constant worst-case latency. To limit slowdown to only a potential attack scenario, we can activate such constant-time service of a flush only if the flush is to a line with the Z-bit set, as repeatedly executing a successful attack requires performing a flush on a recently flushed line (with the Z-bit set).
Tolerating Alternate Eviction Mechanisms
Cache Conflicts: Cache-set conflict based attacks, like Prime+Probe [14] , can be prevented by using way partitioning [9, 21] or by randomizing the cache indexing [8, 12, 13] . We assume our system has randomized indexing [13] for its shared cache and directory [26] .
Natural Eviction or Cache-Thrashing: An adversary could wait for the victim's sensitive lines to get naturally evicted (e.g., by reaching the LRU position in a set) through the victim's own accesses. Alternately, the adversary could attempt to evict all lines, including Zombies, by cache-thrashing (e.g., by accessing a large array). To conclusively evict a zombie line in either case, with a randomized-indexing cache, the adversary would need to wait for tens of thousands of lines to be installed (to replace the entire contents of the cache). Such an attack is 10,000x slower than Flush+Reload, which needs just one access for eviction, and is much less practical.
Non-Temporal Stores (NT-Stores) can also evict a cached line while writing to memory. However, this can be avoided with an implementation (like Pentium-M [23]) that updates a line in-place in the cache without eviction, if the location is cached. This leaves common-case usage of NT-Stores unchanged, where stores directly write to memory when the location is uncached, without costly Read-For-Ownership.
Tolerating Alternate Reload Strategies
Alternate attack variants such as Flush+Prefetch [27] and Invalidate+Transfer [28] perform the reload through a prefetch or a coherence operation, instead of a direct demand access. ZBM tolerates these attacks, as it always services zombie-hits on reload (regardless of its cause) with cache-miss latency.
Implications for Denial of Service (DoS)
An adversary can attempt a DoS attack by flushing shared lines and creating a large number of zombies in the cache. Subsequent victim accesses (zombie-hits) will be slowed down until the zombie lines are naturally evicted. Fortunately, this only impacts the L3 cache, as the zombie status is not propagated to the L1/L2 caches, so the application retains use of its L1/L2 caches. Worse DoS attacks not involving ZBM are possible even in the baseline, where an adversary can repeatedly flush lines from all levels of the cache hierarchy, disable the use of all caches, and cause much higher slowdown.
Tolerating Main-Memory Attacks
While our primary goal is mitigating Flush+Reload cache attacks using Zombie-lines, we observe that Zombies can also help in detecting other attacks that require repeated use of clflush. In this section, we discuss the application of Zombie-based detection to DRAM and cache coherence attacks (unfortunately, evaluating these is beyond our scope due to space limitations). Compared to naively counting flushes, counting flushes-on-zombies could lower the false-positives in attack detection, as this can detect episodes of repeat flushes to the same line (common to such attacks).
Tolerating DRAM Timing Side-Channels
DRAMA [29] exploits DRAM row-buffer timing as a side-channel. This attack uses the timing difference between a row-hit and a row-miss to leak the access pattern. To execute at a sufficiently fast rate, the attack requires a flush of the cache line in each iteration to ensure that the data is always read from memory. As these attacks incur frequent episodes of Zombie-Miss and then Flush-on-Zombie, we can count such episodes with dedicated hardware counters and detect a potential attack. When the counters cross a threshold, mitigation can be performed by switching to a closed-page policy.
Tolerating DRAM Row-Hammer Attacks
Row-Hammer attacks [30] flush two cache lines in each iteration, each corresponding to a different row in the same bank of DRAM. This causes frequent accesses to alternate rows, and a large number of row closures on the same rows within a short time-period. This injects faults in other neighbouring rows (due to faulty isolation in DRAM technology). Adversaries leverage this to inject bit-flips and modify access control bits in sensitive kernel data-structures, engineering illegal privilege escalation. While this attack can be mitigated by increasing the DRAM refresh rate, such a solution incurs performance and power overheads. Instead, as these attacks incur frequent episodes of Zombie-Miss and then Flush-on-Zombie, detection is possible with dedicated hardware counters tracking these episodes. When the count exceeds a threshold, mitigation can be performed by increasing the DRAM refresh rate for a certain time period.
Tolerating Coherence-Based Attacks
OS-shared pages can leak information through the coherence status of their cached lines, as per recent attacks [31] . Such attacks exploit the difference in cache access latency, based on whether the line is in Exclusive (E) or Shared (S) state. The spy first initializes the line to E, then allows the victim to execute. Later, by checking if the line is in S (based on access latency), the spy can infer if the victim accessed it.
For this attack, re-initializing the line to E is essential on each iteration: this is achieved with a clflush to invalidate the line, and a subsequent load to re-install it in E (note: using stores to invalidate copies of lines leveraging coherence is not useful, as that would unshare OS-shared pages). Hence, detecting episodes of repeated flush and reload to the same line, using hardware counters for Zombie-Misses and Flushes-on-Zombies, can help detect such attacks. When the counts cross a threshold, mitigation can be activated in hardware by servicing S and E accesses with similar (worst-case) latency.
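The counter-based detection idea shared by the DRAM and coherence attack discussions can be sketched in software as follows. This is only an abstraction: the class name, the per-line pending set, and the threshold value are hypothetical, and a hardware implementation would use dedicated counters rather than a Python set.

```python
class ZombieEpisodeDetector:
    """Counts episodes of a Zombie-Miss followed by a Flush-on-Zombie."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical tuning parameter
        self.count = 0
        self._reloaded_zombies = set()  # lines that had a Zombie-Miss

    def on_zombie_miss(self, line_addr):
        """Record that a zombie line was reloaded from memory."""
        self._reloaded_zombies.add(line_addr)

    def on_flush(self, line_addr, z_bit_set):
        """Record a flush; return True once the episode count reaches the threshold."""
        if z_bit_set and line_addr in self._reloaded_zombies:
            self._reloaded_zombies.discard(line_addr)
            self.count += 1
        return self.count >= self.threshold


detector = ZombieEpisodeDetector(threshold=3)
alarms = []
for _ in range(3):
    detector.on_zombie_miss(0x40)
    alarms.append(detector.on_flush(0x40, z_bit_set=True))
```

A flush to a line that is not a reloaded zombie leaves the count unchanged, which is how counting flushes-on-zombies could reduce false positives compared to counting all flushes.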
RELATED WORK
Our proposal leverages the insight that when a cache line is invalidated on a flush, the tag and the data are still in the cache and can be used for detecting and tolerating flush-based attacks. It is partly inspired by past work on Coherence Decoupling [32] that used the resident tag and data of the lines invalidated due to coherence, for performing speculative operations (using stale values) while waiting for the coherence response. In this section, we discuss cache attacks in general and prior solutions to tolerate Flush+Reload attacks.
Types of Cache Attacks
Recent works [33, 34] have surveyed both Conflict-Based Attacks [14, 27, 35] that generate cache-conflicts to invalidate cachelines and Flush-Based Attacks [7, 24, 28, 31] that use flushes to invalidate cachelines. These have been used to attack algorithms like AES [36, 37] , RSA [38, 39, 40] , etc. in cryptographic libraries and leak the secret keys.
Tolerating Conflict-Based Attacks
Conflict-based attacks can be tolerated by either preserving the data of the victim or through randomization. Examples of preservation-based approaches include PLCache [8], CATalyst [41], StealthMem [42], SecDCP [22] and Non-Monopolizable (NoMo) Cache [9]. Recent studies, such as Relaxed Inclusion Cache (RIC) [43] and SHARP [19], have also targeted the inclusion property of the LLC to provide preservation. Examples of randomization-based approaches to detect or mitigate such attacks include ReplayConfusion [44], RPCache [8], NewCache [12], and CEASER [13]. However, such solutions are ineffective at guarding against flush-based attacks on OS-shared lines, as the adversary can evict the line explicitly without any conflict or inclusion violation.
Tolerating Attacks with Software Support
Prior studies propose rewriting software to avoid storing critical information (such as encryption tables) in memory and instead compute it on-the-fly [14, 45] . Unfortunately, such revised implementations tend to be 2x to 4x slower than the original implementation [45] . Our solution avoids the software engineering effort and slowdown of such methods.
Another proposal [46] prefetches all the sensitive probe addresses into the cache (e.g., entire encryption tables) based on the software context (e.g., before running AES), preventing an attacker from selectively observing lines accessed by a victim. However, this cannot scale to prefetch larger shared libraries and requires a software rewrite to trigger such prefetches.
Tolerating Attacks by Restricting clflush
Flush+Reload attacks rely on the use of the clflush instruction for flushing the OS-shared (read-only) lines. SHARP [19] advocates restricting the use of clflush in user mode on such read-only lines, and triggering a copy-on-write through the OS if such an operation is attempted. Unfortunately, such an approach requires changes to the ISA to redefine the clflush instruction, is not backwards compatible, and needs modifications to the OS to handle a trap when a clflush is issued to deduplicated pages from a user-space application. Ideally, we want to mitigate the attack in hardware, without any changes to the ISA or to the OS.
Tolerating Attacks with OS-based Solutions
Flush-based attacks can be avoided by disabling sharing of critical pages in OS (as in CATalyst [41] or Apparition [47] ) or replicating shared pages on concurrent access by multiple processes [15] . Unfortunately, such solutions give up the capacity benefits of page sharing and are not preferable.
Prior studies [17, 18, 48] have suggested using hardware performance counters or monitoring memory bandwidth [48] for anomaly detection. Unfortunately, such proposals suffer from false positives and are not universally applicable, as profile information is not always available for all applications.
Tolerating Attacks by Line Duplication
Recent work [20] evaluated NewCache [12] for the Flush+Reload attack by accessing NewCache with the Process-ID and line-address. This creates duplicate copies of the line in the cache if two different processes concurrently access the same line, thereby preventing the flush of one process from evicting the line of another. Unfortunately, such in-cache duplication is incompatible with inter-process communication. Furthermore, NewCache requires storage for mapping tables and the OS to classify applications into protected and unprotected categories, to protect the mapping table from attacks.
DAWG [21] and MI6 [49] allocate a given number of ways and sets respectively per security domain. They protect against the Flush+Reload attack by creating a duplicate of the OS-shared line for each domain or disabling page-sharing across domains, thereby preventing the flush of one domain from evicting the line of another. Such a design places restrictions on inter-process communication through shared-memory, in that the communicating processes must be located in the same domain. Moreover, OS support is required for both cache allocation and for assigning processes to security domains. Ideally, we seek a hardware solution that does not require OS support and does not restrict inter-process communication.
CONCLUSION
Recent vulnerabilities exploiting cache side-channels have impacted the entire computing industry, underscoring the importance of mitigating them in next-generation hardware. In this paper, we investigate solutions for tolerating cross-core Flush+Reload attacks. Prior solutions for mitigating flush-based attacks require either a rewrite of the application, or OS support, or changes to the ISA. Ideally, we seek a solution that can efficiently mitigate Flush+Reload attacks without requiring any changes to the software, the OS, or the ISA, while incurring negligible overheads.
In this paper, we propose a simple hardware mitigation of Flush+Reload cache attacks using Zombie lines, which are lines invalidated by a Flush but with the tag and data still resident in the cache. Our solution is based on marking zombies on flushes and protecting them until they would have been naturally evicted. Our Zombie-Based Mitigation (ZBM) mitigates the attack in hardware by servicing zombie hits with the same latency as cache misses, thereby avoiding any timing leaks. Moreover, ZBM and its extension ZBMx (for niche applications with frequent flushes) avoid any slowdown for benign applications and incur a storage overhead of only 1-4 bits per cache line. Thus, ZBM is an effective, yet practical mitigation for cross-core Flush+Reload attacks.
