Since FPGAs are now available in datacenters to accelerate applications, providing FPGA hardware security is a high priority.
INTRODUCTION
Traditionally, the FPGA industry held the position that hardware security of an FPGA was primarily about protecting designs, in terms of intellectual property (IP) in configuration data (i.e., the configuration bitstream), against cloning/overbuilding, reverse engineering, tampering, and spoofing, as summarized in [24]. This view has changed now that FPGAs are being integrated into data centers and cloud computing infrastructures at large scale [4, 10, 19]. One principle commonly used in cloud computing is resource pooling, which shares resources across different tenants so that the overall utilization of cloud hardware resources improves. Resource pooling is currently not offered by any major FPGA cloud service provider, but multi-tenant scenarios are expected to provide better utilization and consequently better overall power efficiency at lower cost compared to the current one-user-per-fabric scheme [25].
It should also be mentioned that the commonly used scenario consisting of a shell (i.e., the static system infrastructure that a data center FPGA provides to allow a user circuit to communicate with the server) and a user accelerator design can already be considered multi-tenant. This is because the shell and a tenant are implemented individually, and protection mechanisms are needed to ensure the system integrity of both (shell and user accelerator). For instance, a user accelerator should not be able to gain access to the shell, which in turn may compromise other parts of the cloud infrastructure.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). FPGA '20, February 23-25, 2020, Seaside, CA, USA. © 2020 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-7099-8/20/02. https://doi.org/10.1145/3373087.3375390
The remainder of this tutorial is organized as follows: the next section provides a survey of FPGA hardware security, followed in Section 3 by a tutorial on how to research optimized ring oscillators for power hammering and for side-channel attacks. Section 3 also serves as a template for creating other kinds of malicious circuits. Section 4 provides a tutorial on installing and using the open-source FPGA bitstream virus scanner FPGADefender. Finally, some virus scan results are provided in Section 5.
HARDWARE THREATS FOR DATACENTER FPGAS
Due to their deep low-level programmability, FPGAs give rise to new threat models far beyond what is commonly known from conventional CPU/GPU systems. For instance, modules running on an FPGA may include circuits that can measure system states at high accuracy, which may open physical side-channels that leak sensitive data from other users [5, 21]. These kinds of attacks are not available in known software threat models, but have been shown for FPGAs. In the remainder of this section, we briefly review the literature on potential threats against multi-tenant FPGAs, which can be categorized into:
(1) attacks on the system availability (DoS-like attacks)
(2) attacks on the user confidentiality (via physical side-channel analysis)
This section will also provide published state-of-the-art countermeasures. 
Attacks on the system availability
Denial-of-service-like (DoS-like) attacks are used to bring down active infrastructures and/or to compromise states in other system components that lie outside the scope of an attacking module, as illustrated in Figure 1. At the electrical level, two means for DoS-like attacks have been utilized: short-circuits and power hammering. Short-circuits on modern FPGAs have been demonstrated in [1] within the multiplexers inside a switch matrix, using a manipulated configuration bitstream that results in a huge current increase (several mA of extra current for a single multiplexer). While the FPGA vendor tools ensure that generated bitstreams are short-circuit free, an attacker can create shorts relatively easily. In fact, in [8], short-circuits have been used for obfuscating power traces from an AES core to make power analysis attacks much harder to perform.
Power hammering is another mechanism to carry out DoS-like attacks. All current power hammering attacks [7] are based on fast toggling circuits in order to draw a substantial amount of dynamic power. We will show in Section 3 that it is possible to implement ring oscillators running in the GHz frequency domain with a corresponding dynamic power footprint. In [7], a grid of ring oscillators was activated at an adjustable rate (to stimulate resonance effects in the power supply regulation circuit). With this, several FPGA platforms such as Xilinx Virtex 6, Kintex 7, and Zynq-7000 FPGAs have been crashed (in some cases requiring power-cycling to bring the boards back into service). In this tutorial, we will examine the potential for power hammering in more detail in Section 3. Although ring-oscillators are usually flagged with a warning by the vendor design tool flows and hence not allowed to be deployed on any cloud or data center infrastructure, recent research [6] has reported new ring-oscillator designs that can bypass such Design Rule Checking (DRC). A simple trick to bypass DRC is implementing a ring oscillator that passes through an enabled transparent latch. With this, ROs could be deployed, for example, on Amazon F1 instances. Although all current power hammering attacks leverage self-oscillating circuits, glitch amplification can potentially be utilized for this purpose, as shown in Figure 3.
Attacks on the user confidentiality
Side-channel attacks on FPGAs can be either active (e.g., timing fault injection) or passive (e.g., power analysis, crosstalk coupling, electromagnetic analysis, and thermal channel leakage [16]). In [14], timing faults have been injected through a large number of ring oscillators to cause voltage drops, followed by analyzing the resulting faulty ciphertext using Differential Fault Analysis (DFA) to successfully reveal the secret key of a crypto-core. The idea of most timing fault injection attacks is to temporarily create a huge power demand (e.g., by starting a large number of ring oscillators). This will reduce the FPGA's supply voltage and may in turn slow down a path in a victim circuit such that it may fail timing.
Power analysis attacks have been demonstrated to leak the secret key of a cryptographic function that was running on the same FPGA [21] , running on a CPU embedded on the same FPGA die [27] , and running on a different FPGA on the same FPGA board [22] . All these attacks have in common that they use ring-oscillators to measure key-dependent fluctuations on the voltage. In addition to sensing voltage, self-oscillators can be used to monitor crosstalk effects [5, 6, 20] . In these studies, it was found that a long wire carrying a logical 1 will slow down a ring-oscillator that is implemented using an adjacent wire. Therefore, by taking advantage of the sensitivity of self-oscillators, attackers can leak the current state of a signal which is a concern in shared FPGA infrastructures.
State-of-the-art countermeasures
The main schemes to prevent side-channel power analysis attacks are based on masking and hiding strategies. In the masking strategy, an implementation of a cryptographic algorithm is transformed into another (typically larger) variant that is functionally equivalent, but where the new circuit remains secure even though the attacker can observe some details of the operation through a side-channel, as proposed in [11]. This makes power analysis attacks much harder, as the leaked data also has to be correlated with the implementation-changing scheme used inside the secured core. The hiding strategy, on the other hand, aims at lowering the Signal-to-Noise Ratio (SNR) during the operation by either adding more sources of noise or lessening the strength of the signal, as suggested in [3, 8, 12, 26]. Ring-oscillators can be used to monitor the health of an FPGA fabric [28] and can also detect voltage drop attacks (e.g., power hammering, power analysis) [18, 29]. A recent work has suggested using ring-oscillators not only to monitor for a power analysis attack but also to respond to the attack by triggering more power noise [13].
In a recent related work [15], LUT-based ring-oscillator designs are detected directly from configuration bitstreams. While that work fundamentally showed that oscillator circuits can be detected from bitstreams, it was only shown for basic LUT-based oscillators. This leaves an attacker the chance to deploy alternative oscillator designs (e.g., based on glitch amplification). Furthermore, [15] was implemented on a Lattice FPGA, and those FPGAs are relatively small for building a multi-tenancy system. In contrast, the vast majority of systems that would benefit from an FPGA virus scanner are based on modern FPGA architectures that are substantially more complex (e.g., fracturable LUTs, complex DSP blocks with ALU functionality, complex clock networks, a hierarchical routing fabric, etc.).
The following section will show how we use GoAhead to research the threat potential of ring-oscillators, while in Section 4 we show how viruses can be detected with our new tool FPGADefender.
OPTIMIZING AND EVALUATING RING OSCILLATOR DESIGNS FOR POWER HAMMERING AND SPEED
For power hammering, an attacker obviously wants to maximize the amount of power a malicious circuit can waste per unit of resources. For a side-channel attack, in contrast, the highest sensitivity is the most important objective, which correlates with the speed of an oscillator. In this section, we will use the tool GoAhead to design and tune ring oscillators to waste as much power as possible or to run as fast as possible. We use a setup on an Ultra96 board where we precisely measure supply power. We use a Time-to-Digital Converter, as illustrated in Figure 4, to measure the actual clock frequency of an oscillator. The FPGA on the Ultra96 board uses the same manufacturing process and the same UltraScale+ FPGA fabric architecture as current Xilinx datacenter FPGA boards, such as the popular Alveo U200/U250 boards. The best design found for the Ultra96 will then be tested at scale on an Alveo U200 board.
Time-to-Digital Converters on Xilinx UltraScale FPGAs
Ring oscillators on an FPGA can run in the GHz regime, which is substantially faster than any user logic design can normally sustain. This makes simple counters unsuitable for measuring the fastest possible oscillators. We therefore used a Time-to-Digital Converter (TDC) to measure the speed of oscillators. TDCs basically use propagation delay to measure a waveform. The idea is to use different propagation delays from the probe to a set of flip-flops such that the flip-flops, sampled in parallel, reveal the state of the probe at different points in time. See Figure 4 for an illustration of the operation of a TDC. The flip-flop sample clock speed can be selected mostly arbitrarily (we used 100 MHz), as the variance in propagation delay is the key property that determines the characteristics of a TDC.
Traditionally, TDCs have been implemented using carry chains to implement the delay chain. However, Xilinx UltraScale+ FPGAs do not have traditional carry chains, but use carry-look-ahead (CLA) circuits instead. For implementing a TDC on Xilinx UltraScale+ FPGAs, we consequently use local routing for the delay chain. With this strategy, it is not important to find the fastest path (which is what we are normally interested in when implementing a module for performance). Instead, we need to find paths that reach the different flip-flops such that they form a TDC delay chain that has a linear characteristic (i.e., the difference in delay between any two consecutive flip-flops should be about the same) and high resolution (i.e., the absolute delay difference between any two consecutive flip-flops should be small). This physical implementation problem is non-trivial and not directly supported by the Xilinx vendor tools.
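The sampling principle of a TDC can be illustrated with a small behavioral model. The following Python sketch is not the implementation used in this tutorial; it assumes an ideal square-wave probe, ideal flip-flops, and a hypothetical delay chain with evenly spaced taps. The function name sample_tdc and all parameters are illustrative.

```python
def sample_tdc(period_ps, tap_delays_ps, sample_time_ps=0, duty=0.5):
    """Return the bit captured by each flip-flop at one sample clock edge.

    A tap with delay d sees the probe as it was d picoseconds before the
    sample edge, so the taps together show a short history of the waveform.
    """
    bits = []
    for delay in tap_delays_ps:
        # Position inside the oscillation period, normalized to [0, 1).
        phase = ((sample_time_ps - delay) % period_ps) / period_ps
        bits.append(1 if phase < duty else 0)
    return bits


# Hypothetical delay chain: 16 taps spaced 70 ps apart, probing a 560 ps signal.
taps = [70 * k for k in range(16)]
snapshot = sample_tdc(560, taps)
print(snapshot)
```

In this idealized case, the snapshot contains runs of four high and four low samples, i.e. eight taps of 70 ps cover one 560 ps period. A real delay chain deviates from the even spacing, which is why linearity matters when selecting the paths.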
We solved that problem using the path search function in the tool GoAhead [2]. GoAhead is a tool originally designed for implementing partially reconfigurable FPGAs. In its latest version, it uses Xilinx Vivado to report a device description that includes the entire architecture graph (including all possible switch matrix settings) as well as a detailed timing model. This device description is parsed by GoAhead, which allows the tool to report latencies for any path found from any arbitrary primitive port or a port inside a switch matrix. To automate processes in GoAhead, the tool supports TCL scripts. The TCL script in Figure 5 runs such a path search. The result of the script is a set of paths found by a breadth-first search, sorted for each search in the for-loop of the script by the number of hops (which correlates with the routing resources used), as shown in Figure 6. Please note that the names used in GoAhead correspond exactly to the naming scheme used by Xilinx in their Vivado tool suite. This holds for names used in scripts as well as for names used in results. Most importantly, GoAhead annotates the latency for each path. As can be seen in Figure 6, GoAhead reports the time as it gets incremented along the path. With the PrintLatency switch in the GoAhead PathSearchOnFPGA function, a user can select any of the SLOW_MIN, SLOW_MAX, FAST_MIN, or FAST_MAX timing corners to be considered in the timing analysis.
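To make the idea of a latency-annotated breadth-first path search concrete, the following Python sketch enumerates paths on a tiny routing graph and accumulates the delay along each path. It is only an illustration of the technique: the graph, the node names, and the delays are invented, and the code does not reproduce GoAhead's device model or its PathSearchOnFPGA command.

```python
from collections import deque

# Hypothetical toy routing graph: node -> list of (next_node, hop_delay_ps).
ROUTING = {
    "LUT_O": [("SW_A", 40), ("SW_B", 55)],
    "SW_A": [("FF0_D", 30), ("SW_B", 20)],
    "SW_B": [("FF0_D", 35), ("FF1_D", 45)],
}


def path_search(graph, begin, target, max_hops=10):
    """Enumerate loop-free paths breadth-first, so fewer hops come first.

    Each result is (path, cumulative_latencies_ps); the last latency entry
    is the total delay of the path. begin == target searches for rings.
    """
    results = []
    queue = deque([([begin], [0])])
    while queue:
        path, latency = queue.popleft()
        if path[-1] == target and len(path) > 1:
            results.append((path, latency))
            continue
        if len(path) - 1 >= max_hops:
            continue
        for nxt, delay in graph.get(path[-1], []):
            if nxt in path and nxt != target:
                continue  # do not revisit nodes, except to close a ring
            queue.append((path + [nxt], latency + [latency[-1] + delay]))
    return results


for path, latency in path_search(ROUTING, "LUT_O", "FF0_D"):
    print(len(path) - 1, "hops,", latency[-1], "ps:", " -> ".join(path))
```

Sorting the returned list by the last latency entry gives the ranking by total delay from which delay-chain taps can be picked.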
Figure 6: Output created by the TCL script in Figure 5.
For building the TDC delay chain, the resulting paths are sorted by their latency (i.e., the latency reported for the last hop in each path), and we manually chose a set of paths that results in good linearity (≈ ±10 ps) and a reasonably fine resolution (≈ 70 ps). With $N_{HIGH}$ being the number of samples in a TDC that measured a high value and $N_{LOW}$ being the number of low samples, the speed of an RO is:
$$f_{RO} = \frac{1}{(N_{HIGH} + N_{LOW}) \cdot 70\,\mathrm{ps}}$$
With a ≈ 70 ps resolution, this allows measuring RO speeds up to about 7 GHz. To obtain more accurate values, we report all values as the median of at least 10 000 runs.
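As a worked example, the following Python sketch converts TDC sample counts into an oscillator frequency. It assumes that one RO period spans the high samples plus the low samples, each sample covering the ≈ 70 ps resolution stated above; this is an interpretation of the text, and the function names are illustrative.

```python
from statistics import median

TDC_RESOLUTION_PS = 70.0  # approximate tap spacing reported in the text


def ro_frequency_ghz(n_high, n_low, resolution_ps=TDC_RESOLUTION_PS):
    """Frequency of an RO whose period covers n_high + n_low TDC samples."""
    period_ps = (n_high + n_low) * resolution_ps
    return 1000.0 / period_ps  # a 1000 ps period corresponds to 1 GHz


def median_frequency_ghz(runs):
    """Median over many (n_high, n_low) measurements to suppress outliers."""
    return median(ro_frequency_ghz(h, l) for h, l in runs)


# Shortest observable period: one high and one low sample (140 ps).
print(round(ro_frequency_ghz(1, 1), 2), "GHz upper limit")
print(round(median_frequency_ghz([(4, 4), (4, 5), (4, 4)]), 2), "GHz median")
```

With one high and one low sample, the shortest measurable period is 140 ps, which is consistent with the stated limit of about 7 GHz.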
Optimizing Ring Oscillators for Power Hammering and Speed
With an FPGA system instrumented for accurate measurement of power and frequency (through our TDCs), we explored and evaluated various FPGA ring-oscillator designs for their suitability for power hammering and for side-channel attacks (i.e., fastest oscillator speed). We used the GoAhead tool to find all possible ring oscillator designs. This again uses the GoAhead PathSearchOnFPGA command (which we used for designing the TDC in the previous section) by simply specifying the output of the LUT intended for the ring oscillator implementation for both the beginPort and the targetPort. An output of such a search is shown in Figure 7. For each LUT, the path search sorts the resulting paths such that the paths with the fewest hops are reported first. These paths are typically the fastest ones, and the reported latency serves as a sanity check. We used a GoAhead script (in the same way as for finding the TDC delay paths) to find the fastest ring-oscillator designs over all LUTs in a CLB. We then implemented those paths for 2000 LUTs on the Ultra96 board and measured speed and power consumption.
Our experiments found the fastest oscillator speed to be 5.8 GHz and an increase in power of 4.2 W for the most malicious oscillator design found (see Figure 9). The experiments with the poorest results achieved only 1.1 GHz speed and 1.7 W waste power. This means that a single LUT has a waste power potential of 2.1 mW (4.2 W / 2000 LUTs) when considering the most malicious oscillator design. To put this into perspective: an Alveo U200 data center card featuring a VU9P FPGA with 1.182 million LUTs would have a waste power potential of over 2 kW using the power-optimized ring oscillator design. Consequently, even a fraction of that logic would by far exceed the thermal and electrical specifications of any FPGA or FPGA board.
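The waste power figures can be checked with a few lines of arithmetic. The Python sketch below uses only the numbers stated above; the linear scaling from the Ultra96 measurement to the VU9P is the same simplifying assumption made in the text, and the function names are illustrative.

```python
def waste_power_per_lut_w(extra_power_w, n_luts):
    """Average additional power drawn per oscillator LUT, in watts."""
    return extra_power_w / n_luts


def extrapolate_w(per_lut_w, n_luts):
    """Linear extrapolation to a larger device (ignores voltage droop)."""
    return per_lut_w * n_luts


per_lut = waste_power_per_lut_w(4.2, 2000)  # Ultra96 measurement
vu9p_total = extrapolate_w(per_lut, 1.182e6)  # VU9P LUT count from the text

print(f"per LUT: {per_lut * 1000:.1f} mW")
print(f"VU9P potential: {vu9p_total / 1000:.2f} kW")
```

The extrapolation yields roughly 2.5 kW, in line with the stated potential of over 2 kW.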
Xilinx Alveo U200 Power Hammering Experiment
We deployed the optimized ring-oscillator design from the previous section on an Alveo U200 data center card. This board has the same specifications as the FPGA boards available with Amazon's F1 instances. We deployed 384000 ROs (≈ 32% of the available LUTs, as shown in Figure 8) and gradually enabled them to find the critical point at which the board crashes. Surprisingly, activating ROs on only 15% of the total LUT resources (1182240 LUT6 primitives) caused a strong drop in the internal core voltage VCCINT and eventually crashed the board when VCCINT reached 0.74 V. At that point, we had already exceeded the 225 W maximum power budget of the board. Figure 10 and Figure 11 show the power consumption, internal FPGA core voltage VCCINT, and core temperature in relation to the activated ROs. Please note that the power was measured on the power supply grid with the help of an ammeter (true RMS). The power supply used was a Silverstone Strider 600W Modular SFX 80+ Gold Power Supply. The figures illustrate how dangerous malicious circuits could be in a data center setup. Therefore, it is necessary to prevent loading any bitstream onto an FPGA board that may include such malicious circuits.
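The proportions in this experiment can be cross-checked numerically. The Python sketch below uses the counts stated above; applying the 2.1 mW per-LUT figure measured on the Ultra96 to the U200 is a naive linear assumption and not a measurement, since the core voltage drops as more ROs are enabled.

```python
TOTAL_LUTS = 1_182_240   # LUT6 primitives on the VU9P, from the text
DEPLOYED_ROS = 384_000   # ROs placed on the device, from the text
CRASH_PERCENT = 15       # share of LUTs active at the crash, from the text
PER_RO_W = 4.2 / 2000    # Ultra96 figure; transferring it is an assumption
BOARD_BUDGET_W = 225     # maximum power budget of the board, from the text


def crash_point_estimate():
    """Return (deployed fraction, ROs active at crash, naive power in W)."""
    deployed_fraction = DEPLOYED_ROS / TOTAL_LUTS
    ros_at_crash = TOTAL_LUTS * CRASH_PERCENT // 100
    naive_power_w = ros_at_crash * PER_RO_W
    return deployed_fraction, ros_at_crash, naive_power_w


fraction, active, power = crash_point_estimate()
print(f"deployed: {fraction:.1%}, active at crash: {active}")
print(f"naive RO power: {power:.0f} W (board budget {BOARD_BUDGET_W} W)")
```

Even this rough estimate places the RO power alone above the board budget at the 15% point, which agrees with the observation that the budget had already been exceeded.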
Further GoAhead Use Cases
So far, we showed how the timing-driven path search in GoAhead can be used to find and optimize ring-oscillators. There are several other use cases that can benefit from this ability, in particular in the field of hardware security. For instance, in Figure 3 we showed how different routing latencies can cause glitches. This, however, depends on the exact routing delays, and by balancing latencies for all paths to the XOR gate shown in the example in Figure 3, glitches can actually be canceled out. This is relevant for implementing DPA-resistant circuits of cryptographic algorithms that often heavily use XORs. In such applications, balancing routing latencies may dramatically reduce power signatures that can be measured by a potential attacker (see also Figure 2). Vice versa, carefully imbalanced routing can be used for amplifying glitches (as needed for power hammering). Other use cases include the design of asynchronous circuits and wave pipelining, which rely on the implementation of exact (routing) latencies to function correctly.
In many cases, only a few signals are critical and they can be easily found by a path search in GoAhead together with a ranking of the results by latency. The selected paths can be directly implemented in the Xilinx Vivado tool through guided routing constraints (using the TCL command set_property ROUTE). All remaining routing can then be added by Vivado automatically. By default, GoAhead uses a breadth-first search, which means that the search essentially enumerates the entire search space. In practice, this is often acceptable because the depth of the search is rather limited (typically less than 10 hops in practical systems) and the adjacency of switch matrices is rather sparse. For longer paths, the GoAhead path search also supports a variant of A*.
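For readers unfamiliar with A*, the following Python sketch shows the textbook algorithm on a toy routing graph. It is not GoAhead's variant: the tile coordinates, the delays, and the heuristic (Manhattan distance times an assumed minimum delay per tile) are all invented for illustration.

```python
import heapq

# Hypothetical tile positions and routing edges: node -> [(next, delay_ps)].
POS = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (2, 1)}
EDGES = {
    "A": [("B", 30), ("C", 80)],
    "B": [("C", 30), ("D", 90)],
    "C": [("D", 30)],
}
MIN_DELAY_PER_TILE_PS = 30  # assumed lower bound; keeps the heuristic admissible


def heuristic(node, target):
    """Optimistic remaining delay: Manhattan distance times minimum delay."""
    (x0, y0), (x1, y1) = POS[node], POS[target]
    return (abs(x0 - x1) + abs(y0 - y1)) * MIN_DELAY_PER_TILE_PS


def a_star(begin, target):
    """Return (latency_ps, path) of a lowest-latency path, or None."""
    frontier = [(heuristic(begin, target), 0, [begin])]
    best = {}  # lowest cost with which each node has been expanded
    while frontier:
        _, cost, path = heapq.heappop(frontier)
        node = path[-1]
        if node == target:
            return cost, path
        if best.get(node, float("inf")) <= cost:
            continue
        best[node] = cost
        for nxt, delay in EDGES.get(node, []):
            new_cost = cost + delay
            estimate = new_cost + heuristic(nxt, target)
            heapq.heappush(frontier, (estimate, new_cost, path + [nxt]))
    return None


print(a_star("A", "D"))
```

Unlike the breadth-first search, which enumerates all paths up to a given depth, A* expands the most promising partial paths first and stops at the first lowest-latency path it finds.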
FPGA VIRUS-SCANNING WITH FPGADEFENDER
Having examined the threat of ring-oscillators in previous sections, we are now looking closer into threat mitigation strategies. We will now introduce the tool FPGADefender which detects malicious constructs in bitstreams such that a system can reject a threat before it could even materialize on an FPGA. 
Overview
FPGADefender is built entirely in Python, which provides a bundle of supportive packages such as NetworkX [9] to represent and analyze an implementation graph from a bitstream. As a first step, an implementation graph, which contains node and edge information, is created by a netlist generator. This graph reconstructs the netlist encoded inside the bitstream. The netlist generator is implemented as an enhancement to the tool Bitman. The implementation graph is encoded in JSON format, as shown in Figure 13. After parsing the implementation graph, scanning options are parsed to provide inputs for the virus detector engine as well as filters. FPGADefender allows specifying a positive filter to describe configurations that must exist in the original bitstream (e.g., a specific connection through which a partially reconfigurable module communicates with the surrounding shell infrastructure). Correspondingly, a negative filter allows describing primitives and routing resources that are prohibited in a bitstream. In detail, the scanning process executes the following set of virus detector engines:
• Combinational cycle detector: Detect combinatorial cycles. This includes detecting cycles that use transparent latches in order to prevent the attack revealed in [23].
• Attribute detector: Detect asynchronous design elements such as using latches.
• Port detector: Detect prohibited ports. For example, this allows it to detect if a partial module tries leaking to a port not belonging to its allocated partial region.
• Path detector: Detect prohibited paths. For example, detect if a partial module tries accessing a static route that is crossing a partial region (note that we explicitly allow static routes which are commonly used in complex designs).
• Antenna detector: Detect dangling paths. This is in most cases rather a warning that a module may have an interface wire not properly connected.
• Short circuit detector: Detect short circuits caused by bitstream manipulation. In general, we detect any bitstream encoding that is invalid for routing. In Xilinx UltraScale+ FPGAs, this means in practice that all switch matrix multiplexers have to be one-hot encoded.
• Fanout detector: Detect and report maximum fanout. This is an indicator for a malicious design as power-hammering needs some kind of high fan-out control in order to activate a larger number of ROs. However, this is just an indicator as an attacker could easily hide high fan-out signals. This is an interesting field for further research.
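The core idea behind the combinational cycle detector can be sketched in a few lines of Python. The code below is not FPGADefender's implementation (which builds on NetworkX); it is a self-contained depth-first search using only the standard library, with a hypothetical edge-list input. Edge-triggered flip-flops break combinational paths and are removed, whereas transparent latches are deliberately kept so that rings through latches are still found.

```python
def find_combinational_cycle(edges, registers=frozenset()):
    """Return one cycle as [n0, ..., n0], or None if the graph is acyclic.

    edges:     iterable of (source, sink) connections between cells.
    registers: edge-triggered flip-flops that break combinational paths.
               Transparent latches must NOT be listed here.
    """
    graph = {}
    for src, dst in edges:
        if src in registers or dst in registers:
            continue
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = dict.fromkeys(graph, WHITE)
    for root in graph:
        if color[root] != WHITE:
            continue
        color[root] = GREY
        path = [root]
        stack = [(root, iter(graph[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if color[nxt] == GREY:  # back edge closes a cycle
                    return path[path.index(nxt):] + [nxt]
                if color[nxt] == WHITE:
                    color[nxt] = GREY
                    path.append(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:  # all successors handled, backtrack
                color[node] = BLACK
                stack.pop()
                path.pop()
    return None


# Hypothetical ring through a LUT and an enabled transparent latch.
print(find_combinational_cycle([("lut_a", "latch_b"), ("latch_b", "lut_a")]))
```

The reported cycle starts and ends with the same node, which matches the way cycles are listed in the scan reports.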
A score is given in each scanning stage and summed up to deliver a total score. Currently, FPGADefender leaves the evaluation of the scores and the report to the user. However, our virus scanner already performs all the heavy lifting of the scanning work. Based on the reported result, the configuration manager will ultimately be able to decide whether a bitstream is safe to be deployed or not, as shown in Figure 12.
How to use FPGADefender
FPGADefender is a command-line program for scanning implemented FPGA designs (i.e., bitstreams) for malicious circuits and constructs. This section describes installing and using FPGADefender. When invoked, FPGADefender runs on the implementation graph given by the input_design.json file, based on the options set in the config.ini file, and outputs the results to the output.txt file.
The config file is used to configure FPGADefender and the tools it uses. The configuration file is parsed using Python's ConfigParser package and therefore consists of sections and options. The configuration file should have the following items specified:
• virus_signatures: Names of the virus signature packages to be executed
  - Specific virus_signature options described in the next section
• connection_attributes: Optional section for adding attributes to connections
  - attributes_file: Path to the CSV file describing which connections get which attributes.
• removables
  - connections_file: Path to a text file describing which connections should be removed from the implementation graph before the scans.
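A configuration with these sections can be read with Python's configparser module, as the following sketch shows. The section and option names follow the list above; the file names and the layout of the entry under virus_signatures are assumptions for illustration, since the exact format is defined by the tool (see Table 1).

```python
import configparser

EXAMPLE_CONFIG = """
[virus_signatures]
# Hypothetical entry: the exact key/value layout is an assumption.
ring_oscillator_detection = enabled

[connection_attributes]
attributes_file = attributes.csv

[removables]
connections_file = removables.txt
"""


def load_config(text):
    """Parse the configuration text and collect the documented items."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    signatures = []
    if parser.has_section("virus_signatures"):
        signatures = list(parser["virus_signatures"])
    return {
        "signatures": signatures,
        # Optional items fall back to None when absent.
        "attributes_file": parser.get(
            "connection_attributes", "attributes_file", fallback=None),
        "connections_file": parser.get(
            "removables", "connections_file", fallback=None),
    }


print(load_config(EXAMPLE_CONFIG))
```

Using fallback values keeps the optional connection_attributes section truly optional: a config without it still loads.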
The different available virus signatures can be set up in the config file by adding the name of the virus signature class under the virus_signatures section, as shown in Table 1. To build the executable, first a requirements file has to be set up for the venv environment. With this, we can run:
    pip install -r requirements.txt
This installs the required tools, including the PyInstaller tool obtained from pip. When building the executable, we have to make sure to add the virus scanner packages given in the config file as hidden imports, as shown in the following example:
    pyinstaller virusscanner/__main__.py -n virusscanner -F --hidden-import=virusscanner.parsing.signatures.ring_oscillator_detection
To add more than one signature, the .spec file can be modified.
SCAN RESULTS
We ran FPGADefender on a benchmark of malicious bitstreams, and this section briefly presents the results. As a sanity check, we also ran scans on bitstreams that do not contain malicious circuits, and FPGADefender did not report any issue, except for one case: a true random number generator that actually uses ring-oscillators as a source of randomness. In detail, we provide the following reports:
• Combinatorial loop and transparent latch detection are reported in Figure 15. The file lists a couple of cycles detected. Each cycle starts with a status line stating the specific class of ring-oscillator. FPGADefender supports detecting ROs through LUTs, cascading multiplexers (MUX7/MUX8 in Xilinx notation), CLA carry logic, DSP blocks, and latches. After this, the entire first cycle of each class is reported. This can be identified by the first and last entry of each cycle pointing to the same node.
• Short-circuits are reported in Figure 16. This section first reports the number of short circuit situations found and then lists, for the first detected switch matrix multiplexer, the input ports activated. Each switch matrix multiplexer can only connect to no port (if not used) or to at most one of its available inputs.
• Latches are reported in Figure 17. This section reports latches used in cycles but also all other latches, which are not malicious but which indicate that the bitstream was not implemented following good RTL design principles.
• Antennas are reported in Figure 18. The report lists the last port of an antenna, which allows investigating the antenna issue using the Vivado tool suite.
• Fan-outs are reported in Figure 19. The fan-out report lists the nets with the highest fan-out in the design. The number of nets reported is specified in the config file.
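The rule behind the short-circuit report, namely that each switch matrix multiplexer may activate at most one input, can be expressed as a simple check. The Python sketch below is an illustration only: the multiplexer names and the bit-list representation are hypothetical and do not reflect the actual bitstream encoding or FPGADefender's data structures.

```python
def find_short_circuits(mux_config):
    """Return {mux: [active input indices]} for every violating multiplexer.

    mux_config maps a multiplexer name to a list of 0/1 select bits, one
    per input. Zero or one active input is legal; more than one shorts
    the selected inputs together.
    """
    shorts = {}
    for mux, bits in mux_config.items():
        active = [index for index, bit in enumerate(bits) if bit]
        if len(active) > 1:
            shorts[mux] = active
    return shorts


# Hypothetical configuration: one unused, one legal, one manipulated mux.
config = {
    "INT_X1Y1/MUX_0": [0, 0, 0, 0],
    "INT_X1Y1/MUX_1": [0, 1, 0, 0],
    "INT_X1Y1/MUX_2": [1, 0, 1, 0],
}
report = find_short_circuits(config)
print(len(report), "short circuit(s):", report)
```

The returned dictionary mirrors the structure of the report: the number of violations first, followed by the activated input ports of an offending multiplexer.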
CONCLUSIONS AND DISCUSSION
In this tutorial, we provided a small survey of recent FPGA hardware security research, and we revealed that ring-oscillators in particular pose a real-world threat. With this, we described how the academic tool GoAhead can be used to build a Time-to-Digital Converter for UltraScale+ FPGAs, which was used for evaluating a larger number of ring-oscillator designs. This resulted in one design that has the enormous waste power potential of over 2 kW on an Alveo U200 data center card, and experiments on that board resulted in a power-induced crash using just 15% of the LUT resources of the VU9P FPGA. In the remainder of this tutorial, we showed how the open-source tool FPGADefender can detect (probably all kinds of) ring oscillator designs for mitigating this threat. The huge waste power potentials point out that hardware Trojans and other malicious circuits are a real threat and that only very little logic is required to crash a system. We would like to stress that this is not a vendor-specific problem and that the threats discussed in this tutorial apply to any FPGA from any vendor. However, we also showed that malicious circuits can be detected automatically and that this is even possible at the bitstream level. We believe that security through some level of virus scanning is inevitably needed as part of an FPGA ecosystem. We also believe that such security tools can reliably solve any security issue and that even multi-tenancy in datacenters is well possible. For industry, the best way to address security challenges is by opening architectures, bitstreams, and tools in order to give the research community the best possibilities to develop mitigation strategies. With this tutorial, we want to create awareness for FPGA security and stimulate research to ensure that FPGA security will be treated in a proactive manner.
ACKNOWLEDGMENTS
This work is kindly supported by the National Cyber Security Centre of the UK through the project rFAS (reconfigurable FPGA Accelerator Sandboxing, grant agreement 4212204/RFA 15971) and by the European Commission through the H2020 project EuroEXA (grant 754337).
We also thank the Xilinx University Program for tools and boards.
