University of Massachusetts Amherst

ScholarWorks@UMass Amherst
Doctoral Dissertations

Dissertations and Theses

April 2021

ADDRESSING SECURITY CHALLENGES IN EMBEDDED SYSTEMS
AND MULTI-TENANT FPGAS
Georgios Provelengios
University of Massachusetts Amherst

Follow this and additional works at: https://scholarworks.umass.edu/dissertations_2
Part of the Computer and Systems Architecture Commons, Digital Circuits Commons, Electrical and
Electronics Commons, Hardware Systems Commons, and the VLSI and Circuits, Embedded and Hardware
Systems Commons

Recommended Citation
Provelengios, Georgios, "ADDRESSING SECURITY CHALLENGES IN EMBEDDED SYSTEMS AND MULTI-TENANT FPGAS" (2021). Doctoral Dissertations. 2132.
https://doi.org/10.7275/20484053 https://scholarworks.umass.edu/dissertations_2/2132

This Open Access Dissertation is brought to you for free and open access by the Dissertations and Theses at
ScholarWorks@UMass Amherst. It has been accepted for inclusion in Doctoral Dissertations by an authorized
administrator of ScholarWorks@UMass Amherst. For more information, please contact
scholarworks@library.umass.edu.

ADDRESSING SECURITY CHALLENGES IN
EMBEDDED SYSTEMS AND MULTI-TENANT FPGAS

A Dissertation Presented
by
GEORGIOS PROVELENGIOS

Submitted to the Graduate School of the
University of Massachusetts Amherst in partial fulfillment
of the requirements for the degree of
DOCTOR OF PHILOSOPHY
February 2021
Electrical and Computer Engineering

© Copyright by Georgios Provelengios 2021
All Rights Reserved

ADDRESSING SECURITY CHALLENGES IN
EMBEDDED SYSTEMS AND MULTI-TENANT FPGAS

A Dissertation Presented
by
GEORGIOS PROVELENGIOS

Approved as to style and content by:

Russell Tessier, Chair

Wayne Burleson, Member

Daniel Holcomb, Member

Sunghoon Ivan Lee, Member

Tilman Wolf, Member

Christopher V. Hollot, Department Head
Electrical and Computer Engineering

ACKNOWLEDGMENTS

In my doctoral journey, I have been fortunate to be surrounded by so many intelligent and capable people whose continuous help enabled me to reach the end of it. I
would therefore like to express my gratitude to them here.
First of all, I would like to express my sincere gratitude to my advisor Professor
Russell Tessier for giving me the chance to learn how to do quality research at his
distinguished research group. Thank you for your continuous support and mentorship
and for sharing your immense knowledge and experience with me. I also sincerely
thank you for giving me so many opportunities for career development during my
studies. Besides my advisor, I am also deeply grateful to Professor Daniel Holcomb.
Thank you for your guidance, support, and encouragement, and for generously devoting your energy and time to this thesis.
I would like to thank the members of my dissertation committee for providing
feedback that helped me shape the research goals of this thesis. I would like to
especially thank Professor Tilman Wolf for his valuable guidance, ideas, and time
during the first part of the work.
During my doctoral journey, I also had the privilege to work closely with highly
talented and knowledgeable people, good friends now, including Arman Pouraghily,
Xuzhi Zhang, and Chethan Ramesh. I sincerely thank you for all your help and
contributions to this work. I would also like to give a special thanks to my labmates
Naveen Kumar Dumpala, Shivukumar Basanagouda Patil, Naren Prabhu, Zeqi Qin,
Shayan Moini, Aiden Gula, Omkar Kavitkar, and Lijuan Xia for their selfless help
and constant support.

ABSTRACT

ADDRESSING SECURITY CHALLENGES IN
EMBEDDED SYSTEMS AND MULTI-TENANT FPGAS
FEBRUARY 2021
GEORGIOS PROVELENGIOS
B.Sc., TECHNOLOGICAL EDUCATIONAL INSTITUTE OF WESTERN
GREECE
M.Sc., NATIONAL AND KAPODISTRIAN UNIVERSITY OF ATHENS
Ph.D., UNIVERSITY OF MASSACHUSETTS AMHERST
Directed by: Professor Russell Tessier

Embedded systems and field-programmable gate arrays (FPGAs) have become
crucial parts of the infrastructure that supports our modern technological world.
Given the multitude of threats that are present, the need for secure computing systems
is undeniably greater than ever. Embedded systems and FPGAs are governed by
characteristics that create unique security challenges and vulnerabilities.
Despite their array of uses, embedded systems are often built with modest microprocessors that do not support the conventional security solutions used by workstations, such as virus scanners. In the first part of this dissertation, a microprocessor
defense mechanism that uses a hardware monitor to protect application-level and
embedded operating system execution is presented. The monitoring system is placed
adjacent to the embedded processor and observes each instruction during execution
to ensure correct program operation. Our hardware-based processor monitoring system is shown to prevent an impending control-flow hijack attack within a single clock
cycle. The approach is demonstrated using a hardware prototype based on a LEON3
processor running a Linux operating system. The monitoring system neither degrades
processor performance nor requires processor software or hardware modifications.
As FPGAs have grown in logic capacity, their range of application domains has
expanded beyond embedded systems to include cloud computing. This growth has
led to scenarios in which circuits from multiple designers are deployed in an FPGA
at the same time. FPGA multi-tenancy introduces unique security challenges that
must be addressed. Co-located FPGA users share device resources at the physical
level including wiring and the power distribution network (PDN). These limitations
make complete user-level isolation impossible for current commercial FPGA devices.
The second part of this dissertation focuses on two important classes of attacks
that are based on this shared use of the FPGA resources: crosstalk-based information
leakage attacks and on-chip voltage attacks. Crosstalk coupling that exists between
long wires in an FPGA routing channel can be used by an adversary to steal secret
information from an unsuspecting FPGA co-tenant. Similarly, a malicious tenant can
deliberately cause voltage fluctuations in the FPGA PDN in an attempt to induce
timing faults in a neighboring circuit. In both cases, the attacks require no physical
access to the device and can be performed remotely. The work fully characterizes the
threats and demonstrates strategies that can be used to protect multi-tenant users
from potential attacks.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES

CHAPTER

1. INTRODUCTION
    1.1 Security Challenges in Embedded Systems
    1.2 Security Challenges in Multi-Tenant FPGAs
        1.2.1 Crosstalk Leakage Attacks
        1.2.2 On-chip Voltage Attacks
    1.3 Thesis Organization
    1.4 Published Results

2. BACKGROUND AND RELATED WORK
    2.1 Hardware-based Processor Monitoring
        2.1.1 Basic Operation of Per-Instruction Hardware Monitors
        2.1.2 Extraction of the Monitoring Graph
        2.1.3 Hardware Monitoring Related Work
    2.2 FPGA Crosstalk Leakage Attacks
    2.3 On-FPGA Voltage Attacks
        2.3.1 Multi-tenant FPGA Threat Model
        2.3.2 FPGA Voltage Sensing
        2.3.3 FPGA Voltage Attack Response
        2.3.4 FPGA Voltage-Induced Faults
        2.3.5 FPGA Voltage Attacks on Encryption Cores
        2.3.6 FPGA Voltage Attack Remediation

3. HARDWARE-BASED MONITORING FOR PROTECTING SYSTEM CALLS
    3.1 Monitoring Linux System Calls
    3.2 Architecture of the Monitoring System
        3.2.1 Basic Monitor Operation
        3.2.2 Enabling and Disabling Monitoring
        3.2.3 Handling Nested System Calls
    3.3 Prototype System Implementation
        3.3.1 Attack Scenario
    3.4 Experimental Results
        3.4.1 Attack Detection and Recovery
        3.4.2 Monitoring System Overhead
    3.5 Conclusion

4. CHARACTERIZATION OF FPGA LONG WIRE DATA LEAKAGE
    4.1 Methodology
        4.1.1 Metric
    4.2 Recovering Channel Layout Through Measurement
    4.3 Characterization
        4.3.1 Susceptibility of Each Wire to Leakage
        4.3.2 Comparing Different Long Wire Types
        4.3.3 Technology Comparison
    4.4 Conclusion

5. CHARACTERIZATION OF FPGA PDN ATTACKS
    5.1 Methodology and Calibration
        5.1.1 Characterized FPGA PDNs
        5.1.2 On-die Voltage Sensors
        5.1.3 Voltage Calibration
        5.1.4 Minimizing Temperature Effects
    5.2 On-chip FPGA PDN Attack
        5.2.1 Adversarial Power Consumption Circuit
        5.2.2 Physical Characterization of Voltage Drop
            5.2.2.1 Varying the Amount of Power Consumed
            5.2.2.2 Role of the Inductor in Undershoot
    5.3 Summary

6. CAUSING FAULTS VIA PDN MANIPULATION
    6.1 Demonstration of Path Delay Faults
    6.2 Relating Voltage and Timing Slack to Fault Sensitivity
    6.3 Relationship to FPGA Logic Isolation and Active Fencing
    6.4 Summary

7. EXPLOITING VOLTAGE DROPS FOR SECURITY BREACHES
    7.1 RSA Cryptosystem Background
    7.2 Hardware Implementation
    7.3 Fault Injection Attack
    7.4 Conclusion

8. POWER WASTERS FOR CLOUD FPGA ATTACKS
    8.1 Power Wasting Logic Circuits
        8.1.1 RO- and Shift Register-based Power Wasters
        8.1.2 Exploiting Glitch Power
        8.1.3 AES-based Power Waster
    8.2 Evaluation of Power Wasting Circuits
        8.2.1 Evaluation Methodology
        8.2.2 Power Waster Comparison
        8.2.3 Fault Generation with AES-based Power Wasters
    8.3 Summary

9. MONITORING SYSTEM FOR PDN ATTACKS
    9.1 Localizing Voltage Droops
        9.1.1 Monitor Network
        9.1.2 Attack Attribution
        9.1.3 Stratix 10 Evaluation using an AES Power Waster
            9.1.3.1 Minimizing the Number of Sensors
    9.2 On-chip Monitoring and Attack Throttling
        9.2.1 System Infrastructure
            9.2.1.1 Clock Regions and Sensors
            9.2.1.2 Threshold Calibration
            9.2.1.3 Attack Detection and Remediation
        9.2.2 Preventing Board Crashes
        9.2.3 Using the ARM-HPS for Processing Sensor Data
        9.2.4 Limitations
    9.3 Summary

10. CONCLUSION AND FUTURE WORK
    10.1 Summary of Contributions
    10.2 Future Directions
        10.2.1 Microprocessor Monitoring
        10.2.2 Multi-tenant FPGA Security

BIBLIOGRAPHY

LIST OF TABLES

2.1 Related Work on Hardware Monitoring.

3.1 Monitoring graph sizes for four Linux system calls

3.2 Resource utilization of the hardware monitor and LEON3 processor

4.1 Correspondence between indices of Figure 4.3 and physical resources on the target Cyclone IV device.

5.1 Power consumed by each RO-based power waster instance.

7.1 Resources used in RSA core and corresponding reported Fmax for the three supported key lengths in the Cyclone V device.

8.1 Resources used in AES-based waster and corresponding reported Fmax in Cyclone V, Arria 10, and Stratix 10 devices.

8.2 Power increase per BLE for the five power wasting designs shown in Figures 8.1 and 8.2, in Cyclone V, Arria 10, and Stratix 10 devices.

9.1 Resources used in voltage monitoring network for various numbers of sensors for the three selected devices (Cyclone V SE (5CSEMA5F31C6), Arria 10 GX (10AX115N2F45E1SG), and Stratix 10 SX (1SX280HU2F50E1VG)).

9.2 Attack scenarios used for evaluating the monitor network with the RO-based power waster.

9.3 Absolute error expressed in LABs of the Euclidean distance between the mean of the 100 location predictions of a given sensor subset and the mean of the 100 location predictions obtained when all 218 sensors are used (see the bottom, rightmost subplot of Figure 9.5).

9.4 FPGA resources used for the NIOS II-based on-chip monitoring system.

9.5 FPGA resources used for the ARM-HPS-based on-chip monitoring system.

LIST OF FIGURES

2.1 A simplified block diagram showing an embedded processing system placed alongside a hardware monitor.

2.2 During the offline analysis, the binary code of the application to be monitored is analyzed in order to derive the monitoring graph.

2.3 Ring oscillator (receiver) adjacent to a victim wire (transmitter) mapped to a Cyclone IV FPGA device.

2.4 Ring oscillator count values when the victim transmits logic 1s or 0s in an Arria 10 GX (10AX115N2F45E1SG) FPGA device.

3.1 System architecture for a hardware monitor that supports selective system call monitoring.

3.2 Console output showing that the attack script changes the test account privilege from a normal user to root.

3.3 Waveforms showing normal execution of the system call (top) and detection of the attack (bottom).

4.1 Experimental framework for evaluating long wire delay effects on SRAM FPGAs.

4.2 Figure shows measured receiver frequency for the same transmitter and receiver wires when measured with two different length ring oscillators.

4.3 Figure shows the measured value of ∆t per C4 wire segment for all pairs of wires in a C4 channel on a Cyclone IV device.

4.4 The distribution of observed ∆t for each wire in the channel when the wire is used as a receiver and its neighbor is used as a transmitter.

4.5 Measured values of ∆t versus length of wire for three different devices.

4.6 Values of ∆t/LAB observed using two different wire types on three different devices.

4.7 Values of ∆t/LAB when normalized to a wire length that matches the height of a LAB on a 60 nm technology Cyclone IV device.

5.1 On-chip FPGA power system. A voltage drop occurs across the inductor due to di/dt. A steady-state voltage drop occurs in the PDN due to its resistance.

5.2 Schematic of the RO-based voltage sensor.

5.3 Figures show the experimentally derived Cyclone V and Arria 10 calibration curves, which relate frequency changes to the supply voltage values that account for them. The frequency of a sensor is inversely proportional to the propagation delay of the oscillating signal.

5.4 Power waster circuit mapped to a Cyclone V / Arria 10 / Stratix 10 adaptive logic module (ALM) device.

5.5 Normalized RO sensor counts (left axis) and their corresponding voltages (right axis) measured by sensors before and during a power wasting attack that begins at time 0. The legend shows the distance between each sensor and the center of the power wasting region.

5.6 Voltage change across distance for various numbers of power wasters instantiated in the Cyclone V device.

5.7 Voltage drop from the inductor in DE1-SoC (Cyclone V), measured at test pad VCC1P1.

5.8 Voltage versus sensor position on the Cyclone V chip.

6.1 Delay faults on adder circuits placed outside the wasting area when the adversary at time 0 turns on 12,000 and 28,160 power wasters in Cyclone V and Arria 10 devices, respectively. X-coordinate denotes the time the fault occurred during the attack. Y-coordinate is the reported timing slack of the exercised path.

6.2 Turning on power waster circuit causes a large instantaneous change in current. The instantaneous change causes a voltage drop on the off-chip inductor which affects all parts of the chip.

6.3 Examining timing faults at different distances between the adder and center of attack in Cyclone V and Arria 10 devices.

6.4 Scatter plot shows which randomly generated attack scenarios caused faults and which did not. X-coordinate denotes voltage in victim circuit during attack. Y-coordinate is the reported timing slack of path exercised during attack.

7.1 Stack plot shows the probability of different outcomes when attacking RSA using various numbers of wasters instantiated in the Cyclone V device. Successfully extracting the RSA private key constitutes the blue part of the plot. Unwantedly resetting the board due to the attack constitutes the orange part of the plot.

8.1 Designs in (a), (b), (c), and (d) show the three RO-based wasters used to dissipate dynamic power in Cyclone V / Arria 10 / Stratix 10 devices. Design (e) shows the shift register-based waster.

8.2 Design in (a) shows our unrolled waster based on glitching that uses copies of AES encryption rounds. (b) shows the structure of a standard 128-bit AES round used in our design.

8.3 Power consumption while increasing the number of chained 128-bit AES rounds in Cyclone V, Arria 10, and Stratix 10 devices.

8.4 Causing delay faults on adder circuits placed outside the wasting area when the adversary at time 0 turns on 20 and 58 128-bit unrolled, chained AES rounds clocked at 50 MHz in Cyclone V and Arria 10 devices, respectively. X-coordinate denotes the time the fault occurred during the attack. Y-coordinate is the reported timing slack of the exercised path.

9.1 Map of voltage contours on chip during power attacks, reconstructed from sensor data. Purple rectangle denotes location of the attacker’s power waster circuits. Orange/red rectangles are the sensors.

9.2 Marks represent predicted center of attack based on a randomly selected subset of sensors. Each subplot contains 100 points.

9.3 Activation of the 95-round AES-based waster in Stratix 10.

9.4 Voltage change across distance for the RO- and AES-based wasters.

9.5 Locating the attacker with the minimum number of sensors required. Black marks represent the predicted center of attack area based on a randomly selected subset out of the total 218 sensors. Each subplot contains 100 points. Note that although the predictions converge to a specific location when all the 218 sensors are used (lower-right corner of figure) the center of the disruption and the center of the attacker area may not coincide.

9.6 On-chip monitoring system overview.

9.7 The figure illustrates a partitioning of an area spanning 65,536 LABs (256 rows by 256 columns) into four clock regions that can be used by multi-tenant applications. Each region contains nine RO-based sensors. Their relative locations on the device are indicated by colored rhombuses. The four regions span roughly 60% of the FPGA logic area.

9.8 Intel SignalTap waveforms capturing the start and end of a voltage attack attempting to crash the FPGA board using the 95-round AES-based waster and prevented by the on-chip monitoring system.

CHAPTER 1
INTRODUCTION

Embedded systems that include microprocessors are currently deployed to address
a broad range of computing needs. As their deployment profile has become larger, new
security challenges have emerged. The limited physical resources of typical embedded
systems make their protection a particularly challenging task [68, 69]. For example,
ensuring that executing software has not been altered is difficult since powerful virus
scanning software often cannot be used. The alteration of program control flow
by an attacker can lead to a system crash or allow a user to gain control of an
embedded system. In the first portion of this dissertation, a monitoring-based defense
mechanism to protect modern embedded microprocessors and operating systems is
presented. The solution uses hardware to validate correct program execution at runtime. No software or hardware modifications to the microprocessor in the embedded
system are needed.
FPGAs are now widely used in a broad range of embedded and cloud computing
environments for network functions [88], data search [13], and video processing [6].
While FPGA logic designs have traditionally been created by a single team of designers for dedicated single user deployment, contemporary FPGA logic design is
considerably more complex. Embedded FPGA designs often contain multiple intellectual property (IP) cores created by a variety of vendors [46]. In cloud FPGA
deployments, users share the FPGA substrate with support circuitry created by a potentially untrusted cloud vendor [4,6]. Although current cloud vendors typically limit
FPGA usage to a single client at a time, the size and cost of FPGAs invite simultaneous device sharing across multiple untrusting cloud users to achieve economies of
scale [54]. These three multi-tenant use cases pose a serious challenge to the security
of current FPGA devices. The second portion of this dissertation addresses security
for multi-tenant FPGAs.
Chapter 3 describes a defense mechanism for secure embedded operating systems.
Chapters 4 through 9 focus on security challenges for FPGA multi-tenancy and potential countermeasures.

1.1

Security Challenges in Embedded Systems

The Internet of Things (IoT) represents the convergence of cyber-physical systems,
which control physical processes, and the Internet, which provides global interconnectivity for access to data systems. Embedded systems are at the core of any IoT
solution as they provide the necessary computational power at the location where devices interact with the physical world. Due to their deployment in the environment,
these embedded systems are typically constrained in their computational resources
(performance and/or energy) but still connected to a network to interact with the
other components of the IoT solution. This type of networked embedded system experiences a particularly challenging problem when it comes to security, specifically,
protection from attacks on the embedded operating system. The network connectivity
provides an attack vector to the system and the system performance or energy resources are insufficient to run conventional defense mechanisms, such as virus scanners
and malware detection software, that provide protection on conventional computers.
An effective defense mechanism developed in related work [50, 66] is the hardware monitor.
Hardware monitors are logic components that are co-located with
the embedded system processor core and track the execution of software. Hardware
monitors require no change or addition to the software that is run on the processing
system. Instead, such monitors verify that a processor executes a piece of software
faithfully by comparing two pieces of information: 1) processing steps reported by the
processor at runtime and 2) a model of what is considered correct execution of the
software that is to be executed. Attacks that hijack the processor inherently cause
the processing to deviate from the model of the original software and thus can be
detected.
Although hardware-based processor monitoring is a proven-effective approach [1,
9, 28, 50], the existing solutions are unsuitable for protecting state-of-the-art systems
containing millions of instructions. Fine-grained monitoring becomes a challenge as
the hardware overhead of a solution grows proportionally to the code size of the
underlying application. Consequently, monitors have only been used for protecting
applications running bare-metal on the processor [83] or highly constrained, simplistic real-time operating systems [58, 81]. Modern embedded applications, however,
are typically developed on advanced platforms capable of satisfying their need for
sophisticated services (e.g., multitasking/multithreading, process and memory management, etc.).
This dissertation advances earlier work by introducing a hardware monitoring
system that works with Linux, a common, widely-used operating system, whereas previous efforts have either looked at specific applications running directly on the processor [83] or highly constrained, simplistic embedded operating systems [58, 81]. In
addition, we show that our system defends against real, practical attacks (in our case
the CVE-2013-1828 vulnerability, which has a known exploit), whereas previous work
has shown defenses against attacks exploiting synthetically crafted vulnerabilities.
The main idea of our work is to focus the monitoring system on the portions of the
operating system that are particularly vulnerable. Since many vulnerabilities and associated exploits occur in the context of system calls, we have designed our hardware
monitor to track their processor operations at a very fine granularity. By verifying
operation at the level of an individual processor instruction, we can detect any deviation
(i.e., attack) almost instantaneously. By limiting the monitoring to a fraction
of the operating system code (i.e., system calls) and not the entire code base, we
can achieve low overhead compared to other hardware monitoring approaches. This
combination of sensitivity to attacks on vulnerable code and low hardware overhead
(and no modification to any software) provides a promising approach to protecting
embedded systems in the IoT domain and elsewhere.

1.2 Security Challenges in Multi-Tenant FPGAs

Over the past decade, cloud computing has grown to provide computation resources and services on demand for the masses over the Internet. The worldwide
cloud computing market is expected to grow by 17% in 2020 and is projected to
reach $354.6 billion in revenue in 2022 [20]. The flexibility and energy efficiency of
FPGAs have motivated cloud providers to employ them as a general-purpose computing resource in the cloud. Microsoft Catapult [65] and Azure [55], Amazon EC2 F1
instances [6], and Alibaba Cloud [4] are a few examples of FPGA-based cloud computing platforms. Although cloud FPGAs are generally dedicated to a single user
at a time, the next logical step for cloud providers is simultaneous FPGA substrate
sharing by multiple users [35, 86]. While FPGA multi-tenancy provides a mechanism for maximizing the utilization of FPGA resources, it presents unique security
challenges and a diverse attack surface.

1.2.1 Crosstalk Leakage Attacks

In this dissertation, we focus on two classes of attacks targeting multi-tenant
FPGAs: crosstalk leakage and on-chip voltage attacks. Recent work with multi-tenant
FPGAs [21, 67] has shown crosstalk coupling between neighboring long wires can
be used to extract sensitive information from the wiring of a victim’s circuit. In
this scenario, the transmitter wire (victim) is part of the co-located circuit of the
unsuspecting tenant. The receiver wire (attacker) is part of the circuit the malicious
tenant uses to sense the transmitted logic values of the victim wire. A binary counter
triggered by a ring oscillator (RO) is employed to sense the transmitted logic values.
The logic value the victim wire transmits can affect the period of the RO and thus
is reflected in the values of the binary counter. The effect is shown to be consistent
across a variety of transmitter clock frequencies and device locations. Additionally,
the attack can be performed remotely requiring no physical access to the FPGA device
or prior knowledge of the placement location of the victim circuit. In Chapter 4, this
issue is fully explored.
In Chapter 4 we also describe the use of crosstalk as a guide in deriving the
adjacency map of an FPGA routing channel which often is not publicly disclosed by
the FPGA vendor. This information can help designers protect sensitive signals by
devising design isolation techniques or not allowing the shared use of a single channel.
We also examine how crosstalk effects change across different technology nodes. The
findings of this dissertation show that leakage exists and is significant across all the
FPGAs tested. The list below summarizes the most important contributions of the
dissertation regarding crosstalk leakage attacks.
1. A new method for accurately quantifying the amount of crosstalk coupling that
exists between neighboring long wires in SRAM-based FPGA routing channels.
2. A robust method for determining routing adjacency for FPGA channel wires
using crosstalk effects as a guide.
3. Characterization of the crosstalk effect in different types of FPGA routing channel long wires across three different Intel FPGA families (Cyclone IV, Stratix
V, and Arria 10).

1.2.2 On-chip Voltage Attacks

Previous research has shown that FPGA supply voltage manipulation can cause
circuit timing faults [3, 39, 48, 59], and device reset [26]. Early work showed that
overaggressive manipulation of the power supply for FPGA dynamic voltage scaling
leads to delay faults [2, 15]. In the multi-tenant FPGA case, a malicious tenant may
spontaneously cause the FPGA supply voltage to drop in an attempt to induce delay
faults in another tenant’s circuit. In many multi-tenant scenarios, the attacker does
not need physical access to the FPGA to perform this type of attack due to FPGA
network access, enhancing the threat [34]. In addition, all current commercial FPGAs
contain a single power distribution network (PDN) for each supply voltage making
on-chip supply voltage isolation impossible. In this work, we characterize the threat
posed by on-chip voltage attacks in multi-tenant FPGAs and examine a low-overhead
approach to detect such attacks.
Unlike multicore microprocessors and graphics processors, FPGAs allow users to
craft a broad range of computing circuits with arbitrary functionality. Although
many circuits that deliberately waste power can be detected via netlist or bitstream
scans [25], the spectrum of power wasters is continually evolving, not unlike software
viruses that attack personal computers. In this dissertation, we describe and evaluate
a new power waster circuit that is difficult to detect via scanning since it appears
similar to standard design logic. This circuit is easy to implement and consumes
significant power by deliberately propagating signal glitches through circuitry with
high fanout and minimal logic masking. We show that its deployment on a Stratix
10 device can lead to board reset.
Recent research [25,59,91,93] has highlighted the importance of developing active
defense strategies to mitigate voltage attacks in FPGA devices. These approaches
include off-chip tracking of FPGA power consumption and on-chip voltage monitoring
[59, 93]. Several techniques that use on-chip sensors to diagnose aggressive on-FPGA
power consumption behaviors have been examined [25, 59]. In this dissertation, we
advance the application of such sensor-based solutions by using voltage values from
a low-overhead, integrated on-chip sensing system to enable the real-time detection
and mitigation of on-chip voltage attacks. The contributions of the dissertation to
on-chip voltage attacks can be summarized as follows:
1. We explore the on-chip voltage response to power wasters, circuits that deliberately waste power, at locations across the die. These voltage responses over time
are compared with simultaneous off-chip voltage measurements for three Intel
FPGA families. We characterize the voltage responses based on the distance
from, and power consumption of, the power waster circuits of the attacker.
2. We evaluate the ability of power wasting circuits located in one part of the
die to induce timing faults in user circuits situated at locations across the die.
Faults in paths with a range of slack values are considered.
3. We show that power wasting circuits using a small amount of logic (e.g., thousands of logic blocks) can be used to extract the key from an RSA crypto circuit.
Unlike previous approaches, our attack does not require any modifications to
the encryption core, nor power wasting that is synchronized with the execution
of specific rounds of the encryption operation.
4. We examine a new stealthy power wasting circuit that can be easily implemented
in FPGA logic. The design includes standard synchronous logic clocked by a
global clock, making it difficult to identify by compile-time scanners looking for
malicious design circuitry.
5. We examine the use of a network of small on-chip voltage sensors to identify the
location of an attack on the FPGA die. Tradeoffs in accuracy versus numbers
of sensors are evaluated.
6. Finally, we propose and develop a practical system to mitigate voltage attacks
in real time by deactivating clock signals in an FPGA logic region suspected
by our sensing system of containing malicious circuitry. We demonstrate that
our sensing and mitigation system can successfully defend a Stratix 10 against
a board reset attack using power wasters.
Our approaches are evaluated in FPGA hardware under typical operating conditions. DE1-SoC [77], DE5a-Net [78], and DE10-Pro [79] boards, containing Intel
Cyclone V, Arria 10, and Stratix 10 SX FPGAs, respectively, are used to evaluate the
effects. To characterize the voltage effects in the PDN from the activation of power
wasters, a series of experiments were performed using portions of available on-chip
logic. Our experiments show that voltage drops caused by inductance (L(di/dt)) can
be used to create fault attacks in the Cyclone V and Arria 10 devices that can even
target tenants located far from the power wasting area. These attacks are shown
to straightforwardly allow the determination of an RSA encryption key. To address
the possibility of power wasting attacks by adversaries, we introduce a monitoring
approach using FPGA logic to identify attackers attempting to deploy power wasting
circuits.

1.3 Thesis Organization

The remainder of the document is organized as follows:
The background for the topics discussed in this dissertation and a review of relevant literature are provided in Chapter 2. In Chapter 3, a hardware-based processor
monitoring approach for protecting embedded operating systems is presented. The
monitoring approach is demonstrated in hardware using a LEON3 processor prototyped on an Intel Stratix IV FPGA. The effectiveness of the system is evaluated using
a publicly available software attack that exploits a known vulnerability of the Linux
system-call interface. The evaluation shows that the proposed monitoring system
prevents the attack while its presence does not interfere with processor operation.
In Chapter 4, we investigate the distinct characteristics of signal crosstalk. The
chapter introduces a mechanism to characterize the crosstalk coupling that exists
between neighboring FPGA long wires at the femtosecond scale. Then, it shows that
it is possible to reverse engineer channel layouts by determining which pairs of routing
resources/links in the channel exhibit coupling to each other even if this information
is not provided by the FPGA vendor. To fully characterize these effects, long wire
coupling on different types of wires across three FPGA devices (Cyclone IV, Stratix
V, and Arria 10) implemented in CMOS technology nodes ranging from 65 nm to
20 nm is examined. Through experimentation, it is demonstrated that information
leakage is apparent for all three FPGA families.
Chapters 5 through 9 focus on on-FPGA voltage attacks. In Chapter 5, techniques
for supply voltage manipulation are investigated. Chapter 6 investigates the ability
of the aggressive power consumption of one tenant’s FPGA circuit to cause delay
faults in another tenant’s application on the same FPGA. To illustrate the risks
involved, in Chapter 7, an RSA encryption key extraction attack is performed by
introducing delay faults in hardware via voltage manipulations. This attack does not
require modification to the encryption core nor require attack activation synchronized
with specific encryption operations.
In Chapter 8, a variety of circuit power wasting techniques that typically are
not flagged by design rule checks imposed by FPGA cloud computing vendors is
evaluated. In addition, a new power waster design is proposed that is based on a multistage circuit and performs standard logic operations. The new waster induces delay
faults in co-located circuits. The efficiency of five power wasting circuits, including
the proposed design, is evaluated in terms of power consumed per logic resource.
In Chapter 9, strategies to identify power manipulation using low-cost monitoring
circuits that can locate the source of an attack are highlighted. An on-chip monitoring
approach that can be used to mitigate the attack in real-time by suppressing the clock
signal for the target region is proposed. The approach is fast enough to prevent an
attack that attempts to cause a board reset. Chapter 10 concludes the dissertation by
summarizing the contributions of the work and offers directions for future research.

1.4 Published Results

The research work and results presented in this dissertation have been published
or submitted for publication in the following peer-reviewed conference and journal
articles:
1. Provelengios, George, Pouraghily, Arman, Tessier, Russell, and Wolf, Tilman.
A hardware monitor to protect Linux system calls. In IEEE Computer Society
Annual Symposium on VLSI (ISVLSI) (2018), pp. 551–556. [63]
2. Provelengios, George, Ramesh, Chethan, Patil, Shivukumar B, Eguro, Ken,
Tessier, Russell, and Holcomb, Daniel. Characterization of long wire data leakage in deep submicron FPGAs. In ACM/SIGDA International Symposium on
Field Programmable Gate Arrays (FPGA) (2019), pp. 292–297. [64]
3. Provelengios, George, Holcomb, Daniel, and Tessier, Russell. Characterizing
power distribution attacks in multi-user FPGA environments. In International Conference on Field Programmable Logic and Applications (FPL) (2019),
pp. 194–201. Best Paper Award. [59]
4. Provelengios, George, Holcomb, Daniel, and Tessier, Russell. Power wasting
circuits for cloud FPGA attacks. In International Conference on Field Programmable Logic and Applications (FPL) (2020), pp. 231–235. [62]
5. Provelengios, George, Holcomb, Daniel, and Tessier, Russell. Power distribution attacks in multitenant FPGAs. IEEE Transactions on Very Large Scale
Integration Systems (TVLSI) 28, 12 (2020), 2685–2698. [61]
6. Provelengios, George, Holcomb, Daniel, and Tessier, Russell. Mitigating voltage
attacks in multi-tenant FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS) (2020). (under review). [60]

CHAPTER 2
BACKGROUND AND RELATED WORK

This chapter provides background and related work for the topics covered in this
dissertation. Section 2.1 introduces a hardware-based monitoring approach and discusses how our work relates to other efforts to protect embedded systems. Section 2.2
provides perspectives on FPGA crosstalk attacks and the characterization performed
in this dissertation. Section 2.3 provides background on FPGA multi-tenancy, sensors, previous voltage attacks, and remediation.

2.1 Hardware-based Processor Monitoring

2.1.1 Basic Operation of Per-Instruction Hardware Monitors

A per-instruction hardware monitor checks and verifies the execution of software
on a microprocessor. A golden model of the processing steps representing correct
program execution is used to verify each step of the computation. After a deviation from expected control flow is detected, the hardware monitor can interrupt the
microprocessor, thus allowing for execution flow recovery.
Figure 2.1 illustrates the basic components of a hardware monitoring system (residing in the light blue shaded area) in an embedded system. Hardware monitoring
consists of two primary phases. First, the binary code of an application is analyzed
and a monitoring graph is derived, shown in the Offline analysis phase in Figure 2.1.
In the second phase, Runtime operation, the binary code of the application is loaded
into the instruction memory of the processor and the monitoring graph is loaded into
the graph memory (Mon. memory component in the figure) of the hardware monitor.

Figure 2.1. A simplified block diagram showing an embedded processing system
placed alongside a hardware monitor.

After the binary code and monitoring graph have been loaded, the processor starts
code execution. Each execution step is reported to the monitoring system (see the
Execution report signal in Figure 2.1). The monitor checks each report against the monitoring graph to verify that the processor
is executing a valid instruction (see Comparison logic component in the figure). The
detection of an invalid instruction implies that the execution of the program has been
altered and its execution should be stopped. One way to deal with this case is to
force the processor to a safe state (Reset/recover signal in Figure 2.1).
The main challenge in the design of a hardware monitoring system is how to store
the monitoring graph in the memory in a way that the information about the set
of possible execution paths can be retrieved without slowing down the operation of
the processor. Section 2.1.2 discusses a design technique that allows the monitoring
system to meet this design requirement.

2.1.2 Extraction of the Monitoring Graph

The hardware monitor must examine each instruction committed by the processor
at each step without slowing down the speed of processor execution. To this end, the
comparison logic must check the validity of a committed instruction within a single
processor clock cycle. Chandrikakutty et al. [14] proposed a monitoring strategy that
requires a single read memory operation to verify a committed instruction. The monitoring graph is realized as a state machine where each state represents a monitored
code instruction. Each state can have zero, one or more outgoing transition edges
and each edge represents a possible valid next instruction.¹ Each edge is labeled with
a value identifying the next valid instruction that can be executed after the current
instruction. This four-bit label is the hash value of the target instruction. The use
of hash values saves significant monitoring graph memory over saving whole 32-bit
instructions. This storage reduction comes at the cost of possible hash collisions.
Figure 2.2 shows an example of how processor instructions are translated into
states in the monitoring graph. Based on the memory layout [14], each entry of the
graph memory holds two pieces of information: 1) a tuple with all the hash values
of the next valid states and 2) information showing the location (e.g., address) of
next states in graph memory. Values can be retrieved using a single read memory
operation. Whenever the processor commits a new instruction, the hardware monitor
uses a hash function to compute its four-bit hash value and compares it against the
hash values for the valid next states. If there is a match, the hardware monitor uses
the address information to find the next entry in the graph memory.
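The lookup-and-compare step described above can be sketched in software. This is a minimal illustration under stated assumptions, not the memory layout of [14]: the graph is held in a Python dictionary keyed by instruction address, `hash4` (an XOR fold of the 32-bit word into four bits) is an assumed stand-in for the hardware hash function, and the instruction words are made-up placeholders rather than real opcodes.

```python
from dataclasses import dataclass


def hash4(instr: int) -> int:
    """Assumed 4-bit hash: XOR-fold the eight nibbles of a 32-bit word."""
    h = 0
    for shift in range(0, 32, 4):
        h ^= (instr >> shift) & 0xF
    return h


@dataclass(frozen=True)
class Entry:
    """One graph-memory entry, fetched with a single read."""
    next_hashes: tuple   # 4-bit hash of each valid next instruction
    next_states: tuple   # graph-memory location of each next state


def build_graph(program: dict, successors: dict) -> dict:
    """Offline analysis: derive the monitoring graph from the code."""
    graph = {}
    for addr, succ in successors.items():
        graph[addr] = Entry(
            next_hashes=tuple(hash4(program[s]) for s in succ),
            next_states=tuple(succ),
        )
    return graph


def monitor(graph: dict, entry_state: int, committed: list) -> bool:
    """Runtime operation: True if every committed instruction is valid."""
    state = entry_state
    for instr in committed:
        entry = graph[state]              # one graph-memory read per step
        h = hash4(instr)
        if h not in entry.next_hashes:
            return False                  # deviation: assert reset/recover
        # On a hash collision between successors, the first match is taken.
        state = entry.next_states[entry.next_hashes.index(h)]
    return True


# Hypothetical three-instruction program (placeholder words, not real opcodes).
program = {0x4A0: 0x00000001, 0x4A4: 0x00000007, 0x4A8: 0x0000000A}
successors = {0x4A0: (0x4A4,), 0x4A4: (0x4A8, 0x4A0), 0x4A8: ()}
graph = build_graph(program, successors)

ok = monitor(graph, 0x4A0, [program[0x4A4], program[0x4A8]])
attacked = monitor(graph, 0x4A0, [0x00000003])  # hash 3 where 7 is expected
```

Here `ok` is true because both committed instructions match an outgoing edge, while `attacked` is false because the injected word hashes to 3 and the only edge leaving state 0x4A0 is labeled 7. A state with no outgoing edges (0x4A8) plays the role of the exit point of the graph.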
¹ For instance, a control-flow instruction has more than one possible next instruction,
whereas an arithmetic operation can have only one. Zero outgoing edges is a special case that can
be used as the exit point of the graph.

Figure 2.2. During the offline analysis, the binary code of the application to be
monitored is analyzed in order to derive the monitoring graph.

These operations are considered in the following example. When the processor
executes the instruction at address 0x4a0, the hardware monitor fetches a value for
state 0x4a0, which has only one outgoing edge labeled with a hash value of 7, from
the graph memory. In the next clock cycle, if the processor commits the instruction
at address 0x4a4 with a hash value of 3, a deviation from the expected hash value
of 7 is detected and the processor can be alerted to the potential attack. The use
of four-bit hashing can lead to hash collisions. Mao et al. [50] showed that even if a
single malicious instruction is undetected, the probability of a sequence of malicious
instructions having matching hash values decreases geometrically.
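To make the geometric decrease concrete, consider an idealized model that is an assumption here rather than the analysis of [50]: each state has a single valid successor, and the four-bit hash of an attacker-supplied instruction is uniformly distributed over the 16 possible values and independent from one instruction to the next. The probability that k consecutive malicious instructions all go undetected is then

```latex
\[
P_{\text{escape}}(k) = \left(\frac{1}{16}\right)^{k} = 2^{-4k}
\]
```

Under these assumptions a single malicious instruction escapes with probability 1/16 (about 6.3%), whereas a sequence of four escapes with probability 2^{-16} (about 1.5 x 10^{-5}). States with several outgoing edges raise the per-step escape probability, but the decay with sequence length remains geometric.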

2.1.3

Hardware Monitoring Related Work

The monitoring of correct program execution has been proposed in various forms,
such as the verification of control-flow integrity (CFI) [1]. These software techniques
may slow down program execution and do not validate individual processor instructions. Hardware monitoring reduces the performance impact of monitoring. The
seminal work by Arora et al. described a fine-grained hardware monitoring system
that verifies correct execution at the granularity level of a basic block [9]. This work
was advanced by Mao et al. in verifying individual processor instructions and the
resulting ability to stop attacks within one processor clock cycle instead of having to
wait until the basic block has ended [50]. Recent work by Pouraghily et al. further
expanded the previous work to not only monitor monolithic applications, but the
underlying operating system [58].

Table 2.1. Related Work on Hardware Monitoring.

              Abadi               Arora               Mao                 Pouraghily          This work
              et al. [1]          et al. [9]          et al. [50]         et al. [58]
Verification  control flow        all processor       all processor       all processor       all processor
              operations          instructions        instructions        instructions        instructions
Granularity                       basic block         single processor    single processor    single processor
                                                      instruction         instruction         instruction
Target        application / OS    monolithic          monolithic          simplistic OS       Linux OS
                                  application         application
Coverage      application / OS    entire application  entire application  entire OS           system calls
Overhead      software            high hardware cost  high hardware cost  high hardware cost  low hardware cost

Chapter 3 also focuses on operating system monitoring. Unlike [58], the widely used Linux operating system is monitored, not a simple embedded operating system.
The large code size of the Linux kernel makes previous approaches to monitoring
impractical due to their large overhead. In our work, we focus the monitoring effort
on the portions of the code that are particularly vulnerable to attacks: system calls.
Thus, we effectively can detect a number of different attacks while keeping the monitoring overhead low enough to make such a system practically useful. The progression
of work on hardware monitoring and the context of our contribution are summarized
in Table 2.1.

2.2 FPGA Crosstalk Leakage Attacks

In this section, we describe crosstalk coupling that exists between long wires in
FPGAs. The presence of a communication channel between adjacent FPGA long
wires (“long lines”) has previously been confirmed for both Xilinx [21] and Intel [67]
FPGAs. In both studies it was shown that the logic value carried on a wire changes
the delay of its immediate neighbor in a significant and measurable way. A logic
1 value transmitted by the victim effectively reduces the delay on the adjacent wire
and a logic 0 has the opposite effect. Effectively, the delay change allows a wire
to receive information about its neighbor, potentially allowing the information to be
used in a clandestine attack. Giechaskiel et al. [21] examined the effects of transmitter
switching rate and wire length, among other parameters. Ramesh et al. [67] showed
that crosstalk-based leakage could be used as a side channel to successfully obtain a
128-bit key from an FPGA implementation of the AES block cipher. The wire leakage
is well suited for use in side channel attacks, which are inherently robust to noise and
able to exploit small correlations between the side channel measurements and secret
data.
Both studies noted above relied on the use of a ring oscillator (RO) to receive
information from the victim (transmitter). One RO wire is adjacent to the victim
wire, and the frequency of oscillation is obtained by a binary counter triggered by the
RO. Figure 2.3 illustrates an example of an RO placed next to the victim wire in a
Cyclone IV FPGA device. As shown in Figure 2.4, whether the victim transmits logic
1s or 0s can be directly inferred from the values of the binary counter. The difference
in RO frequency for two trials is determined by using a relative count metric [67]
determined over two measurement periods. The count difference ∆RC when first a
logic 0 (first trial) and then a logic 1 (second trial) are transmitted can be represented
as:

\[
\Delta RC = \frac{C^{1} - C^{0}}{C^{1}} \tag{2.1}
\]

where $C^{1}$ and $C^{0}$ are the measured counts for transmitted logic 1 and 0, respectively.
Although useful, the results of this approach depend on the delay of the entire RO
rather than just the delay of the wire adjacent to the victim. In this work we more
precisely quantify the delay effects caused by wire adjacency to make the characterization of transmitter values clearer.

Figure 2.3. Ring oscillator (receiver) adjacent to a victim wire (transmitter) mapped
to a Cyclone IV FPGA device.

Figure 2.4. Ring oscillator count values when the victim transmits logic 1s or 0s in
an Arria 10 GX (10AX115N2F45E1SG) FPGA device.

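The relative count metric of Equation 2.1 and the way a receiver could turn counter values into logic values can be sketched as follows. The function names, the midpoint threshold, and all count values are illustrative assumptions; real counts depend on the device, the RO length, and the measurement window.

```python
def relative_count(c1: int, c0: int) -> float:
    """Relative count difference of Equation 2.1: (C1 - C0) / C1."""
    if c1 == 0:
        raise ValueError("c1 must be nonzero")
    return (c1 - c0) / c1


def infer_bit(count: int, threshold: float) -> int:
    """Classify one RO count: a logic 1 on the victim wire shortens the
    delay of the adjacent wire, so the RO runs faster and counts higher."""
    return 1 if count > threshold else 0


# Made-up calibration counts for transmitted logic 0 and logic 1.
c0, c1 = 100_000, 100_020
delta_rc = relative_count(c1, c0)

# Assumed decision rule: midpoint between the two calibration counts.
threshold = (c0 + c1) / 2
bits = [infer_bit(c, threshold) for c in (100_019, 100_001, 100_021)]
```

With these placeholder numbers the count difference is 20 out of roughly 100,000, which illustrates why the measurement reflects the delay of the entire RO and not only the wire next to the victim.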
The precise delay characterization of FPGA wires has been explored in several
contexts unrelated to signal adjacency. Yu et al. [87] used ROs to measure the delay
of a number of FPGA resources, including channel wires in isolation. Gojman et
al. [27] employed a path-based approach to consider delays of all channel wires in
an FPGA. Fine-grained FPGA timing measurement using time-to-digital converters
(TDCs) was used by Gnad et al. [24] to assess process variations. None of these studies
considered the differentiation of same-wire delays due to the behavior of surrounding
wires.

2.3 On-FPGA Voltage Attacks

2.3.1 Multi-tenant FPGA Threat Model

We consider the following threat model for attacks on the FPGA PDN. Multiple
independent users can implement and execute circuits in an FPGA at the same time.
Their logic and interconnect resources may be isolated, and each user only has access
to the logic design of their own circuit. There are no physical connections (i.e., wires)
shared by the circuits. The software accessed by the designers which interacts with
the FPGA is secure as is the interface logic provided in the FPGA. Each user has the
flexibility to implement any circuit in their assigned portion of the FPGA.
This multi-tenant threat model arises in a number of user scenarios, as documented
in a recent survey [34]:
Untrusted IP cores: User designs often integrate one or more IP cores from untrusted vendors. These IP cores can contain malicious circuitry in the form of a hardware Trojan. Malicious hardware can put the integrity of the system at risk [11, 75]
by altering its behavior, disabling or bypassing hardware-based security mechanisms,
or physically damaging the device. Although Trojan detection techniques [49] can
be used to identify malicious circuits, in many cases, IP cores are distributed as obfuscated or encrypted bitstreams. IP core network connections enhance this threat
for systems ranging from embedded systems to single-user cloud FPGA deployments
that use IP cores.
Malicious cloud providers: Although unlikely, the possibility of a malicious
cloud vendor exists. Effectively, the cloud vendor support circuitry on the FPGA can
be thought of as an added tenant whose circuitry is not validated by the user.
Malicious co-tenants: Although not currently supported commercially, it is
widely expected that multiple independent users will eventually be able to simultaneously share a cloud FPGA substrate [34, 35]. The ability to commercially use a cloud
FPGA for multiple independent users in the future depends on a full understanding
of the inherent security weaknesses of current FPGA architectures, including those
exposed by the experiments described in this dissertation. Several cloud-based systems
that follow this model have been presented as proofs of concept. Khawaja et al. [35]
proposed the use of an operating system for shared access to a cloud-based FPGA.
The system allows for multiple users to execute circuits at the same time on an FPGA.
The device I/O and memory interfaces are fairly shared across users. The PDN in
the Xilinx or Intel FPGA is also shared in this model. In Knodel et al. [37], the logic
resources in an FPGA located in a cloud node are allocated to interface logic and a
collection of virtual FPGAs (vFPGAs). Resources are managed by tools running on
the node’s microprocessor.
In Chapter 7, we describe an attack on an RSA encryption core that involves fault
injection via on-chip voltage manipulation. Among our three multi-tenant scenarios,
this type of attack could be performed by either a malicious IP core with network
access or in a cloud environment by a malicious co-tenant. An RSA encryption key
is obtained from the erroneously encrypted output of the circuit due to an attack.

2.3.2 FPGA Voltage Sensing

One approach to identifying FPGA voltage attacks is to implement distributed
voltage sensors fashioned from FPGA logic throughout the logic fabric. The ability
to identify voltage levels on an FPGA has many uses ranging from verifying safe
FPGA operation [91, 92] to the extraction of secret information [89]. Contemporary
FPGAs often contain at least one hardened voltage sensor [30] per chip for power
supply voltage measurement. Additional on-chip FPGA voltage measurement circuits
typically are based on either ROs or TDCs. As discussed in Section 2.2, the frequency
of the oscillation can be measured by connecting the RO to a counter. Although RO
frequency is affected by temperature [10], voltage fluctuations have a much stronger
effect [25]. TDC-based sensors are based on a combinational chain of buffers that
are triggered by a clock edge [93]. The output of each buffer is sampled by a clock-triggered flip-flop, and voltage values can be determined by how far a rising edge
propagates through the chain in a clock cycle. Although requiring more resources to
implement effectively, TDCs can be used to measure instantaneous voltage changes
on the order of a clock cycle [82]. Given our interest in voltage changes due to attacks,
we select a network of simpler but highly effective ROs for our monitoring system.
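The two sensor readouts described above can be sketched in a few lines. This is an illustrative model only: the function names, the measurement window, the baseline count, and the 2% tolerance are assumptions, and converting either reading to an absolute voltage would require a per-device calibration that is not modeled here.

```python
def tdc_depth(samples: list) -> int:
    """TDC readout: number of buffers the rising edge passed in one clock
    cycle, taken as the run of leading 1s in the sampled flip-flops."""
    depth = 0
    for bit in samples:
        if bit != 1:
            break
        depth += 1
    return depth


def ro_frequency_hz(count: int, window_s: float) -> float:
    """RO readout: oscillation frequency from a counter value accumulated
    over a measurement window of window_s seconds."""
    return count / window_s


def droop_detected(count: int, baseline: int, tolerance: float) -> bool:
    """Flag a reading whose count fell more than `tolerance` (a fraction)
    below the baseline; a lower supply voltage slows the RO down."""
    return count < baseline * (1.0 - tolerance)


# Placeholder readings: the edge reached four of eight buffers, and the RO
# counted 50,000 cycles in an assumed 100 microsecond window.
depth = tdc_depth([1, 1, 1, 1, 0, 0, 0, 0])
freq = ro_frequency_hz(50_000, 1e-4)
alarm = droop_detected(48_000, baseline=50_000, tolerance=0.02)
quiet = droop_detected(49_500, baseline=50_000, tolerance=0.02)
```

A shallower TDC depth and a lower RO count both indicate slower logic and therefore a lower supply voltage. The RO reading averages over its counting window, whereas the TDC captures a value every clock cycle, which is the resource-versus-resolution trade-off noted in the text.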

2.3.3 FPGA Voltage Attack Response

On-chip FPGA voltage responses to supply voltage manipulations have been previously studied, although no prior study focuses on the specific issues addressed in this dissertation.
Zick et al. [93] described a new TDC-based voltage sensor that can identify on-FPGA
voltage transients in the nanosecond range. A single sensor was used to characterize changes in TDC delay in the presence of significant signal switching. Although
changes in TDC delay over time tracked off-chip voltage measurements taken with
an oscilloscope, on-chip voltage values were not determined, and voltage responses
across the die were not considered. Gnad et al. [24,25] examined the impact of power
waster activation on TDC delay across an FPGA die. TDCs were distributed across
the FPGA surface and average and worst case TDC delays were evaluated over time
for varying workloads. Instead, our approach considers on-chip voltage values for
numerous individual sensors located columns away from the power-wasting source.

2.3.4

FPGA Voltage-Induced Faults

Several studies have examined the ability of on-chip FPGA power wasters to drive
a board into reset or induce delay faults in adjacent circuitry. Gnad et al. [26] showed
that the sudden activation of thousands of ROs could drive boards with Xilinx FPGAs
into reset, requiring a bitstream reload. Although this attack results in a denial of
service, it is not capable of stealthily extracting information from an unsuspecting
circuit.
More recently, several researchers have examined the possibility of injecting delay
faults into neighboring circuits using power wasters. Krautter et al. [39] examined the
possibility of injecting faults into an advanced encryption standard (AES) core at a
number of operating frequencies and circuit minimum slack values. This work did not
examine the ability of a waster to induce faults at distant locations on an FPGA’s die
nor consider the effects on signals with a wide range of slack values. In Mahmoud and
Stojilović [48], a fault-inducing attack on true random number generators (TRNGs)
using ROs was described. The ROs were placed adjacent to TRNGs, and TDCs were
used to evaluate induced delay changes. Recently, Alam et al. [3] showed that allowing
a user to intentionally cause write collisions in FPGA dual-port block RAMs can also
induce voltage and temperature fluctuations and result in circuit faults. Our work
significantly extends previous fault analysis studies by considering a power waster’s
ability to induce faults at numerous locations on the FPGA die for paths with a
spectrum of slack values.

2.3.5

FPGA Voltage Attacks on Encryption Cores

Encryption cores are a popular target for on-chip FPGA side channel or fault
injection attacks. Prior work has shown that a shared FPGA PDN creates coupling
between power wasters and an encryption core. This coupling has been exploited
for side channel attacks [71, 89] in which an encryption key is extracted from an
unsuspecting victim crypto circuit. Both RO [89] and TDC-based [71] voltage sensors
were used successfully for key extraction in simple and correlation power analysis
attacks, respectively. In both cases, the power consumption of the crypto circuit
was tracked on a per-cycle basis to identify specific key values. Schellenberg’s TDC-based attack [71] was successfully replicated on an Amazon EC2 F1 cloud FPGA [23].
Mahmoud et al. [47] inserted a Trojan within the encryption core that is activated by
a voltage drop induced by the power waster. This approach requires Trojan insertion
during core design. Krautter et al. [39] extracted an AES key by enabling power
wasters at specific points in encryption core operation. Our encryption core attack
approach does not require core modification nor carefully-timed activation at a specific
core execution point to work effectively.

2.3.6

FPGA Voltage Attack Remediation

Several studies have examined techniques to identify and suppress significant on-FPGA voltage swings. Shen et al. [72] identify voltage transients caused by user
circuits. A clock edge suppressor is used to delay the circuit clock edge in an effort to
control voltage drops. Krautter et al. [40] identify circuits that are likely to induce on-FPGA voltage drops (e.g., ring oscillators) from FPGA bitstreams. These circuits can
be flagged and removed prior to FPGA bitstream loading in a cloud environment. Zick
et al. [93] proposed using voltage information from multiple voltage sensors to monitor
device health and potentially suppress malicious behavior. Our work extends these
efforts by collating voltage information from numerous on-FPGA voltage sensors to
localize the source of attack circuitry on the FPGA, leading to possible remediations.

CHAPTER 3
HARDWARE-BASED MONITORING FOR
PROTECTING SYSTEM CALLS

Embedded systems are designed with limited processing capabilities and energy resources and as such, they cannot effectively utilize existing software-based protection
mechanisms (e.g., anti-virus software). In addition, the need for network connectivity and remote deployment make these devices particularly vulnerable to numerous
attacks. In prior work, a series of hardware-based monitoring solutions have been
proposed [1, 9, 28, 50, 58, 81, 83] and proved to be successful in defending specific applications and simplistic embedded operating systems from software attacks. These
solutions, however, become unsuitable for protecting state-of-the-art systems containing millions of instructions. In this chapter, we present a novel hardware-based
monitoring technique that can detect if the system calls of a sophisticated embedded
operating system, such as Linux, deviate from the originally programmed behavior
due to an attack.
Section 3.1 discusses how monitoring can be limited to system calls in the operating system to limit monitoring overhead. The principles of our monitoring system
are described in Section 3.2. The design and implementation of our prototype system
are presented in Section 3.3. Experimental results are shown in Section 3.4, and we
form conclusions in Section 3.5.

3.1

Monitoring Linux System Calls

System call monitoring tracks system calls that are executed by an application,
often at a level that is much coarser than tracking individual processor instructions.
A survey on system-call monitoring [17] describes how this type of monitoring has
evolved over time. The main difference between this work and our approach is that we
do not track patterns of system calls. Instead, we focus on ensuring that the processor
instructions associated with a system call are executed faithfully. This approach
ensures that attacks via system calls do not succeed. The existing approaches to
system call monitoring can be used orthogonally to our work.
The current Linux kernel (version 4.13.15) contains code for 337 different system
calls. Between 1999 and 2017, 1,931 vulnerabilities in the Linux kernel were reported
to the Common Vulnerabilities and Exposures (CVE) database that is maintained by
MITRE. Of those, 45 vulnerabilities (2.3 %) directly relate to system calls. This may
seem like a small percentage. However, the existence of a vulnerability is particularly
problematic if an exploit exists that can let an attacker use the vulnerability in a
practical manner. Of 148 publicly available exploits (listed in the Exploit Database
maintained by Offensive Security) that lead to privilege escalation attacks (which
gives the attacker full control over the system), 25 exploits (16.9 %) are based on
vulnerabilities in system calls.
A typical attack, as we describe in more detail in Section 3.3.1, uses a buffer
overflow to redirect program execution to shell code or other attack code. Since
the kernel operates at the highest level of privilege in the system, achieving the
execution of malicious code through redirection of a system call can give an attacker
the highest level of access. By protecting system calls from such attacks through
verification of correct execution, which can detect buffer overflow attacks that change
code execution, we can protect the system from exploits that use known and unknown
vulnerabilities. This protection works for attacks that are launched through software
that is executed on the system directly, as well as attacks that are launched remotely
through the network.

3.2

Architecture of the Monitoring System

In this section, we describe the general system architecture and operation of our
monitoring system. In securing system calls with the hardware monitoring system, the
following security issues are considered. The main goal of our monitoring system is to
prevent execution deviations from system calls to malicious code. If such a deviation
is detected, execution is stopped and the processor is reset. Our security model
assumes that an attacker may access the target system and tamper with processor
instructions and data remotely through an I/O interface, although it is not possible
to tamper with the monitoring system.

3.2.1

Basic Monitor Operation

As mentioned in Section 2.1.1, hardware monitors are components that are colocated with processor cores to track the processing of software on that core. The
objective is to assess the operation of the processor and determine when incorrect
behavior is detected (which can be due to benign faults or malicious attacks). In our
work, we use a hardware monitor that receives information about every instruction
executed on the processor core and compares it to a monitoring graph that is generated
from the processing binary. Each instruction is represented by a hash value (to
reduce the size of the monitoring graph compared to the size of the binary) and state
transitions correspond to possible control flow paths between instructions. We use
a deterministic finite automaton (DFA) representation of the monitoring graph (as
detailed in [14]).
For this work, a monitoring graph is generated during design compilation (see
Section 2.1.2) for selected system calls. A detailed view of our monitoring subsystem is shown in Figure 3.1. The portions of the monitoring system can be split into
monitoring hardware (three boxes in upper left corner of the figure), which checks the
per-instruction operation of the companion processor, graph memory, which stores
monitoring graphs, and controller. The monitoring hardware checks each processor
instruction using an entry from the monitoring graphs stored in graph memory. In
Figure 3.1, graphs for four separate system calls are stored in slots in the graph memory. Each graph includes one row per instruction, effectively representing expected
program control flow as a state machine. A read address pointer indicates the entry
in the graph that corresponds to the instruction that has just completed execution.

[Figure 3.1: block diagram, image not reproduced. Labels recoverable from the diagram: CPU instruction and program counter from the CPU pipeline, hash calculation, one-hot encoding, hash comparison, sequencing logic, address pointer, base addresses register file, graph memory (slot 1 to slot 4 regions), control unit, system call address CAM, system call to frame binding, return information stack, DMA from graph pool, recovery signal, CPU interrupt controller.]

Figure 3.1. System architecture for a hardware monitor that supports selective
system call monitoring.

During the execution of an instruction, a multi-bit (in our case 4-bit) hash value
of the instruction is generated and converted to a one-hot representation. Previous
work has shown a 4-bit hash value to be sufficient to limit collisions [83]. The one-hot
encoding is compared against the expected next instruction hash values (valid hash)
that are stored in the graph entry for the previously executed instruction. The use
of a one-hot representation simplifies these comparison operations.
A match of an instruction hash against a stored valid hash indicates a valid instruction. If no match occurs, an illegal instruction has been executed, leading to
the generation of a recovery signal which is used by the processor for process termination. Since control flow instructions (e.g., branch) may have several possible next
instructions, and, consequently, several possible valid hashes, multiple one-hot valid
hash bits may be set per entry. A match of any of these hashes indicates a valid
instruction. Our approach can handle dynamic branch targets by profiling the code
to determine all branch targets for a system call prior to graph generation. Entries
for these targets are then added to the graph.
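The comparison described above can be sketched in software as follows. This is a minimal illustration, not the hardware implementation: the hash function `hash4` is an assumption made for the example (an XOR fold of a 32-bit instruction word into 4 bits), since the text does not specify the hash. The one-hot and valid-hash values in the usage lines are the ones reported for the waveforms in Section 3.4.

```python
def hash4(instruction: int) -> int:
    """Hypothetical 4-bit hash: XOR-fold a 32-bit instruction word into one nibble."""
    h = 0
    for shift in range(0, 32, 4):
        h ^= (instruction >> shift) & 0xF
    return h


def one_hot(hash_value: int) -> int:
    """Convert a 4-bit hash (0..15) into a 16-bit one-hot vector."""
    return 1 << hash_value


def is_valid(one_hot_hash: int, valid_hash_mask: int) -> bool:
    """A single AND tells whether the executed hash is among the allowed ones.

    Several bits may be set in valid_hash_mask (e.g., after a branch);
    a match on any of them counts as a valid instruction.
    """
    return (one_hot_hash & valid_hash_mask) != 0


# Values taken from the waveform discussion in Section 3.4.
VALID_MASK = 0b0100_0000_1110_0100
normal = is_valid(0b0000_0000_0000_0100, VALID_MASK)  # bit 2 is set in the mask
attack = is_valid(0b0000_0001_0000_0000, VALID_MASK)  # bit 8 is not set in the mask
```

With a one-hot encoding, the check reduces to a bitwise AND followed by a zero test, which is why several legal successors can share one graph entry.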
The next read address (memory row) in the monitoring graph is determined using
next state information stored in the current entry, the matched hash value, and
information stored in base address registers which group states based on fan-in count
[14]. These values are combined via addition in the sequencing logic box in Figure 3.1.
The resulting address is stored in the address pointer and subsequently added to the
start address for the appropriate graph slot for the system call. The implemented
monitor requires only one memory lookup per instruction. Effectively, the monitoring
information for each system call at any given point in execution is defined by the
contents of the address pointer, the monitoring graph for the process and the contents
of the base address registers. The location of each system call monitoring graph in the
graph memory is stored in the system call to frame binding memory. The procedure
required for activating monitoring for system calls is described next.
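The per-instruction check and pointer update described in this section can be modeled in software. The sketch below is a simplification under stated assumptions: each graph entry holds a one-hot valid-hash mask and a hypothetical `next_state` table indexed by the matched hash, and the base-address grouping of [14] is omitted. The names `Entry`, `monitor`, `RecoverySignal`, and `slot_base` are illustrative and do not come from the original design.

```python
from typing import Dict, List, NamedTuple


class RecoverySignal(Exception):
    """Raised when an executed instruction has no matching valid hash."""


class Entry(NamedTuple):
    valid_mask: int             # 16-bit one-hot set of allowed next-instruction hashes
    next_state: Dict[int, int]  # matched 4-bit hash -> index of the next entry (assumed layout)


def monitor(graph: List[Entry], hashes: List[int], slot_base: int = 0) -> List[int]:
    """Walk the graph for a stream of instruction hashes.

    Returns the graph memory addresses that were read (one per instruction)
    and raises RecoverySignal on the first mismatch.
    """
    pointer = 0  # address pointer, relative to the start of the graph slot
    addresses = []
    for h in hashes:
        entry = graph[pointer]                 # single memory lookup per instruction
        addresses.append(slot_base + pointer)  # slot start address is added to the pointer
        if ((1 << h) & entry.valid_mask) == 0:
            raise RecoverySignal(f"unexpected hash {h} at entry {pointer}")
        pointer = entry.next_state[h]
    return addresses


# Toy three-entry graph; entry 1 models a branch with two legal successors.
GRAPH = [
    Entry(1 << 3, {3: 1}),
    Entry((1 << 5) | (1 << 9), {5: 2, 9: 0}),
    Entry(1 << 1, {1: 0}),
]
TRACE = monitor(GRAPH, [3, 9, 3, 5, 1], slot_base=0x8000)
```

In this model, the state of the monitor at any point is just the pointer plus the graph contents, which mirrors the observation that only one memory lookup is needed per instruction.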

3.2.2

Enabling and Disabling Monitoring

Since monitored system calls can be invoked from within user applications or unmonitored system calls, a mechanism to seamlessly enable and disable the hardware
monitor once a system call is invoked or retired is included in our monitoring system.
Monitoring is stopped after the monitored system call is finished and the user application or unmonitored system call execution is restarted. We consider four specific
scenarios: (1) a monitored system call is called from an application (monitor activated), (2) a return from a monitored system call to an application or unmonitored
system call (monitor deactivated), (3) an unmonitored system call is called from a
monitored system call (monitor deactivated), and (4) a return from an unmonitored
system call to a monitored system call (monitor activated).
Scenario (1): Call to monitored system call from unmonitored code.
After Linux is compiled into a loadable image, the addresses of the kernel functions
and system calls are fixed. The starting address of each system call is used as a
unique identifier. For each system call, there is only one entry point, which is used to
trigger the monitor. The hardware-based solution triggers monitoring upon entry into
a system call by matching the system call program counter to one of a series of valid
stored values (valid bit = 1) in a content addressable memory, shown in Figure 3.1 as
the system call address CAM. As a transition to the monitored system call is made,
the monitor is enabled.
For example, when the microprocessor executes an instruction, the program counter
which has been extracted from the exception stage of the processor pipeline is compared against all of the valid system call starting addresses in the CAM. If it matches
a stored address, the monitor is activated to start tracking microprocessor code execution using the monitoring graph generated during the compilation process for the
system call. Prior to Linux execution, the CAM is loaded with the start addresses of
the monitored system calls. Information in the monitor, including monitoring graphs
and the system call address CAM, is loaded through a secure channel that is not
accessible to application users. Any modifications to the CAM table are performed
using known secure techniques [28].
Scenario (2): Return from monitored system call. A scalable approach is
used to disable the monitor upon leaving a monitored system call since multiple exit
points in the call may exist. To avoid using a large CAM to match the PC against
all exit points, monitor disable information is embedded within the monitoring graph
of the system call. As discussed in Section 3.2.1, each entry in the monitoring graph
contains a one-hot encoding of the valid hashes for the next instructions which succeed
the current one. Normally, one or more of those bits are set to one according to
the number of legitimate next instructions. However, if the instruction is the last
instruction of the system call, all hash bits are set to zero indicating a system call
return. This value disables monitoring.

3.2.3

Handling Nested System Calls

The mechanism described above is most effective if the call to a monitored system
call is made from application code and a return to this code is made when the system
call terminates. However, in many cases a monitored system call may invoke another
system call that may be monitored or unmonitored. Thus, monitoring may need to
be suspended for a time and then restarted upon return to the monitored system call.
Scenario (3): Call to unmonitored system call from monitored system
call. If unmonitored code is called from the monitored system call, the return address
of the monitored code is stored on the return information stack and the monitor is
deactivated. When a current, monitored system call switches to a new system call,
its return address is stored on the stack. The stack consists of three different fields:
system call ID which is the starting address of the monitored system call, return
PC which is the next PC of the current system call which will be executed on the
microprocessor after returning from the callee, and finally the current pointer to the
monitoring graph of the current system call.
Scenario (4): Return from unmonitored system call to monitored system call. When a return is made from the unmonitored code to monitored code,
the return PC is checked against the top of the return information stack to determine
if monitoring should be re-enabled. If a return is made to the monitored code, the
monitor is reactivated and the return PC is popped from the stack.
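The four scenarios can be summarized as a small software model of the enable/disable logic. This is a sketch under assumptions, not the implemented controller: the CAM is modeled as a set of start addresses, the return information stack as a list, and the event methods (`on_pc`, `on_valid_mask`, `on_call_out`) are hypothetical names. How the hardware recognizes a call out of monitored code is not detailed in the text, so the model takes it as an explicit event.

```python
from typing import List, NamedTuple, Optional, Set


class Frame(NamedTuple):
    syscall_id: int     # starting address of the monitored system call
    return_pc: int      # PC at which the monitored call resumes
    graph_pointer: int  # position in the monitoring graph when suspended


class MonitorControl:
    def __init__(self, cam: Set[int]) -> None:
        self.cam = cam                     # start addresses of monitored system calls
        self.stack: List[Frame] = []       # return information stack
        self.active: Optional[int] = None  # ID of the call being monitored, if any
        self.graph_pointer = 0

    def on_pc(self, pc: int) -> None:
        """Scenarios (1) and (4): decide whether this PC turns monitoring on."""
        if self.active is not None:
            return
        if self.stack and self.stack[-1].return_pc == pc:
            frame = self.stack.pop()       # scenario (4): resume the suspended call
            self.active = frame.syscall_id
            self.graph_pointer = frame.graph_pointer
        elif pc in self.cam:
            self.active = pc               # scenario (1): entry point matched in the CAM
            self.graph_pointer = 0

    def on_valid_mask(self, valid_mask: int) -> None:
        """Scenario (2): an all-zero valid-hash field marks the system call return."""
        if self.active is not None and valid_mask == 0:
            self.active = None

    def on_call_out(self, return_pc: int) -> None:
        """Scenario (3): suspend monitoring while the callee runs."""
        if self.active is not None:
            self.stack.append(Frame(self.active, return_pc, self.graph_pointer))
            self.active = None
```

Checking the stack before the CAM in `on_pc` lets a return to a suspended call take priority over a fresh entry, while a nested call to another monitored system call still activates through the CAM match.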

3.3

Prototype System Implementation

Our experimental system uses a 7-stage LEON3 processor, release 2017.2-b4193
[19] and an attached hardware monitor. The floating point unit was not included in
the design. The hardware was synthesized and mapped to a Stratix IV FPGA on
a Terasic DE4 [76] board with 1GB of DDR2 memory. To perform monitoring, the
instruction under execution and the program counter (PC) from the processor are
tapped for use by the monitor. For monitoring to work effectively, it is necessary
to ensure that only committed instructions are monitored, since a number of fetched
instructions may be flushed or annulled from the processor pipeline. For this reason,
the PC and associated instruction are tapped from the exception stage of the processor
after the annul signal can be examined. As discussed in Section 3.2.1, the instruction
is subsequently converted to a hash value and compared to a stored entry in the
monitoring graph. The PC is used to determine if monitoring should be enabled or
disabled. If the monitor detects a deviation from expected computation, the processor
is reset using a recovery signal. Detection and reset take place if an inappropriate
instruction is executed. The processor additions needed to tap the PC and instruction
are negligible and our results show no loss of processor clock speed performance as a
result of this action.
In a secure system, all system calls should be monitored to prevent a system-call-based attack. Our monitor micro-architecture shown in Figure 3.1 is designed to
monitor a subset of calls, as needed. For this work, we focus on the four system calls
shown in Table 3.1 (more calls can be easily added). We chose these four system calls
since the first contains the known vulnerability CVE-2013-1828 and others have been
characterized as particularly vulnerable calls and used for kernel exploitation [45].

Table 3.1. Monitoring graph sizes for four Linux system calls

System call   Number of instructions   Number of entries   Graph size (bits)
getsockopt    49,252                   68,422              2,531,614
execve        49,816                   70,318              2,601,766
open          37,953                   54,520              2,017,240
mmap          171                      254                 9,398

3.3.1

Attack Scenario

To evaluate the ability of our monitor to detect and prevent an attack, we tested
our processor/monitor system with a known and published Linux attack from the
Exploit Database, ID 24747 [53] and an additional attack that is derived from it. The
latter attack exploits a vulnerability in the function sctp_getsockopt_assoc_stats() of
the getsockopt system call and leads to a privilege escalation.
In the function, a call to copy_from_user() is used to copy the contents of a user-provided buffer into a data structure defined inside the function’s local scope. Since
there is no size check before calling the function, the user can provide a buffer to
the system call which is bigger than the size of the local buffer. Therefore copying
the buffer contents to the sctp_getsockopt_assoc_stats() function’s local stack frame
can overwrite substantial portions of the stack. One version of the exploit crashes
the processor-based system by overwriting the return address of the function with
an invalid address (0x41414141). The attack can also be used to force a privilege
escalation that silently changes the access level of the user from a limited normal user
to root access.
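The overwrite mechanism can be illustrated with a toy model of a stack frame. This is a simulation in Python, not kernel code: the buffer size, the legitimate return address, the frame layout (local buffer immediately followed by a 4-byte big-endian return address), and all function names are assumptions made for the example. Only the invalid address 0x41414141 comes from the text.

```python
import struct

LOCAL_BUFFER_SIZE = 16       # hypothetical size of the local data structure
RETURN_ADDRESS = 0x40F2A9C4  # hypothetical legitimate return address


def make_frame() -> bytearray:
    """Toy stack frame: local buffer followed by a 4-byte saved return address."""
    return bytearray(LOCAL_BUFFER_SIZE) + bytearray(struct.pack(">I", RETURN_ADDRESS))


def unchecked_copy(frame: bytearray, user_data: bytes) -> None:
    """Mimics a copy with no check against the local buffer size.

    The toy frame ends after the return address, so bytes beyond it are dropped.
    """
    for i, byte in enumerate(user_data[: len(frame)]):
        frame[i] = byte


def checked_copy(frame: bytearray, user_data: bytes) -> None:
    """Same copy, but rejects input larger than the local buffer."""
    if len(user_data) > LOCAL_BUFFER_SIZE:
        raise ValueError("user buffer larger than local buffer")
    unchecked_copy(frame, user_data)


def saved_return(frame: bytearray) -> int:
    """Read back the saved return address from the toy frame."""
    return struct.unpack(">I", frame[LOCAL_BUFFER_SIZE:LOCAL_BUFFER_SIZE + 4])[0]


frame = make_frame()
unchecked_copy(frame, b"A" * (LOCAL_BUFFER_SIZE + 4))  # 0x41 is ASCII 'A'
overwritten = saved_return(frame)
```

Filling the input with the byte 0x41 yields the invalid return address mentioned above, while replacing the last four bytes with a chosen address models the redirection used for privilege escalation.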
In Linux, the /etc/passwd plain text file holds information about user accounts
and their access levels. By modifying this file, one can grant any account root access.
However, all users except root can only read this file and write access to this file is
only granted to the root account. In our attack, instead of rewriting the stack with
random data, therefore destroying the return address, system call information is fed
to a buffer with meaningful data so that a user can gain root access. Specifically,
the return address of the sctp_getsockopt() function is changed to the starting address
of the call_usermodehelper() function which is a part of the kernel and is used to
prepare and run a user mode application from within the kernel. Using this function,
/bin/sed, a stream editor in Unix-based operating systems, is executed to rewrite
/etc/passwd and grant root access to the user. Figure 3.2 shows the attack in action.

Figure 3.2. Console output showing that the attack script changes the test account
privilege from a normal user to root.

A key aspect of this attack is the writing of the attack arguments to call_usermodehelper() that are passed to /bin/sed and the branch to the function from the
system call sctp_getsockopt(). When call_usermodehelper() is called, it receives its
four operands on the stack (two char*, one char**, and an int). Using monitoring, it is possible to detect the unexpected branch to call_usermodehelper() since
the instructions of this function will not have entries in the monitoring graph. The
microprocessor is reset in this case to prevent the user from attaining root access.

3.4

Experimental Results

To evaluate performance, our processor and monitor architecture was mapped
to the DE4’s Stratix IV EP4SGX230 FPGA. A maximum system clock frequency
of 110 MHz was achieved both with and without the monitor. Signals internal to
the FPGA were monitored using Intel SignalTap, leading to the waveforms shown in
Figure 3.3. The observed waveforms come from an attempted return from the system
call function sctp_getsockopt(). Figure 3.3 (top) shows processor behavior during a
normal return from the function starting at cycle 130. At this point, the one-hot hash
encoding (0000 0000 0000 0100) of the next instruction matches one of the acceptable
encoded valid hashes in the stored monitoring graph (0100 0000 1110 0100) in bit 2.
The same observation can be made for the hashes of the next instruction. Thus, the
instruction execution matches one of the expected execution paths determined during
design compilation.

3.4.1

Attack Detection and Recovery

Figure 3.3 (bottom) shows the details of monitoring activities when the attack
described in Section 3.3.1 is performed. In this case, the return address of the
sctp_getsockopt() function has been overwritten with the address of the call_usermodehelper()
function. Since the first instruction in this function was not an acceptable
return target for sctp_getsockopt(), the one-hot hash of this instruction will not match
a valid hash value in the monitoring graph entry. Figure 3.3 (bottom) shows that
this is the case. The one-hot hash of the instruction at cycle 130 is (0000 0001 0000
0000) while the stored valid hash value is (0100 0000 1110 0100). Since bit 8 is not
set in the valid hash value, an unexpected instruction has been executed and the
processor reset (recovery signal) can be asserted low. Note that the set bit in the
one-hot hash of the next instruction also does not match the appropriate bit in the
valid hash value. Although the reset signal causes the processor to restart, which an attacker could exploit to mount a denial-of-service attack, this outcome is preferable
to an unauthorized user gaining superuser access to the system.

Figure 3.3. Waveforms showing normal execution of the system call (top) and detection of the attack (bottom).

Table 3.2. Resource utilization of the hardware monitor and LEON3 processor

Resource        Available   LEON3     Monitoring system   CAM/stack   Secure HW mon. loader
Logic LUTs      182K        20,070    380                 6,555       2,603
Memory LUTs     91K         170       0                   0           0
Flip flops      182K        15,053    324                 11,457      2,936
On-chip mem.    13.9Mb      522Kb     2,983Kb             0           954Kb
Off-chip mem.   8,192Mb     95.7Mb    0                   0           0

3.4.2

Monitoring System Overhead

Using the graph generation approach described in Section 3.2.1, we examined the
size of four representative Linux (version 3.8.0) system calls, including the getsockopt
call described in Section 3.3.1. The number of instructions, the number of monitoring
graph memory entries, and the total graph sizes in monitoring graph memory in bits
for each system call are shown in Table 3.1.
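As a quick consistency check on Table 3.1, the short sketch below derives the number of bits per graph entry and the number of entries per instruction for each system call. The numbers are copied from the table; the dictionary layout and the function name `summarize` are illustrative choices.

```python
# (instructions, entries, graph size in bits), copied from Table 3.1
TABLE_3_1 = {
    "getsockopt": (49_252, 68_422, 2_531_614),
    "execve": (49_816, 70_318, 2_601_766),
    "open": (37_953, 54_520, 2_017_240),
    "mmap": (171, 254, 9_398),
}


def summarize(table):
    """Derive per-entry width and entry/instruction ratio for each system call."""
    summary = {}
    for name, (instructions, entries, bits) in table.items():
        summary[name] = {
            "bits_per_entry": bits / entries,
            "entries_per_instruction": entries / instructions,
        }
    return summary


SUMMARY = summarize(TABLE_3_1)
for name, row in SUMMARY.items():
    print(f"{name:12s} {row['bits_per_entry']:5.1f} bits/entry "
          f"{row['entries_per_instruction']:.2f} entries/instruction")
```

Every row works out to 37 bits per entry, and each system call needs roughly 1.4 to 1.5 graph entries per instruction, so graph size scales directly with the entry count.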
In this analysis, we consider overheads associated with implementing a hardware
monitor on-chip with the microprocessor. For performance reasons, the monitoring
graph is stored on-chip to allow for instruction-by-instruction hash value comparisons.
Thus, we assess both the logic overhead and the overhead of on-chip memory. In
addition, if a new system call is used, its monitoring graph may need to be securely
loaded from off-chip memory using DMA to one of the graph memory slots shown in
Figure 3.1 [58]. The resources needed to implement the microprocessor, the monitor
and its associated graph transfer circuitry are shown in Table 3.2.
The table shows that the monitor and associated control circuitry require dramatically less circuitry than the processor since it is a simple finite state machine. On-chip
memory is needed so that each graph entry can be quickly obtained and compared to
the currently-executing instruction. The table also includes the resources needed to
securely load encrypted system call monitoring graphs from external memory. This
circuitry includes a decryption circuit which increases the overhead of the interface.
Finally, the resources needed to implement the system call address CAM and return
information stack used to identify monitoring start and stop points (described in
Section 3.2.2) for up to 337 different system calls are shown in the table.
By far, the most expensive part of monitoring is the on-chip memory needed to
store the monitoring graphs. In this system, Linux instructions are stored off-chip
so monitoring storage takes up the bulk of the on-chip storage. In our design, the
monitoring graphs consume less than one-quarter of available on-chip memory so
sufficient space is available for other circuitry. Overall, our results show that system
call monitoring for advanced embedded operating systems, such as Linux, can be
performed efficiently.
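The one-quarter claim can be checked against the on-chip memory row of Table 3.2. The sketch below uses the table values as given; whether 1 Mb is taken as 1000 Kb or 1024 Kb is not stated in the text, so both conventions are evaluated as an assumption.

```python
ON_CHIP_AVAILABLE_MB = 13.9  # "Available" column, on-chip memory row of Table 3.2

# On-chip memory use in Kb per component, copied from Table 3.2
ON_CHIP_USED_KB = {
    "LEON3": 522,
    "monitoring system": 2_983,
    "CAM/stack": 0,
    "secure HW monitor loader": 954,
}


def fraction_of_on_chip(used_kb: float, kb_per_mb: int = 1000) -> float:
    """Fraction of available on-chip memory, for a chosen Kb-per-Mb convention."""
    return used_kb / (ON_CHIP_AVAILABLE_MB * kb_per_mb)


monitor_decimal = fraction_of_on_chip(ON_CHIP_USED_KB["monitoring system"], 1000)
monitor_binary = fraction_of_on_chip(ON_CHIP_USED_KB["monitoring system"], 1024)
total_decimal = fraction_of_on_chip(sum(ON_CHIP_USED_KB.values()), 1000)
```

Under either convention the monitoring system stays below 25% of the available on-chip memory (about 21%), and the four listed components together use about a third of it.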

3.5

Conclusion

System calls in sophisticated embedded operating systems are known to be vulnerable targets for attackers. We present a low-overhead monitoring approach that
allows for selective instruction-by-instruction monitoring of system calls. Our approach has been demonstrated in hardware to successfully identify and prevent a
known Linux system call attack. The overhead of the monitor is modest and does not
impact the performance of the microprocessor. Thus, our work presents an effective
tool to protect embedded processing systems.

CHAPTER 4
CHARACTERIZATION OF FPGA LONG WIRE DATA
LEAKAGE

Chapter 3 introduced an effective monitoring technique to protect embedded systems against software attacks that attempt to hijack the execution flow. In this
chapter we investigate side channel attacks, specifically on FPGAs. Side channel attacks attempt to steal secret information stored in a device or disrupt its operation
by exploiting the intrinsic physical properties of the hardware implementation of a
computing system. In this chapter and the next one, we study two classes of side
channel attacks which are broadly categorized as crosstalk-based information leakage
and on-chip FPGA voltage attacks. Both classes of attacks are directly relevant to the
use-case scenario where an FPGA device is simultaneously used by multiple tenants.
While the multi-tenant use of FPGAs provides a mechanism for maximizing the
utilization of FPGA resources, it also presents unique security challenges. In this
chapter, we address several important issues related to crosstalk-based attacks in
SRAM-based FPGAs:
• We present a precise characterization of the effect of a neighboring wire on a
channel wire’s delay. This new model is shown to be robust across a range of
hardware implementations of attack circuitry.
• In some cases, FPGA companies do not publicly disclose wiring adjacency for
FPGA routing channels, limiting a user’s ability to ensure that wires adjacent
to critical routes are unused. In this work, we show that it is straightforward
to determine routing adjacency for all channel wires using crosstalk effects as a
guide in building an adjacency map.
The methodology used in our work is detailed in Section 4.1. In Section 4.2 we
explain our approach for determining channel adjacency. The sensitivity of each wire
to coupling is addressed in Section 4.3, and we conclude the chapter in Section 4.4.

4.1

Methodology

We perform experiments on three different classes of FPGAs that were fabricated
in different technology nodes. Our experiments are performed on two Cyclone IV GX
(EP4CGX150DF31) FPGA Development Kits, one Stratix V (5SGXEA7K2F40C2N)
GX Development Kit, and one DE5a-Net Arria 10 GX (10AX115N2F45E1SG) FPGA
Development Kit. The Cyclone IV, Stratix V, and Arria 10 devices are implemented
in 60 nm, 28 nm, and 20 nm CMOS technologies, respectively.
Figure 4.1 shows the block diagram of the test setup used to assess the long
wire covert channels in these system types. In each experiment, the transmitter and
receiver are implemented in the FPGA in one or more vertical long wires. A test
pattern generator assigns either a logic 1 or a logic 0 to the transmitter in each trial
and the effect on the frequency of the receiver is measured by counting its oscillations
during 1,024 trials of 21 ms each unless noted otherwise. Half of the 1,024 trials
use a transmitted value of 1, and the other half use a transmitted value of 0. The
ring oscillator, transmitter and receiver are placed and routed using place and route
constraints.

4.1.1

Metric

We introduce a new metric ∆t that captures the amount by which the value of
the transmitter affects the propagation delay of transitions on the receiver wire. This
metric is designed to eliminate the RO-based variability introduced by the ∆RC metric

40

FPGA
FPGA

Figure 4.1. Experimental framework for evaluating long wire delay effects on SRAM
FPGAs [67].

in Eq. (2.1). The changes in propagation delay on the receiver wire are on the order
of hundreds of femtoseconds and cannot be measured directly. However, they can be
inferred from frequency measurements collected by on-chip circuitry counting ring
oscillator cycles.
During each period of a ring oscillator, every circuit node in the ring makes exactly
one rising and one falling transition. The period of a ring oscillator that contains a
particular receiver wire of interest can be described as the sum of four terms: $d_{rx\uparrow}$
represents the propagation delay of a rising transition on the receiver, $d_{n\uparrow}$ represents
the summed propagation delays of one rising transition on all other ring nodes, and
$d_{rx\downarrow}$ and $d_{n\downarrow}$ represent the receiver and summed ring node delays for the corresponding
falling transitions. Using superscripts to denote the value of the transmitter during
a measurement, the frequency of the ring when the transmitter holds a value of 1
can therefore be written as $f^{(1)}$ shown in Eq. (4.1). Term $f^{(0)}$ is defined analogously
for the case of a 0-valued transmitter. Measuring the frequency of the same ring
oscillator with a transmitted 0-value and 1-value allows for calculating ∆t as shown

in Eq. (4.2). The delay terms ($d_{n\uparrow}$ and $d_{n\downarrow}$) that are unrelated to the receiver wire
cancel out from the two frequency measurements, leaving only the delay changes in
the receiver wire.
The value ∆t represents the change in propagation delay on the receiver wire that
is caused by the change on the value of the transmitter. More precisely, as shown by
the second line of Eq. (4.2), ∆t is the average propagation delay change over rising
($d_{rx\uparrow}$) and falling ($d_{rx\downarrow}$) transitions of the receiver. We make no claim as to whether
the change in receiver delay is occurring predominantly on one transition or equally
on both.

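$$
f^{(1)} = \frac{1}{d_{n\uparrow} + d_{rx\uparrow}^{(1)} + d_{n\downarrow} + d_{rx\downarrow}^{(1)}}
\qquad
f^{(0)} = \frac{1}{d_{n\uparrow} + d_{rx\uparrow}^{(0)} + d_{n\downarrow} + d_{rx\downarrow}^{(0)}}
\tag{4.1}
$$

$$
\begin{aligned}
\Delta t &= \frac{\dfrac{1}{f^{(0)}} - \dfrac{1}{f^{(1)}}}{2} \\
         &= \frac{\left(d_{rx\uparrow}^{(0)} - d_{rx\uparrow}^{(1)}\right) + \left(d_{rx\downarrow}^{(0)} - d_{rx\downarrow}^{(1)}\right)}{2}
\end{aligned}
\tag{4.2}
$$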

The ∆t metric is different from the metric of fractional change in oscillator counts
that is used in prior work [21, 67] and also shown in Eq. (2.1), and we use Figure 4.2
to demonstrate the motivation for using the new metric. The two cases shown in
Figure 4.2 use the same neighboring transmitter and receiver wires on the same
Cyclone IV chip, with the transmitter and receiver wires running parallel to each
other for a length of 10 C4 wire segments running upward from position X113Y2. The
only difference between the two scenarios is that the ring oscillator circuit used for
measurement in the figure at left has extra wiring delay added to the ring intentionally,
such that its period is roughly 50 % higher than the circuit used for the figure at right.
The added delay is on a part of the ring away from the transmitter and receiver wires,
and therefore does not impact their coupling. A good metric should indicate the same

Figure 4.2. Figure shows measured receiver frequency for the same transmitter and
receiver wires when measured with two different length ring oscillators.

amount of coupling in both cases. In each case, we measure the oscillator frequencies
as shown and compute the value of ∆t using Eq. (4.2), obtaining values of 3.28 ps
in the first case and 3.32 ps in the second. The good agreement between the two
experiments demonstrates that ∆t captures the change of the receiver delay while
being insensitive to the overall ring delay. The prior metric of ∆RC is sensitive to the
overall ring delay and yields a fractional change in oscillator frequency of 2.66e-4 and
4.05e-4 in the two experiments (a 52 % discrepancy), despite no changes to the part
of the circuit in which the coupling occurs.
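
The insensitivity of ∆t to the overall ring delay can be checked numerically. The following is a minimal sketch, not the measurement code used in this work: all delay values are hypothetical, the function names are illustrative, and `fractional_change` is only an assumed stand-in for a fractional-change metric in the spirit of Eq. (2.1).

```python
def ring_frequency(d_n_rise, d_rx_rise, d_n_fall, d_rx_fall):
    """Ring frequency as in Eq. (4.1): reciprocal of the four-term period."""
    return 1.0 / (d_n_rise + d_rx_rise + d_n_fall + d_rx_fall)


def delta_t(f0, f1):
    """Eq. (4.2): half the difference of the periods measured with tx=0 and tx=1."""
    return (1.0 / f0 - 1.0 / f1) / 2.0


def fractional_change(f0, f1):
    """Assumed stand-in for a fractional change in oscillator frequency."""
    return abs(f1 - f0) / f1


# Hypothetical receiver delay per transition (seconds) for each transmitter value.
D_RX_TX0 = 103.3e-12
D_RX_TX1 = 100.0e-12


def measure(d_n):
    """Return (delta_t, fractional_change) for a ring whose other nodes
    contribute a summed delay of d_n per transition direction."""
    f0 = ring_frequency(d_n, D_RX_TX0, d_n, D_RX_TX0)
    f1 = ring_frequency(d_n, D_RX_TX1, d_n, D_RX_TX1)
    return delta_t(f0, f1), fractional_change(f0, f1)


dt_short, frac_short = measure(4.0e-9)  # shorter ring
dt_long, frac_long = measure(6.0e-9)    # extra delay added away from the receiver

print(f"delta_t: {dt_short * 1e12:.3f} ps vs {dt_long * 1e12:.3f} ps")
print(f"fractional change: {frac_short:.3e} vs {frac_long:.3e}")
```

In this sketch the two rings report the same ∆t because the non-receiver terms cancel in the subtraction, while the fractional change shrinks as the ring gets longer.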
Removing the dependence of the characterization metric on oscillator frequency
is important for accurately characterizing the leakage, because the ring oscillator
frequency will inevitably change across experiments that vary parameters such as
technology node, or the length or type of wires used for transmitter and receiver.


4.2 Recovering Channel Layout Through Measurement

FPGA vendors differ in terms of providing customers with easy access to channel
adjacency information for their FPGA devices. Adjacency information for Xilinx
SRAM FPGA devices can be visually determined from Vivado floorplanning tools,
version 2018.2. However, the corresponding view in Intel’s visual editor does not
allow a user to infer adjacency (Quartus Prime v18.1). Knowledge of adjacency is
necessary if a user wishes to deploy fine-grained isolation by ensuring that sensitive
wires have no neighbors that could snoop on their values using crosstalk.
In this section we show that characterizing the coupling between wires makes it
possible to infer channel layout, which could enable design isolation techniques to
reduce the risk of leakage between adjacent wires. Layout/adjacency information of
a channel is inferred by testing all possible transmitter-receiver pairs in the channel,
and measuring the value of ∆t to check for evidence of coupling for each pair. Wires
that impact each other are reasonably assumed to be neighbors in the channel.
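
The pairwise test described above can be sketched as a threshold over a matrix of ∆t values. This is an illustrative sketch only: the function names, the threshold value, and the 4-wire example data are hypothetical, and the text does not state a numeric cutoff for what counts as a significant ∆t.

```python
def infer_neighbors(delta_t, threshold):
    """Map each receiver index to the transmitters whose |delta_t| exceeds threshold.

    delta_t[rx][tx] is the delay change seen on receiver rx for transmitter tx.
    """
    neighbors = {}
    for rx, row in enumerate(delta_t):
        neighbors[rx] = [tx for tx, value in enumerate(row)
                         if tx != rx and abs(value) > threshold]
    return neighbors


def is_symmetric(neighbors):
    """True if every inferred coupling is bidirectional."""
    return all(rx in neighbors[tx]
               for rx, txs in neighbors.items()
               for tx in txs)


# Hypothetical 4-wire channel whose physical order is 0-2-1-3 (values in fs).
example = [
    [0.0, 1.0, 45.0, 0.5],
    [0.8, 0.0, 50.0, 47.0],
    [44.0, 49.0, 0.0, 1.2],
    [0.4, 46.0, 0.9, 0.0],
]

adjacency = infer_neighbors(example, threshold=10.0)
print(adjacency)
print("bidirectional:", is_symmetric(adjacency))
```

Wires at the edge of the hypothetical channel end up with a single neighbor, and interior wires with two.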
The C4 channel in a Cyclone IV device has 96 wires, of which 48 travel in the
upward direction. We explore these 48 wires to determine which are neighbors. Each
LAB can connect to 12 of the 48 wires, and it takes a vertical span of 4 logic array
blocks (LABs) to fill the channel. Each of these 48 wires can be the receiver or
transmitter, so 2,304 pairs of wires are considered to exhaustively characterize the
channel. Figure 4.3 demonstrates the coupling that exists between all pairs of the 48
wires. The measurements are collected using transmitter and receiver wires that are
10 C4 wires long, and then normalizing the value of ∆t to the length of a single C4.
In particular, the 48 wires in the channel being characterized are driven from
LABs X12Y2, X12Y3, X12Y4, and X12Y5. The correspondence between the indices
0–47 and the physical resources of the Cyclone IV is given in Table 4.1. Looking
carefully at Figure 4.3 we can see for example that transmitters at the 16th and 40th
indices induce significant values of ∆t on a receiver in the 4th index. This implies that


Index   Logic Element          Wire in Channel
0       LCCOMB X12 Y2 N0       X12Y3S0I0
1       LCCOMB X12 Y2 N2       X12Y3S0I1
2       LCCOMB X12 Y2 N4       X12Y3S0I2
3       LCCOMB X12 Y2 N6       X12Y3S0I3
4       LCCOMB X12 Y2 N10      X12Y3S0I4
5       LCCOMB X12 Y2 N14      X12Y3S0I5
6       LCCOMB X12 Y2 N16      X12Y3S0I6
7       LCCOMB X12 Y2 N18      X12Y3S0I7
8       LCCOMB X12 Y2 N20      X12Y3S0I8
9       LCCOMB X12 Y2 N22      X12Y3S0I9
10      LCCOMB X12 Y2 N24      X12Y3S0I10
11      LCCOMB X12 Y2 N28      X12Y3S0I11

Table 4.1. Correspondence between indices of Figure 4.3 and physical resources on
the target Cyclone IV device (see footnote 2).

wires X12Y4S0I4 (16th index) and X12Y6S0I4 (40th index) are likely the neighbors
of X12Y3S0I4 (4th index). For most of the 48 wires, when used as receivers, we are
able to identify two other wires that as transmitters cause significant values of ∆t.
These are hypothesized to be the left and right neighbors in the channel. Some wires
in the channel do not have two clear neighbors, and this will be investigated in future
experiments. The coupling is observed to be bidirectional; if there is a significant
effect when the transmitter is index i and the receiver is index j, then a similar value
of ∆t will occur when the receiver is index i and the transmitter is index j.

4.3 Characterization

4.3.1 Susceptibility of Each Wire to Leakage

We are able to identify neighbors for all C4 wires in the channel of the Cyclone
IV device using the technique from the previous section (see Figure 4.3). Based on
finding the same adjacency information for six different channels on the device, we
Footnote 2: This list includes only the first LAB. The next 12 indices use the same resources at position
X12Y3, and so forth for the remaining 24 indices at positions X12Y4 and X12Y5.

Figure 4.3. Figure shows the measured value of ∆t per C4 wire segment for all pairs
of wires in a C4 channel on a Cyclone IV device.

assume that all channels are similar, and collect results from experiments performed
across multiple channels. Figure 4.4 shows, for each wire in the channel, the range of
∆t values indicating how much the wire delay can be changed by the value of its
neighbor. In this context, neighbor is defined as the single wire that has the largest
impact on the receiver. There is a range of values for each index because the same
measurements are taken using 6 different channels at different columns in the chip
and 5 trials for each. This result shows that, regardless of which wire is used for
routing a sensitive signal, there exists another wire in the channel with the potential
to exfiltrate that sensitive data if used as a covert receiver.


Figure 4.4. The distribution of observed ∆t for each wire in the channel when the
wire is used as a receiver and its neighbor is used as a transmitter.

4.3.2 Comparing Different Long Wire Types

Figure 4.5(a) shows the value of ∆t for chains of C4 wires that are combined to
create different length transmitter and receiver wires in the three devices. These
specific neighboring wires were chosen arbitrarily, but are representative of the typical coupling between neighbors (see Figure 4.4). Each line in the plot represents
an experiment performed in a single column, and the points on the line correspond
to measurements made within that column using different lengths of adjacent transmitter and receiver wires. The experiment is repeated at different columns in the
chip to produce the multiple lines, and each measurement is repeated three times
and results are averaged to minimize noise. Cyclone IV is measured at locations
X12, X36, X60, X84, X100, and X113. Stratix V is measured at locations X12, X50,
X108, X171 and X204. Arria 10 is measured at locations X14, X60, X108, X160 and
X208. Because the three devices have different numbers of rows, the longest wire
that can be created within a column is different for each device. The lengths of wires
are given in terms of the number of LABs spanned vertically by the receiver and
transmitter. The change in propagation delay on the receiver wire is observed to be


Figure 4.5. Measured values of ∆t versus length of wire for three different devices:
(a) C4 wires; (b) C14/C16/C27 wires.

linear in the length of the adjacency, so we consider for comparison a single value of
∆t/LAB which reflects the slope of the lines in Figure 4.5(a). We observe values of
47.8 fs/LAB, 14.0 fs/LAB, and 8.2 fs/LAB in Cyclone IV, Stratix V, and Arria 10, respectively.
Figure 4.5(b) shows an analogous plot to Figure 4.5(a) but using the longer C14,
C16, and C27 wires on the devices. Cyclone IV is measured at locations X12, X27,
X59, and X107 to create the different lines. Stratix V is measured at locations X12, X50,
X108, X171 and X204. Arria 10 is measured at locations X14, X60, X108, X160 and
X208. The values we observe for ∆t/LAB in these cases are 14.6 fs/LAB in Cyclone IV,
3.9 fs/LAB in Stratix V, and 16.5 fs/LAB in Arria 10. The unknown layout strategies
that may be employed for each wire type prevent a carefully controlled comparison,
yet our results do show that the coupling exists across wires and designs, and that
its effect is linear in the length of the adjacent wires.
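
Because ∆t grows linearly with the length of the adjacency, a single ∆t/LAB figure can be taken as the slope of a straight-line fit. The sketch below uses an ordinary least-squares fit with an intercept; whether the reported slopes were obtained this way is an assumption, and the data points are hypothetical rather than measured values.

```python
def fit_slope(lengths_labs, delta_t_fs):
    """Least-squares slope (fs/LAB) and intercept (fs) of delta_t versus length."""
    n = len(lengths_labs)
    mean_x = sum(lengths_labs) / n
    mean_y = sum(delta_t_fs) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths_labs)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(lengths_labs, delta_t_fs))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical measurements in one column: LABs spanned and delta_t in fs.
lengths = [8, 16, 24, 32, 40]
values = [390.0, 760.0, 1150.0, 1530.0, 1910.0]

slope, intercept = fit_slope(lengths, values)
print(f"slope = {slope:.3f} fs/LAB, intercept = {intercept:.3f} fs")
```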

4.3.3 Technology Comparison

Figure 4.6 compares the coupling of different long wire types on Cyclone IV,
Stratix V, and Arria 10 devices. Significant leakage is observed in all devices, and
there is not a clear trend across technology nodes. It should be noted that these
devices differ not only in their process technology, but also may have different layout
strategies tailored to their technology node and intended market segment. The coupling is given in terms of ∆t/LAB as in the previous section, meaning that the given
number represents the additional increment by which the receiver is slowed down by
the transmitter value for every LAB spanned vertically by the two adjacent wires.
If any application requires routing sensitive signals on long wires in which the other
wires in the channel are untrusted, this analysis can guide a designer in deciding
whether to use a long sequence of C4 wires, or a reduced number of the longer wire
types.
The physical size of each LAB changes with technology node. Therefore, a comparison of coupling per-LAB-span on devices implemented in different technologies is
not a fair comparison of coupling per-unit-length. To consider coupling per wirelength
in absolute terms, one must adjust for technology scaling. We do this by trying to
estimate the amount of coupling on a wire span that is equivalent in length to the
LAB height in the Cyclone IV’s 60 nm technology. Assuming that LAB height scales
in proportion to the minimum feature size of the technology node, the height of one
LAB in the Cyclone IV’s 60 nm technology is equivalent to the height of 2.14 LABs in
Stratix V (28 nm, 60/28 ≈ 2.14) and 3 LABs in Arria 10 (20 nm, 60/20 = 3). Adjusting by these factors yields
the data shown in Figure 4.7.
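
The scaling adjustment amounts to multiplying each ∆t/LAB value by the ratio of technology nodes. The sketch below applies that arithmetic to the ∆t/LAB values quoted in Section 4.3.2, under the stated assumption that LAB height scales with minimum feature size; the function and dictionary names are illustrative, and the printed numbers are this sketch's arithmetic rather than values read from Figure 4.7.

```python
REFERENCE_NODE_NM = 60.0  # Cyclone IV technology node used as the reference


def labs_per_reference_lab(node_nm):
    """Number of LABs at node_nm that span the height of one 60 nm LAB."""
    return REFERENCE_NODE_NM / node_nm


def normalize(dt_per_lab_fs, node_nm):
    """Scale delta_t/LAB to a wire length equal to one 60 nm LAB height."""
    return dt_per_lab_fs * labs_per_reference_lab(node_nm)


# (technology node in nm, C4 delta_t/LAB in fs, longer-wire delta_t/LAB in fs)
DEVICES = {
    "Cyclone IV": (60.0, 47.8, 14.6),
    "Stratix V": (28.0, 14.0, 3.9),
    "Arria 10": (20.0, 8.2, 16.5),
}

for name, (node, c4, long_wire) in DEVICES.items():
    print(f"{name}: C4 {normalize(c4, node):.1f} fs, "
          f"long wire {normalize(long_wire, node):.1f} fs per 60 nm LAB height")
```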


Figure 4.6. Values of ∆t/LAB observed using two different wire types on three
different devices.

Figure 4.7. Values of ∆t/LAB when normalized to a wire length that matches the
height of a LAB on a 60 nm technology Cyclone IV device.

4.4 Conclusion

Previous work [21, 67] has shown the existence of coupling between neighboring
long wires on both Xilinx and Intel SRAM FPGAs. In this chapter we have presented
an accurate method for quantifying the amount of coupling that exists between neighboring long wires. Our approach can detect and quantify delay changes on the order
of femtoseconds that are caused by the logic value of neighboring wires. We use the
method to characterize coupling on FPGAs in three different technology nodes including 20 nm technology. We show that coupling between long wires can be used to


recover adjacency information from channels if the information is not freely available
from the device vendor. Our findings show that the leakage exists and is significant
across all the FPGAs tested.
The experimentally-measured wire delays verify that length is not the only factor
contributing to ∆t values. Physical design and layout strategies may also determine
the propagation delay of the wires. Nevertheless, the experimental methods used
in this work can help designers quantify the data leakage susceptibility of sensitive
signals in their design to decide if mitigation is needed. Leakage can be avoided by
disallowing multiple tenants from sharing a single FPGA routing channel.


CHAPTER 5
CHARACTERIZATION OF FPGA PDN ATTACKS

On-chip FPGA voltage attacks form an important class of attacks that can be
applied to multi-tenant FPGAs. In these attacks, a malicious user controls the power
consumed in one part of the device, effectively manipulating the voltage supply for
the entire device. We show that the shared FPGA PDN allows the attacker to induce
timing faults into a co-located tenant’s circuits and effectively perform a fault injection
attack. Furthermore, aggressive power consumption can cause FPGA board reset,
thus causing a denial-of-service. To understand the threat, we characterize the voltage
response of the FPGA PDNs in three Intel FPGA families (Cyclone V, Arria 10, and
Stratix 10) and their associated boards to aggressive power consumption. Specifically,
we investigate how the amount of power consumption, the disruption time, and the
distance to the attack circuitry affect a victim circuit.
Section 5.1 describes the methodology used to characterize the FPGA PDN response to voltage attacks. The results of the characterization and our approach to
inducing instantaneous power fluctuations are described in Section 5.2. Section 5.3
summarizes the chapter.

5.1 Methodology and Calibration

5.1.1 Characterized FPGA PDNs

The Intel Cyclone V (5CSEMA5F31C6), Arria 10 GX (10AX115N2F45E1SG),
and Stratix 10 SX (1SX280HU2F50E1VG) FPGAs used for this work are located
on Terasic DE1-SoC [77], DE5a-Net [78], and DE10-Pro [79] boards, respectively.

Power to the DE1-SoC board is provided from a 12 V DC source. The 1.1 V internal
FPGA core voltage (VCCINT) is supplied by a Linear Technology LTC3608 step-down switching regulator operating at 617 kHz through a 1 µH inline inductor. The Cyclone V
device does not include on-chip voltage sensors or hardened monitors.
The DE5a-Net is equipped with a Texas Instruments TPS40422 switching regulator, which steps down the 12 V DC input voltage to 0.9 V (VCCINT) and supplies
power to the Arria 10 GX device at 300.75 kHz through a 0.47 µH inductor. The Arria
10 GX device includes an on-chip voltage sensor and a temperature sensing diode,
allowing a user to monitor the core voltage and die temperature. Both sensors are
located in the upper middle of the die.
Power to the DE10-Pro board is provided
from a 12 V DC source. The 0.9 V internal FPGA core voltage (VCCINT) is supplied
by a Linear Technology LTM4677 step-down regulator switching at 425 kHz. Two
Linear Technology LTC2945 power monitor chips were used to track the 12 V board
input supply and 0.9 V FPGA core voltages.
A schematic of a typical on-chip FPGA PDN is shown in Figure 5.1. Although
publicly available information about on-FPGA PDNs is limited, the PDN performance of several SRAM-based Xilinx FPGAs is characterized in Klokotov et al. [36].
FPGA PDN impedance characteristics were examined by Zhao et al. [90]. Power is
supplied through the inductor and distributed to core voltage inputs of the FPGA.
The resistance and capacitance of the PCB traces and on-die PDN network allow
localized voltage fluctuations to occur within the chip, such that different parts of the
fabric may have different supply voltages at the same time [36].

5.1.2 On-die Voltage Sensors

A voltage monitoring system is needed to observe the PDN response to adversarial power consumption during an attack. To determine on-chip voltage, we measure
the voltage at selected positions of the PDN using RO-based voltage sensors. The


Figure 5.1. On-chip FPGA power system. A voltage drop occurs across the inductor
due to di/dt. A steady-state voltage drop occurs in the PDN due to its resistance.

Figure 5.2. Schematic of the RO-based voltage sensor.

frequency of each oscillator decreases due to voltage drops, and a calibration procedure is required to learn the correspondence between voltage and RO frequency.
After calibration, frequency measurements made at each sensor can be translated
into associated voltages.
Figure 5.2 illustrates the layout of the sensor architecture. The sensors are placed
on the die forming a regular rectangular grid which is sufficient to perform power
analysis attacks [89]. Each sensor consists of a 19-stage RO triggering a 20-bit frequency counter. With 19 inverting stages, the remaining design meets the timing
constraints, local delay variations are minimized [74], and RO stacking can be used in
a single logic array block (LAB). Although shorter ROs are possible by inserting open
latches in the ring to increase the path delay [92], the lack of built-in latch elements in
the selected devices makes this technique unsuitable. The 19 inverting stages of the
RO design shown in Figure 5.2 achieve an average frequency of 105 MHz for Cyclone
V, 150 MHz for Arria 10, and 130 MHz for Stratix 10. Measurement periods were 2 µs
and 1 ms for the Arria 10 and Stratix 10 sensor calibration, respectively, described
in Section 5.1.3. For all the other conducted experiments, unless otherwise stated,
measurement periods of 10 µs were used. The 10 µs period provides the capability to
detect 0.1% frequency changes, corresponding to a sub-millivolt resolution in supply
voltage measurement. We found that the chosen experimental settings provide sufficient resolution for voltage characterization tests without complicating the design of
the sensor. Although counting oscillations during a 10 µs period can give an accurate
estimate of voltage, it does not accurately capture short transient voltage drops with
much shorter duration. In Chapter 6 we will show that fast transient drops can be
observed using path delay circuits which are similar to TDCs.
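
The relation between measurement period, counter width, and resolution can be illustrated with a short calculation. This sketch assumes that resolution is limited by a quantization of one count, which is a simplification introduced here; the RO frequencies, measurement periods, and 20-bit counter width are taken from the text above, and the function names are illustrative.

```python
COUNTER_BITS = 20  # width of the frequency counter in each sensor


def expected_count(ro_frequency_hz, period_s):
    """Approximate number of RO oscillations counted in one measurement period."""
    return ro_frequency_hz * period_s


def relative_resolution(ro_frequency_hz, period_s):
    """Fractional frequency change corresponding to one count (assumed limit)."""
    return 1.0 / expected_count(ro_frequency_hz, period_s)


def fits_counter(ro_frequency_hz, period_s, bits=COUNTER_BITS):
    """True if the expected count does not overflow the counter."""
    return expected_count(ro_frequency_hz, period_s) < 2 ** bits


AVERAGE_RO_HZ = {"Cyclone V": 105e6, "Arria 10": 150e6, "Stratix 10": 130e6}

for device, freq in AVERAGE_RO_HZ.items():
    count = expected_count(freq, 10e-6)
    res = relative_resolution(freq, 10e-6)
    print(f"{device}: about {count:.0f} counts in 10 us, "
          f"one count = {res * 100:.3f} % of frequency")

# The 1 ms period used for Stratix 10 calibration still fits in 20 bits.
print("Stratix 10, 1 ms fits counter:", fits_counter(130e6, 1e-3))
```

Under this assumption, a 10 µs period gives a resolution slightly finer than 0.1% for all three average RO frequencies.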

5.1.3 Voltage Calibration

Since the Cyclone V device on the DE1-SoC board does not include an on-chip
voltage sensor, an alternate approach was needed to correlate RO count with voltage.
To control the voltage when calibrating the sensors, we desoldered the switching
regulator and its output inductor from one DE1-SoC board and supplied the FPGA
core voltage to that board directly from a Keysight E36312A benchtop power supply.
We varied the supplied voltage and at each step measured the FPGA input voltage
with a Keysight MSOX4154A oscilloscope and also recorded the frequency of the
sensors using test logic on the FPGA. To prevent any localized voltage drops and

Figure 5.3. Experimentally derived calibration curves (voltage [V] versus normalized
frequency) for (a) Cyclone V and (b) Arria 10 / Stratix 10, which relate frequency
changes to the supply voltage values that account for them. The frequency of a sensor
is inversely proportional to the propagation delay of the oscillating signal.

ensure that the measured voltage matches the voltage at the sensors, only the test
logic and sensors are active during calibration, which minimizes the power drawn by
the FPGA. Figure 5.3(a) shows the measured correspondence between voltage and
frequency of the sensors. The measurements from the RO sensors exhibit a consistent
trend across voltages, and the same trend is observed on all sensors, allowing us to
calibrate the relationship between voltage and normalized frequency. Unless otherwise
noted, all other DE1-SoC experiments used an unmodified board powered by the onboard switching regulator and output inductor.
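
Once a calibration curve is available, translating a normalized RO frequency into a voltage is a table lookup with interpolation. The following is a minimal sketch of one way to do this, assuming piecewise-linear interpolation between calibration points; the interpolation scheme, the function name, and the calibration points are hypothetical and are not the measured curve of Figure 5.3(a).

```python
from bisect import bisect_left


def frequency_to_voltage(norm_freq, calibration):
    """Piecewise-linear interpolation over (normalized_frequency, voltage)
    pairs sorted by increasing frequency."""
    freqs = [f for f, _ in calibration]
    if not freqs[0] <= norm_freq <= freqs[-1]:
        raise ValueError("frequency outside the calibrated range")
    i = bisect_left(freqs, norm_freq)
    if freqs[i] == norm_freq:
        return calibration[i][1]
    (f_lo, v_lo), (f_hi, v_hi) = calibration[i - 1], calibration[i]
    return v_lo + (v_hi - v_lo) * (norm_freq - f_lo) / (f_hi - f_lo)


# Hypothetical calibration points (normalized frequency, voltage in V).
CALIBRATION = [(0.40, 0.80), (0.60, 0.90), (0.80, 1.00), (1.00, 1.10)]

print(frequency_to_voltage(0.70, CALIBRATION))
```

Rejecting frequencies outside the calibrated range avoids silently extrapolating beyond the voltages that were actually applied during calibration.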
Unlike the Cyclone V device, the Arria 10 and Stratix 10 FPGAs are equipped
with on-chip voltage sensors [30,32] that can be used to calibrate the RO sensors. In a
series of calibration experiments, we varied the number of power wasters (Figure 5.4)
placed and triggered on the Arria 10 and Stratix 10 devices from 8,000 up to 30,000,
while monitoring readings from both the on-chip voltage sensor and an RO sensor
adjacent to it. Turning on a different number of wasters at each step causes a variation

in the reported RO counts and in the voltages measured by the on-chip sensor, allowing us
to identify the relationship between RO frequency and voltage on the Arria 10 and
Stratix 10 devices. The resulting Arria 10 and Stratix 10 calibration curves are shown
in Figure 5.3(b). The curves exhibit a similar trend to the one extracted from the
Cyclone V device. The DE5a-Net (Arria 10) and DE10-Pro (Stratix 10) boards were
unmodified for all experiments, including calibration.

5.1.4 Minimizing Temperature Effects

Although RO operation can potentially influence chip temperature, voltage gradients have a much more immediate impact on the measured RO delay than temperature [8, 84]. To minimize heating effects, our experiments were conducted using
sampling periods in the sub-millisecond range (e.g., less than 10 µs) with no more than
a hundred samples taken each time. An idle period of a few seconds between iterations
was introduced. The ambient temperature during the calibration and characterization experiments was kept at 24 °C. Neither the on-board nor on-chip temperature
sensors of the Arria 10 and Stratix 10 systems reported temperature fluctuations during the calibration process. This result indicates that thermal effects are negligible
in our characterization.

5.2 On-chip FPGA PDN Attack

5.2.1 Adversarial Power Consumption Circuit

We assume that an application on one part of the FPGA is adversarial and implements a design capable of high-power consumption to disturb the PDN. For initial
experiments, an area of 1,408 LABs (44 rows by 32 columns) was arbitrarily chosen
as a representative example of the Cyclone V FPGA real estate an adversary might
occupy, which is 32.8% of the total LABs on the chip. To evaluate the Arria 10 PDN,
an area of 11,424 LABs (168 rows by 68 columns) was arbitrarily allocated to the


Figure 5.4. Power waster circuit mapped to a Cyclone V / Arria 10 / Stratix 10
adaptive logic module (ALM) device.

adversary, occupying 23% of the FPGA real estate. Similarly, for the evaluation of
the Stratix 10 PDN, we arbitrarily allocated to the adversary an area that occupies
6% of the FPGA real estate (6,656 LABs, 104 rows by 64 columns). Dynamic power
is maximized by circuits with a high amount of switching; therefore, we allow the adversary to instantiate various quantities of single-stage ROs as power waster circuits.
Figure 5.4 shows an adaptive logic module (ALM) implementing two power wasters.
Up to 20 power wasters can be implemented in each Cyclone V, Arria 10, or Stratix
10 LAB. When instantiating a desired number of power wasters, a script places them
uniformly at random locations throughout the allocated region.
The power consumed by each instance in the three examined devices is shown in Table 5.1. Power consumption in the Cyclone V device was measured using the modified
DE1-SoC board and benchtop supply. Note that the power consumed per instance is
diminished as the number of instances grows. This result occurs because the power
wasters cause a local drop in supply voltage, which slows down their oscillation (reducing $f_{SW}$ in Eq. (5.1)) and causes the switching to occur at a lower voltage (reducing
$V_{DD}^2$). Although our later experiments use up to 12,000 power waster instances on a

Table 5.1. Power consumed by each RO-based power waster instance.

Cyclone V                    Arria 10                     Stratix 10
PW          Power /          PW          Power /          PW          Power /
instances   Inst. [mW]       instances   Inst. [mW]       instances   Inst. [mW]
160         1.13             12,000      2.17             14,000      2.49
1,600       1.02             16,000      2.18             18,000      2.50
3,200       0.91             20,000      2.21             22,000      2.48
4,800       0.84             24,000      2.18             26,000      2.31
6,400       0.75             28,000      2.20             30,000      2.30

Cyclone V located on an unmodified board, Table 5.1 ends at 6,400 because the 5 A
current limit is reached on the benchtop supply that powers the modified board.

$$
p_{dyn} = f_{SW} \cdot V_{DD}^{2} \cdot C
\tag{5.1}
$$
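
The aggregate power implied by Table 5.1 and the scaling behavior of Eq. (5.1) can be worked through in a few lines. This sketch is illustrative: the totals are simple products of the table entries (the Arria 10 entries are themselves inferred from a 12 V board measurement), and the 10% frequency and voltage reductions passed to `power_ratio` are hypothetical values, not measurements.

```python
# (power waster instances, mW per instance), copied from Table 5.1.
TABLE_5_1 = {
    "Cyclone V": [(160, 1.13), (1600, 1.02), (3200, 0.91),
                  (4800, 0.84), (6400, 0.75)],
    "Arria 10": [(12000, 2.17), (16000, 2.18), (20000, 2.21),
                 (24000, 2.18), (28000, 2.20)],
    "Stratix 10": [(14000, 2.49), (18000, 2.50), (22000, 2.48),
                   (26000, 2.31), (30000, 2.30)],
}


def dynamic_power(f_sw, v_dd, capacitance):
    """Eq. (5.1): dynamic power of a switching node."""
    return f_sw * v_dd ** 2 * capacitance


def power_ratio(f_scale, v_scale):
    """Relative change in Eq. (5.1) when f_SW and V_DD are scaled; C cancels."""
    return dynamic_power(f_scale, v_scale, 1.0) / dynamic_power(1.0, 1.0, 1.0)


def total_power_w(instances, mw_per_instance):
    """Power attributed to all power waster instances, in watts."""
    return instances * mw_per_instance / 1000.0


for device, rows in TABLE_5_1.items():
    instances, mw = rows[-1]
    print(f"{device}: {instances} instances -> "
          f"{total_power_w(instances, mw):.1f} W attributed to wasters")

# Hypothetical case: a 10 % drop in both oscillation frequency and voltage.
print(f"relative per-instance power: {power_ratio(0.9, 0.9):.3f}")
```

Because power depends on the square of the voltage, even a modest local voltage drop reduces per-instance power noticeably, which is consistent with the diminishing Cyclone V values in the table.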

Power consumption in the Arria 10 device was measured using an unmodified
DE5a-Net board via an on-board Texas Instruments INA231 [80] power monitor chip
on the 12 V supply. Unlike the Cyclone V results, increasing the number of wasters
in the Arria 10 device appears to have a negligible impact on the power consumed
by each instance (two columns in the middle of Table 5.1), although this finding is
inferred from 12 V power measurement and is therefore less direct than the Cyclone V
measurements. The INA231 reported that the power consumed reached 78 W when
28,000 wasters were activated. Beyond that point, attempts to further increase the
number of instances caused a crash and the loss of the FPGA configuration image.
Power consumption in the Stratix 10 device was measured using an unmodified
DE10-Pro board via an on-board Linear Technology LTC2945 [44] power monitor
chip. Unlike the INA231 on the DE5a-Net board (Arria 10), the LTC2945 is attached
directly to the 0.9 V FPGA supply voltage. This approach reveals a trend similar
to the Cyclone V measurements where the power consumption per instance slightly
diminishes as the number of RO instances grows. Activating 30,000 instances causes


a loss of JTAG communication between the host PC and the board. The LTC2945
reported that while the wasters were active with 30,000 instances, the total power
consumed by the FPGA device reached 98 W. Attempting to activate more than
30,000 instances results in an immediate loss of device operation. In the case of
either communication loss or an immediate crash, a hard reset is insufficient to make
the board accessible again and a manual power cycle is required.
It is important to note that the voltage sensors in this work are only used to
measure the effects of the power consumption circuits. Voltage sensors are calibrated
and used to measure the on-chip voltage at various time points at locations across
the die surface of the FPGA. Such information is used to characterize the effects of
the power consumption and potentially perform remediation.

5.2.2 Physical Characterization of Voltage Drop

To evaluate the PDN response of the devices to high-power consumption, experiments are performed with sensors placed at various distances away from the attack
circuitry. In the Cyclone V device, 12,000 power wasters turn on at time 0 and the
frequency of the sensors, or equivalently their supply voltage (Figure 5.5(a)), drops in
response to the attacker’s power consumption. The supply voltage measured by each
sensor initially drops, undershoots, and then settles back to a steady-state voltage
that is lower than the nominal 1.1 V for as long as the power wasters remain active.
At the center of the power consumption area, the supply voltage drops to a minimum
of 811 mV and reaches a steady state of 846 mV. Sensors farther away observe a
similar behavior but a smaller magnitude of voltage drop.
Similarly, 28,160 power wasters are placed on the Arria 10 fabric and are simultaneously activated while 12 sensors, placed at different distances to the center of the
attack, capture the PDN response. The measured voltage at the 12 different sensor
locations is shown in Figure 5.5(b). To a greater extent than in Cyclone V, the

Figure 5.5. Normalized RO sensor counts (left axis) and their corresponding voltages
(right axis) versus time in µs, measured by sensors before and during a power wasting
attack that begins at time 0, for (a) Cyclone V, (b) Arria 10, and (c) Stratix 10. The
legend shows the distance between each sensor and the center of the power wasting
region (0 to 53 LAB columns in (a), 5 to 177 in (b), and 7 to 181 in (c)).

voltage drop in Arria 10 is followed by an overshoot before settling back to a steady-state
voltage. The sensor farthest from the center of the attack observes a peak-to-peak
voltage swing of 125 mV, corresponding to 14% of the nominal 0.9 V supply voltage.
In the Stratix 10 device, we can increase the number of activated wasters to 30,000
and still extract sensor data. Figure 5.5(c) shows the Stratix 10 PDN response as
captured by 12 sensors when 30,000 power wasters are simultaneously activated. The supply voltage at the center of the power consumption area drops to a
minimum of 800 mV and reaches a steady state of 863 mV. The magnitude of the
voltage drop in the Arria 10 and Stratix 10 devices becomes smaller with increasing
distance to the power wasting region. This result is consistent with the Cyclone V
observations shown in Figure 5.5(a).

5.2.2.1 Varying the Amount of Power Consumed

As one might expect, attacks wasting more power cause larger voltage drops. The
voltage drops are observed at the site of the attack and also in the surrounding area
of the die. Figure 5.6 shows voltage plotted against distance from the center of the
attack on the Cyclone V device; each line in Figure 5.6 corresponds to a different
number of power wasters being instantiated and used in the attack. We can observe
in each attack that the supply voltage change can have a far-reaching impact on
other circuitry. Even 53 columns away from the center of attack, the supply voltage
is reduced from 1.1 V to 967 mV in the strongest attack.

5.2.2.2 Role of the Inductor in Undershoot

The voltage undershoot observed in Figure 5.5 is caused by the large and sudden
change in the current drawn from the FPGA core supply when the power wasters
all turn on simultaneously. The sudden change in current creates a voltage drop
across the inline inductor of the switching regulator, which thereby reduces the voltage supplied to the chip (Eq. (5.2)). Figure 5.7 shows the core voltage dropping in

[Figure 5.6 plot: voltage [V] versus distance to center of PW [LAB columns], one line per attack size: 160 PWs (113 ALMs), 1.6K PWs (1.1K ALMs), 3.2K PWs (2.3K ALMs), 4.8K PWs (3.4K ALMs), 6.4K PWs (4.5K ALMs), 8.0K PWs (5.7K ALMs), 9.6K PWs (6.8K ALMs), 11.2K PWs (7.9K ALMs), and 12.0K PWs (8.5K ALMs).]

Figure 5.6. Voltage change across distance for various numbers of power wasters
instantiated in the Cyclone V device.

the DE1-SoC board when 12,000 power wasters turn on, as captured by a Keysight
MSOX4154A oscilloscope. The waveform shows that, at its lowest point, the supply
voltage has dropped by 85 mV. Integrating the measured inductor voltage with respect to
time shows that the current draw increases by more than 2.5 A within just 60 µs.

$$V_{core} = V_{reg} - V_L = V_{reg} - L\,\frac{di}{dt} \tag{5.2}$$
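
As a rough illustration of the integration step mentioned above, Eq. (5.2) can be rearranged as Δi = (1/L)∫V_L dt and evaluated numerically. The sketch below does this with the trapezoidal rule on a synthetic waveform. The inductance (1 µH) and the triangular pulse shape (85 mV peak, 60 µs wide) are assumptions chosen only so that the result lands near the reported figure; they are not taken from the measurement in Figure 5.7.

```python
def current_change(times_s, v_inductor_v, inductance_h):
    """Return delta-i = (1/L) * integral of V_L dt, using the trapezoidal rule."""
    volt_seconds = 0.0
    for k in range(1, len(times_s)):
        dt = times_s[k] - times_s[k - 1]
        volt_seconds += 0.5 * (v_inductor_v[k] + v_inductor_v[k - 1]) * dt
    return volt_seconds / inductance_h


# Hypothetical inputs (assumptions, not measured data).
ASSUMED_INDUCTANCE_H = 1e-6   # assumed inline inductor value
PEAK_DROP_V = 0.085           # peak drop reported in the text
PULSE_WIDTH_S = 60e-6         # time window reported in the text
NUM_SAMPLES = 601

times = [PULSE_WIDTH_S * k / (NUM_SAMPLES - 1) for k in range(NUM_SAMPLES)]
# Synthetic triangular pulse: 0 V at both ends, PEAK_DROP_V in the middle.
v_l = [PEAK_DROP_V * (1 - abs(2 * t / PULSE_WIDTH_S - 1)) for t in times]

delta_i = current_change(times, v_l, ASSUMED_INDUCTANCE_H)
print(f"Estimated current change: {delta_i:.2f} A")
```

With these assumed values the estimate is about 2.55 A, which is in line with the "more than 2.5 A" stated above. A different inductance or pulse shape would scale the result accordingly.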

The 85 mV voltage drop measured across the inductor impacts every part of the
FPGA that shares the same supply, which can allow an attacker to affect victim
circuits regardless of their position on the chip. Unlike the L(di/dt) drop, the iR
voltage drop due to resistances in the PDN depends only on the current, and not
on the change in current. The L(di/dt) drop is therefore maximal while the current is
changing, whereas the iR drop is maximal after the current has changed, so the two do
not contribute their peak values at the same time. The largest total voltage
drop observed is a combination of the L(di/dt) drop from the inductor and
the iR drop of the power grid. At the same time that the core voltage is being

Figure 5.7. Voltage drop from the inductor in DE1-SoC (Cyclone V), measured at
test pad VCC1P1.

[Figure 5.8 plot: voltage [V] versus distance to center of PW [LAB columns] for three series (Before Attack, Undershoot, Steady State), annotated with voltage differences of 99 mV, 70 mV, 254 mV, 62 mV, 35 mV, and 63 mV.]

Figure 5.8. Voltage versus sensor position on the Cyclone V chip.

measured on the board using the oscilloscope (Figure 5.7), sensors are measuring the
internal voltage at different locations on the chip. At each sensor location, we extract
the minimum voltage reached and the voltage reached in steady-state when the power
wasters are active and the current is constant, which is purely an iR drop. Figure 5.8
plots these two voltages against the distance between sensor and center of power
consumption. In the sensor farthest from the power consumption, the PDN voltage
has a relatively small steady state iR drop. Due to the inductor, the minimum voltage
reached is 70 mV below steady state, which is almost the full 85 mV drop observed
on the oscilloscope measurement.

5.3 Summary

In this chapter, in a series of experiments, we characterize voltage drops in the
on-chip FPGA PDN caused by the activation of a fraction of the available logic as
power wasters. DE1-SoC, DE5a-Net, and DE10-Pro boards, containing Intel Cyclone
V, Arria 10, and Stratix 10 FPGAs, respectively, are used to evaluate the effects. Our
experiments show that voltage drops caused by inductance (L(di/dt)) can be used to
target tenants located far from the power wasting area. In the next chapter, we show
how a malicious user can orchestrate such voltage drops to induce timing faults in
the circuits of a co-located tenant.

CHAPTER 6
CAUSING FAULTS VIA PDN MANIPULATION

A decrease in supply voltage causes an increase in the propagation delay of combinational logic. Path delay faults will be caused by a reduced supply voltage if the
propagation of the combinational results does not satisfy the setup time requirements of
the capturing flip-flops. Having shown that aggressive power consumption can cause
a far-reaching drop in supply voltage, we now examine whether the voltage drop can
induce path delay faults in a victim circuit. For simplicity, we use ripple-carry adders
as test circuits since their carry chains can provide differing path lengths. The type of
characterization discussed in this chapter focuses on Cyclone V and Arria 10 devices.
The experimental setup used in this chapter for timing fault characterization revealed
no faults when deployed on a Stratix 10 device.
The timing-fault characterization we performed on Cyclone V and Arria 10 devices
is discussed in Section 6.1. In Section 6.2, we correlate the fault susceptibility of a
sensitized path to voltage. Section 6.3 discusses the potential impact of two well-known side-channel attack mitigation techniques on delay fault attacks. The chapter
is summarized in Section 6.4.
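
The setup-time condition stated at the start of this chapter can be written as a simple check. The sketch below is a simplified model, not the timing analysis used in the experiments: it ignores clock skew and hold constraints, and the `delay_scale` factor (how much a voltage drop stretches the path delay) is a hypothetical input rather than a value derived from device data.

```python
def setup_slack_ns(clock_period_ns, setup_time_ns, path_delay_ns):
    """Slack is what remains of the clock period after path delay and setup time."""
    return clock_period_ns - setup_time_ns - path_delay_ns


def faults_under_voltage_drop(clock_period_ns, setup_time_ns,
                              nominal_delay_ns, delay_scale):
    """Return True if the stretched path delay violates the setup requirement.

    delay_scale is a hypothetical multiplier (>= 1.0) standing in for the
    delay increase caused by a reduced supply voltage.
    """
    stretched_delay_ns = nominal_delay_ns * delay_scale
    return setup_slack_ns(clock_period_ns, setup_time_ns, stretched_delay_ns) < 0


# Illustrative numbers only: a path with 0.5 ns of nominal slack.
nominal_slack = setup_slack_ns(10.0, 0.5, 9.0)
small_drop_faults = faults_under_voltage_drop(10.0, 0.5, 9.0, 1.05)
large_drop_faults = faults_under_voltage_drop(10.0, 0.5, 9.0, 1.10)
```

In this toy case a 5% delay increase is absorbed by the slack, while a 10% increase produces a setup violation, which mirrors the observation that paths with more slack are less susceptible.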

6.1 Demonstration of Path Delay Faults

Our first path delay experiments use 12,000 power wasters within a block of 1,408
LABs in a Cyclone V device and 28,160 wasters within a block of 11,424 Arria 10
LABs. The victim (i.e., the ripple carry adder) has been hand placed adjacent to the
attack area in a single LAB column, which in the Cyclone V and Arria 10 experiments
is 23 and 38 LAB columns away from the center of the attacker, respectively. A
script generates vectors that sensitize paths with slack ranging from +3 ns to −2 ns
in the Cyclone V device and from +0.2 ns to −0.5 ns in the Arria 10 device. The
timing slack of each path in an adder instance is reported using the TimeQuest
Timing Analyzer [29]. The slow 1100 mV 85 °C model is used for the Cyclone V
implementation of the adder and the slow 900 mV 100 °C model for the Arria 10
implementation. The vectors are repeatedly applied during power attacks, and a log
is kept with the faults and their timestamps.
Figure 6.1 shows the faults that occur from the attack. The X and Y coordinates of
each point denote the time and reported slack of the path on which the fault occurred,
respectively. Paths with more slack are less susceptible to delay faults. Every point
on the plot depicts the capture of an incorrect result. Red points denote faults on
paths with positive slack, which are paths that meet timing constraints according to
the conservative timing model. Blue points originate from paths that have negative
slack according to the conservative timing model, but are error free in the absence of
an attack.
The results in Figure 6.1 indicate that in both devices there is a period in which
faults occur (e.g., 10 µs to 20 µs for the Cyclone V device and 5 µs to 10 µs for the Arria
10 device). The Arria 10 results (Figure 6.1(b)), however, show an additional peak of
faults immediately following the enabling of the wasters. These faults are attributed
to the initial response of the DE5a-Net/Arria 10 PDN to the sudden activation of the
wasters that led to a large but brief voltage drop, also observed by Zick et al. [93] in
a Xilinx Kintex-7 device.
As discussed in Section 5.2.2.2, the simultaneous activation of all the power wasters
causes a large and sudden change in the current drawn by the FPGA. Figure 6.2 shows
the core voltage dropping in DE1-SoC and DE5a-Net boards when the power wasters
turn on, as captured by a Keysight MSOX4154A oscilloscope. In the Cyclone V

(a) Cyclone V delay fault test. Adder is placed 23 LAB columns away from the center of the attack.
(b) Arria 10 delay fault test. Adder is placed 38 LAB columns away from the center of the attack.

Figure 6.1. Delay faults on adder circuits placed outside the wasting area when the
adversary at time 0 turns on 12,000 and 28,160 power wasters in Cyclone V and Arria
10 devices, respectively. X-coordinate denotes the time the fault occurred during the
attack. Y-coordinate is the reported timing slack of the exercised path.

device, the waveform in Figure 6.2(a) shows that the peak voltage drop of 85 mV
occurs roughly 16 µs after the power wasters turn on. In the Arria 10 device (Figure 6.2(b)), the peak voltage drop of 38 mV occurs roughly 8 µs after the activation
of the wasters. For each device, the timing of the minimum voltage as measured on
the scope (Figure 6.2) corresponds to the timing of the minimum voltage observed in
on-chip sensors (Figures 5.5(a) and (b)), and the time at which the most severe delay
faults occur (Figure 6.1).
The 85 mV and 38 mV peak voltage drops measured across the inline inductors
and shown in Figure 6.2 impact every part of the FPGA that shares the same supply.
This characteristic allows a malicious tenant to target victim circuits regardless of
their distance to the attack circuitry. To examine the spatial impact of the on-chip
voltage drop, we placed ripple-carry adders in the Cyclone V device at distances 23,
26, 31, 35, 37, 40, 44, 47, and 52 LAB columns away from the center of the power

(a) DE1-SoC/Cyclone V: voltage drop measured at test pad VCC1P1.

(b) DE5a-Net/Arria 10: voltage drop measured at the positive terminal of on-board
decoupling capacitor labeled as C371.

Figure 6.2. Turning on the power waster circuit causes a large instantaneous change
in current. The instantaneous change causes a voltage drop on the off-chip inductor
which affects all parts of the chip.

waster region. Similarly, in the Arria 10 device, we instantiated adders at distances
38, 48, 60, 70, 76, 87, 97, 107, 118, 138, 148, and 160 columns from the region center.
Figure 6.3(a) shows that in the Cyclone V device the attack causes faults on legal
paths with positive timing slack that are 40 LAB columns away from the center of

(a) Cyclone V with victim 23-40 LAB columns away from the center of the attack.

(b) Arria 10 with victim 38-160 LAB columns away from the center of the attack.

Figure 6.3. Examining timing faults at different distances between the adder and center of attack in Cyclone V and Arria 10
devices.

the wasting area. The attack impact gradually diminishes with increased distance
from the waster (Figure 5.5(a)). Adders placed farther away exhibit fewer faults.
Figure 6.3(b) focuses on the first 1 µs of the attack in the Arria 10 device. Although
the impact of the attack weakens at increasing distance from the wasters, faults in
paths with positive slack are observed at all examined distances. Since faults were
induced on legal paths at the device outskirts in both tested devices, it is apparent
that spatial isolation between tenants is insufficient to protect against PDN attacks
in multi-tenant FPGA applications.

6.2 Relating Voltage and Timing Slack to Fault Sensitivity

Having demonstrated the capability to cause delay faults and characterized the PDN
voltage in response to power consumption, we now connect the two by using the
Cyclone V device to show experimentally the combinations of slack and voltage that
lead to faults. In this experiment, 1,024 random attack scenarios are created and
implemented by choosing at random the following parameters:
• The position of the victim adder circuit (between 23 and 53 columns from center
of attacker).
• The sensitized path of the victim adder (uses between 53 and 64 stages of carry
logic implemented on the hardened carry circuitry of the FPGA).
• The number of power wasters used by the attacker (between 3,200 and 12,000
instances).
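
The random selection of attack parameters listed above could be sketched as follows. This is only an illustration: the uniform integer draws, the fixed seed, and the dictionary field names are assumptions, since the text does not specify the sampling distribution or the granularity of each parameter.

```python
import random


def make_attack_scenarios(count=1024, seed=0):
    """Draw `count` random attack scenarios within the ranges given in the text.

    Assumptions: each parameter is drawn uniformly and independently as an
    integer; the seed and field names are hypothetical.
    """
    rng = random.Random(seed)
    scenarios = []
    for _ in range(count):
        scenarios.append({
            # Victim adder position, in LAB columns from the attacker center.
            "victim_distance_cols": rng.randint(23, 53),
            # Number of carry-logic stages on the sensitized path.
            "carry_stages": rng.randint(53, 64),
            # Number of power wasters activated by the attacker.
            "num_wasters": rng.randint(3200, 12000),
        })
    return scenarios


scenarios = make_attack_scenarios()
```

Using a seeded generator keeps the scenario list reproducible, which helps when a faulting scenario needs to be re-implemented and re-run.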
The minimum voltage at the victim circuit during each attack is inferred by interpolation on the data shown in Figure 5.6 according to the victim location and
number of power wasters in the attack. As in the prior subsection, the path is repeatedly sensitized during the attack, and the result is checked for faults. Red and

[Figure 6.4 plot: reported slack [ns] versus minimum voltage at path [V], with the voltage axis spanning 0.90 V to 1.10 V.]

Figure 6.4. Scatter plot shows which randomly generated attack scenarios caused
faults and which did not. X-coordinate denotes voltage in victim circuit during attack.
Y-coordinate is the reported timing slack of path exercised during attack.

green marks in Figure 6.4 denote attack scenarios in which faults did or did not occur,
respectively. The X-coordinate of each point is the minimum voltage at the victim
during the attack. The Y-coordinate of each point is the timing slack of the victim
path as reported by the TimeQuest Timing Analyzer using the slow 1100 mV 85 °C
timing model.
Timing models are conservative with respect to operating conditions and process
variation, and the effects of the conservative timing model can be seen in Figure 6.4.
Paths reported as having 0 slack are typically fault free even when their voltage
drops by 140 mV, although Figure 5.3(a) shows that a 140 mV drop should cause a
significant increase in propagation delay.
The pattern of faulty and fault-free points in Figure 6.4 shows that voltage and
timing slack are largely sufficient to explain which adder paths will experience faults
during an attack. This finding supports the supply voltage drop being the cause of
the fault and not some other artifact of power consumption. The results also show
that conservative timing models provide some inherent margin against attack.
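
The voltage inference used in this section (interpolating the Figure 5.6 data according to victim location and number of power wasters) could be sketched as a bilinear interpolation over a small grid. The interpolation scheme itself is an assumption, since the text only says "interpolation". Apart from the 967 mV value at 53 columns for the strongest attack, which is quoted in Section 5.2.2.1, the grid values below are hypothetical placeholders and not measurements.

```python
from bisect import bisect_right


def _bracket(axis, x):
    """Return (lo, hi, t) such that axis[lo] <= x <= axis[hi] and
    x = axis[lo] + t * (axis[hi] - axis[lo])."""
    if not axis[0] <= x <= axis[-1]:
        raise ValueError("query outside the characterized range")
    hi = min(bisect_right(axis, x), len(axis) - 1)
    lo = hi - 1
    t = (x - axis[lo]) / (axis[hi] - axis[lo])
    return lo, hi, t


def min_voltage(distances, wasters, grid, distance, num_wasters):
    """Bilinear interpolation; grid[i][j] is volts at wasters[i], distances[j]."""
    i0, i1, u = _bracket(wasters, num_wasters)
    j0, j1, v = _bracket(distances, distance)
    low_row = grid[i0][j0] * (1 - v) + grid[i0][j1] * v
    high_row = grid[i1][j0] * (1 - v) + grid[i1][j1] * v
    return low_row * (1 - u) + high_row * u


# Hypothetical characterization grid (volts). Only the 0.967 V entry is quoted
# from the text; the other three values are made up for illustration.
DISTANCES_COLS = [23, 53]
WASTER_COUNTS = [3200, 12000]
VOLTAGE_GRID = [
    [1.050, 1.080],   # 3,200 wasters at 23 and 53 columns (hypothetical)
    [0.900, 0.967],   # 12,000 wasters at 23 (hypothetical) and 53 columns
]

estimate = min_voltage(DISTANCES_COLS, WASTER_COUNTS, VOLTAGE_GRID, 38, 7600)
```

A real characterization would use every curve and sensor position from Figure 5.6 rather than a 2x2 grid; the function accepts longer sorted axes unchanged.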

6.3 Relationship to FPGA Logic Isolation and Active Fencing

The results shown in Figure 6.3 indicate the ability of power wasters to induce
faults over a wide extent of the FPGA die. The shared nature of the supply voltage
PDN on the FPGA causes the voltage drop induced by the wasters to impact supply
voltage across the chip, even though the waster and target logic components are
logically isolated. Although leading FPGA companies, such as Xilinx and Intel, allow
for the isolation of logic design subcomponents [5, 85], our results and those of other
groups indicate that logic isolation is not effective in preventing these types of attacks.
Recently, Krautter et al. [38] proposed to use ROs to perform active fencing against
side-channel attacks on encryption cores. In these experiments, voltage sensors are
used to identify small changes in on-chip voltage due to encryption operations. These
perturbations are then used to identify the encryption key. To prevent such attacks,
ROs located close to the encryption core are enabled, making it difficult to identify
the small voltage changes induced by the encryption core. In our attack, the ROs
used by active fencing would enhance the attack, rather than prevent it. The active
fence would further reduce on-chip voltage to induce additional faults in the target
circuit.

6.4 Summary

In this chapter, we show that an FPGA user can aggressively consume power to
manipulate the FPGA supply voltage and induce timing faults in co-located circuits.
In a multi-tenant scenario, this finding indicates that protection mechanisms that are
solely based on the spatial isolation between users are not likely to be effective. The
next chapter demonstrates how a malicious user can manipulate the FPGA supply
voltage to inject faults into the circuit of an unsuspecting co-located tenant and
extract secret information from it.

CHAPTER 7
EXPLOITING VOLTAGE DROPS FOR SECURITY
BREACHES

Chapter 6 showed that in a multi-tenant setting a malicious tenant can manipulate
the FPGA supply voltage to induce delay faults in another tenant’s circuits. In this
chapter, we demonstrate the breadth of the threat by performing a fault injection attack on an FPGA-based crypto circuit that implements the Rivest-Shamir-Adleman
(RSA) algorithm. The attack is able to extract private RSA keys from the faulty
outputs produced by the circuit. Unlike previous approaches, our attack does not
require any modifications to the encryption core, nor power wasting that is synchronized with the execution of specific rounds of the encryption operation. The attack
is performed remotely and does not require physical access to the device.
Section 7.1 provides the background for the RSA cryptosystem and Section 7.2
describes the architectural details of the RSA FPGA implementation we developed
for experimentation. The RSA key recovery attack is discussed in Section 7.3. The
chapter concludes in Section 7.4.

7.1 RSA Cryptosystem Background

The RSA cryptosystem [70], based on an asymmetric cryptographic algorithm proposed in 1977, is still widely used for secure data transmission. The RSA encryption
of a message involves the computation of the modular exponentiation $Y = X^{e} \bmod N$,
where X is the message to be encrypted, e is the public exponent, N is the RSA modulus, and Y is the resulting ciphertext. Similarly, the decryption function is described
as $X = Y^{d} \bmod N$, where d is the private exponent. The public exponent e along
with the modulus N compose the public key $k_{pub} = (N, e)$, which can be known by
everyone and used for encrypting messages. The private exponent d is kept secret as
the private key $k_{pr} = (N, d)$ and used for decryption. The RSA modulus N is used
as the modulus for both the public and private keys and is computed as $N = p \cdot q$,
where p and q are two large, randomly generated prime numbers. The primes p and
q are usually 512 to 2,048 bits long and must be kept secret.
Performing modular exponentiation with large exponents can become impractical
for conventional processors and hence modern systems often use dedicated hardware
to accelerate the computationally expensive operations. One common technique to
further speed up modular exponentiation of long numbers is based on the Chinese
Remainder Theorem (CRT). The CRT can be applied for encrypting a message X as
follows:

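$$Y = X^{e} \bmod N = \bigl[(a \cdot q) \cdot Y_p + (b \cdot p) \cdot Y_q\bigr] \bmod N \tag{7.1}$$

where a, b are predefined constants and $Y_p$, $Y_q$ are computed as:

$$\begin{aligned}
Y_p &= (X \bmod p)^{(e \bmod (p-1))} \bmod p \\
Y_q &= (X \bmod q)^{(e \bmod (q-1))} \bmod q
\end{aligned} \tag{7.2}$$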

CRT avoids performing computations with the full-length exponents e, d, and
modulus N, and instead performs two separate and faster modular exponentiations
with numbers bounded by the “shorter” prime numbers p and q (see Eq. (7.2)). In
the last step of CRT, the two partial results $Y_p$ and $Y_q$ are combined (see Eq. (7.1)) to
construct the encrypted message Y. CRT exponentiation is shown to be four times
faster than direct exponentiation [56] but can only be used by the party who possesses
the private key $k_{pr}$ and the two prime numbers p and q.
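
A small worked example may help make Eqs. (7.1) and (7.2) concrete. The sketch below assumes the constants are the usual CRT coefficients, a = q^(-1) mod p and b = p^(-1) mod q; the text only calls them "predefined constants", so this choice is an assumption. The toy primes are far too small for real use, and the three-argument `pow` with a negative exponent requires Python 3.8 or later.

```python
def crt_modexp(x, exponent, p, q):
    """Compute x**exponent mod (p*q) via the CRT, following Eqs. (7.1)-(7.2).

    Assumes gcd(x, p*q) == 1 and that p and q are distinct primes.
    """
    n = p * q
    # Eq. (7.2): two shorter exponentiations modulo p and modulo q.
    y_p = pow(x % p, exponent % (p - 1), p)
    y_q = pow(x % q, exponent % (q - 1), q)
    # Assumed CRT constants: a = q^-1 mod p, b = p^-1 mod q.
    a = pow(q, -1, p)
    b = pow(p, -1, q)
    # Eq. (7.1): recombine the partial results.
    return ((a * q) * y_p + (b * p) * y_q) % n


# Toy parameters for illustration only.
p, q, e = 61, 53, 17
n = p * q
x = 65
y_crt = crt_modexp(x, e, p, q)
y_direct = pow(x, e, n)
print(y_crt == y_direct)
```

The recombination works because (a·q) is congruent to 1 modulo p and to 0 modulo q, and (b·p) is congruent to 1 modulo q and to 0 modulo p, so the sum reduces to Y_p modulo p and to Y_q modulo q.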

7.2 Hardware Implementation

To investigate whether deliberately caused fluctuations of the FPGA’s core voltage
can reveal the RSA private key when the CRT is used, we implemented a parameterizable RSA core on the Cyclone V device. The CRT-based RSA core consists of
a single modular exponentiation unit and a control-path state machine for sequentially
calculating $Y_p$ and $Y_q$, described in Eq. (7.2). Modular exponentiation is realized
using the standard square-and-multiply algorithm and a Montgomery multiplier for
eliminating the requirement for the division operation [18]. Interfacing with the host
PC is accomplished through a JTAG-accessible on-chip memory, which is used to control the
RSA core, write its inputs (e.g., p, q, d, and X), and read out its outputs (e.g.,
$Y_p$ and $Y_q$).
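
For readers unfamiliar with square-and-multiply, a minimal software sketch is given below. It is not the hardware design described above: it scans the exponent bits from most to least significant (an assumed order) and uses the ordinary `%` operator for reduction instead of a Montgomery multiplier.

```python
def square_and_multiply(base, exponent, modulus):
    """Left-to-right binary modular exponentiation.

    One squaring per exponent bit, plus one multiplication for each 1 bit.
    Reduction uses `%` here; the hardware core uses Montgomery multiplication.
    """
    result = 1 % modulus
    base %= modulus
    for bit in bin(exponent)[2:]:          # most significant bit first
        result = (result * result) % modulus
        if bit == "1":
            result = (result * base) % modulus
    return result


# Toy check against Python's built-in modular exponentiation.
matches_builtin = square_and_multiply(65, 17, 3233) == pow(65, 17, 3233)
```

The loop runs once per exponent bit and each iteration works on operands as wide as the modulus, which gives some intuition for why the operation count grows quickly with key length.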
Table 7.1 shows the resource utilization and maximum clock speed of the RSA
core for three different key lengths. The critical path of the architecture resides in
the control-path state machine implementing the Montgomery multiplier. The 128-bit implementation on average completes a single RSA operation (e.g., encryption
or decryption) in 0.59 ms (1,695 ops/s). Due to the combined use of the square-and-multiply method and Montgomery multiplication, doubling the length of the key
quadruples the required clock cycles for a single RSA operation. In addition, the larger
designs must be clocked at a lower frequency. The 256- and 512-bit implementations
complete on average a single RSA operation in 3 ms (333 ops/s) and 14.68 ms (68
ops/s), respectively.
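
The throughput figures and the cycle-count scaling quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes each core is clocked at its reported Fmax from Table 7.1; the text does not state the actual operating frequency, so the cycle counts are estimates only.

```python
# (average latency in seconds, reported Fmax in Hz) per key length,
# taken from the text and Table 7.1.
RSA_CORES = {
    128: (0.59e-3, 94.74e6),
    256: (3.00e-3, 77.12e6),
    512: (14.68e-3, 61.50e6),
}


def ops_per_second(latency_s):
    """Throughput is the reciprocal of the per-operation latency."""
    return 1.0 / latency_s


def estimated_cycles(latency_s, clock_hz):
    """Cycle estimate, assuming the core runs at clock_hz (here: Fmax)."""
    return latency_s * clock_hz


cycles = {bits: estimated_cycles(*RSA_CORES[bits]) for bits in RSA_CORES}
growth_128_to_256 = cycles[256] / cycles[128]
growth_256_to_512 = cycles[512] / cycles[256]

for bits, (latency_s, _) in RSA_CORES.items():
    print(f"{bits}-bit: {ops_per_second(latency_s):.0f} ops/s, "
          f"about {cycles[bits]:,.0f} cycles")
```

Under that assumption the estimated cycle counts grow by factors of roughly 4.1 and 3.9 per doubling of the key length, consistent with the quadrupling described above.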

7.3 Fault Injection Attack

Cloud FPGAs have steered attention to a new class of attacks that require neither
physical device access nor expensive laboratory equipment. It has been shown that
FPGA implementations of AES are susceptible to both power analysis attacks [71]
using on-chip sensors and fault injection [39] using power wasters. RSA implementations have been exposed to power analysis attacks [89] where the private key was
successfully extracted using power traces meticulously captured by on-chip ring oscillators. In contrast, fault injection attacks, like the one described in this section, do
not require calibrated sensors to detect information leakage.

Table 7.1. Resources used in RSA core and corresponding reported Fmax for the
three supported key lengths in the Cyclone V device.

Key Length   ALMs (Avail.: 32,070)   Flip-flops (Avail.: 128,280)   Memory (Avail.: 3,970 Kb)   Fmax [MHz]
128-bit      1,236 (3.9%)            1,925 (1.5%)                   16 Kb (<1%)                 94.74
256-bit      2,003 (6.2%)            3,463 (2.7%)                   16 Kb (<1%)                 77.12
512-bit      4,030 (12.6%)           6,537 (5.0%)                   16 Kb (<1%)                 61.50

Boneh et al. [12] showed that arbitrary errors in computations of RSA with CRT
make the factorization of the modulus N feasible. It can be shown that if an error
occurs while computing one of the modular representations $Y_p$ or $Y_q$ (see Eq. (7.2))
then the secret prime numbers p, q can be recovered. Lenstra [41] showed that if
the RSA input is known, then the prime numbers can be recovered using only faulty
outputs. Assuming a faulty $Y_p$, Lenstra’s approach provides q as $\gcd(X - \hat{Y}^{e}, N)$, where
X is the RSA input, $\hat{Y}$ is the faulty output, and e is the public exponent. Then, the
private exponent d can be derived as $d = e^{-1} \bmod ((q - 1) \cdot (N/q - 1))$. Similarly, a
Y output composed using a faulty $Y_q$ reveals the prime number p, and the private
exponent d can then be derived as $d = e^{-1} \bmod ((p - 1) \cdot (N/p - 1))$. Note that since
$N = p \cdot q$ the attacker needs to know neither which prime number (p or q) has been
exposed nor which of the partial results ($Y_p$ or $Y_q$) is incorrect. As long as only one of
the two partial results composing the final output Y is incorrect, a single interaction
with the cryptosystem is sufficient for extracting the private key $k_{pr} = (N, d)$.
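
The key recovery described above can be reproduced in software with toy numbers. The sketch below is an illustration under stated assumptions, not the experimental setup: the CRT computation uses the private exponent d (as in a signing operation), the CRT constants are the standard inverses, the fault is modeled by simply adding 1 to $Y_p$, and the primes are tiny. It requires Python 3.8 or later for modular inverses via `pow`.

```python
from math import gcd


def crt_sign(x, d, p, q, fault_in_yp=False):
    """CRT-based x**d mod (p*q); optionally corrupt the partial result Y_p."""
    n = p * q
    y_p = pow(x % p, d % (p - 1), p)
    y_q = pow(x % q, d % (q - 1), q)
    if fault_in_yp:
        y_p = (y_p + 1) % p          # hypothetical fault model
    a = pow(q, -1, p)                # assumed CRT constants
    b = pow(p, -1, q)
    return ((a * q) * y_p + (b * p) * y_q) % n


def recover_private_exponent(x, y_faulty, e, n):
    """Lenstra-style recovery: factor n from one faulty output, then derive d.

    Returns None if the gcd is trivial (for example, if the output was correct).
    """
    factor = gcd((x - pow(y_faulty, e, n)) % n, n)
    if factor in (1, n):
        return None
    other = n // factor
    return pow(e, -1, (factor - 1) * (other - 1))


# Toy parameters for illustration only.
p, q, e = 61, 53, 17
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))
x = 65

y_faulty = crt_sign(x, d, p, q, fault_in_yp=True)
d_recovered = recover_private_exponent(x, y_faulty, e, n)
print(d_recovered == d)
```

Note that the recovery routine never needs to know which partial result was faulty: whichever prime the gcd exposes, the other follows from N, as stated in the text.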
A potential scenario where this attack may apply involves a Certification Authority
(CA) service that uses hardware to accelerate RSA signing. Assume that an
adversary who can induce faults in the hardware of the service requests a certificate
and then sends message X to be signed. If the returned certificate Y is assembled with
an erroneous $Y_p$ or $Y_q$, then the CA’s private key $k_{pr} = (N, d)$ will be exposed. That is, the
adversary is in possession of the CA’s already known public key $k_{pub} = (N, e)$, the initial
message X, and a faulty output Y. By applying Lenstra’s approach the adversary
can extract one of the two primes and use it to reconstruct the CA’s private exponent
d, which can then be used for issuing fake certificates. A similar use-case scenario is
discussed in Pellegrini et al. [57], who, however, attack a SPARC-based RSA
implementation by manipulating the supply voltage of the system in order to inject
faults.
In the Cyclone V device, Lenstra’s approach is put to the test by instantiating
the RSA core described in Section 7.2 along with power wasters placed at random
locations in the surrounding area. A script running on the host PC generates a set of
RSA variables (e.g., message X and keys $k_{pub}$, $k_{pr}$), passes the inputs to the RSA core,
triggers an RSA operation, and activates the wasters. When the RSA operation is
over, the script reads the output of the RSA core and attempts to extract the private
key $k_{pr}$ using Lenstra’s approach. A precompiled library containing bitstreams with
various numbers of wasters is used to examine different attack magnitudes. Each
bitstream undergoes 50 trials using randomly generated RSA inputs, and a log is
kept with the outcome of each attempt.
The activation of the wasters during the RSA operation can result in three outcomes: 1) the attack has no impact on the RSA core and, thus, it outputs the expected
Y; 2) the attack induces timing faults resulting in a faulty output Y which reveals
the private key $k_{pr}$; and 3) the voltage drop due to the attack triggers an undesirable
board reset and loss of the FPGA configuration image. The probabilities of these three
outcomes are summarized in Figure 7.1. The x-axis denotes the number of wasters
that are activated during the RSA operation, blue corresponds to the probability of

[Figure 7.1, panels (a) 128-bit RSA core, (b) 256-bit RSA core, and (c) 512-bit RSA core. Each panel is a stack plot of probability (0.00 to 1.00) versus number of PWs [k] (6 to 13), with the series "Extracting key" and "Resetting board".]

Figure 7.1. Stack plot shows the probability of different outcomes when attacking
RSA using various numbers of wasters instantiated in the Cyclone V device. Successfully extracting the RSA private key constitutes the blue part of the plot. Unintentionally
resetting the board due to the attack constitutes the orange part of the plot.

successfully extracting the private key $k_{pr}$, and orange is the probability of resetting the board. Although in theory the attack should work for any key length, we
examined the three key lengths (128-, 256-, and 512-bit) discussed in Section 7.2.
The datapath for each key length is found to be susceptible to fault injection. The
probability of successfully extracting the key is maximized when activating roughly
11k-12k wasters and beyond this number of wasters the board typically resets during
an attack.

7.4 Conclusion

In this chapter, we showed the breadth of the risks involved in FPGA multitenancy by performing a power-based fault injection attack on an RSA cryptosystem.
The attack reveals the private key without requiring synchronization with specific
encryption rounds or embedding Trojans within the core. This attack is performed
remotely and does not require physical access to the device.

CHAPTER 8
POWER WASTERS FOR CLOUD FPGA ATTACKS

In this dissertation and other related works, it has been conclusively shown that
ring oscillators (ROs) can disturb the FPGA supply voltage to induce timing faults [39,
48,59] and/or board resets [26]. Since the construction of ROs deviates from the synchronous design principles used by most design logic, they could be easily identified
by diagnostic tools searching for malicious circuits. An open question is whether a
more “common” circuit structure, without extremely high frequency clocks or short
oscillation paths, can also be used in on-chip FPGA power attacks.
In this chapter, we introduce a new power wasting circuit that is based on a
standard AES encryption round. The circuit operates at low frequency and does
not have combinational feedback paths, so it appears very similar to other benign
portions of a user’s logic design. To assess this new power waster, we contrast its
power per basic logic element (BLE) against four competing, previously-described
approaches [40,59,73,94], including three that pass Amazon’s design rule checker [40,
73, 94]. We show that our approach based on low-frequency, single-clock circuitry
can be used to induce timing delay faults in neighboring circuits. The designs are
tailored to Intel Cyclone V, Arria 10 GX, and Stratix 10 FPGAs located on the
Terasic DE1-SoC [77], DE5a-Net [78], and DE10-Pro [79] boards, respectively.
Section 8.1 describes previous power-wasting designs and techniques that bypass
design rule checks, including our new design. The results of the evaluation of the examined designs are presented and discussed in Section 8.2. The chapter is summarized
in Section 8.3.

8.1 Power Wasting Logic Circuits

Dynamic power consumption in FPGAs (Eq. (5.1)) results from logic signals charging and
discharging capacitance C as they switch at frequency $f_{sw}$ between the low and high voltage levels ($V_{DD}$).
Circuits that maximize signal toggling, preferably with low logic resource utilization,
are ideal candidates for wasting power in FPGAs.

8.1.1 RO- and Shift Register-based Power Wasters

Most previous efforts [26,39,48,59] to deliberately waste power in FPGAs as part
of a malicious attack have focused on the use of ring oscillators (ROs), which are easy
to design and build in FPGAs. High-toggle ROs can be efficiently packed into FPGAs
by using individual LUTs as oscillators (Figure 8.1(a)). Up to 20 ROs can be packed
into a single Cyclone V, Arria 10 or Stratix 10 logic array block (LAB). All of these
circuits can be enabled nearly simultaneously through the use of an Enable signal
assigned to a high fanout global network signal. Although ROs are clearly efficient
and have legitimate FPGA uses for voltage [59] and temperature [92] sensing, their
association with malicious attacks makes them a target for cloud FPGA vendors. For
example, the compilation software for Amazon EC2 F1 examines candidate netlists
for ROs and flags them without generating an FPGA bitstream [22, 73]. As a result,
ROs made strictly from LUTs are not a suitable choice for an attacker.
Several researchers have determined that RO-style behavior can be obtained from
FPGA circuits that also contain at least one flip-flop. These types of circuits evade
the combinational loop detector in cloud FPGA compilers (at least for now). Figure 8.1(b) shows an RO alternative based on a high-speed sequential clock generated
from an on-FPGA phase-locked loop (PLL) [40]. This circuit appears more similar
to the standard single-clock sequential circuitry one would typically find in a user
design, although it requires an input clock of hundreds of MHz [40]. To evaluate the
effectiveness of this circuit, the rate of the clock signal triggering the flip-flops should

(a) Single-stage RO-based waster.
(b) RO + flop triggered by a PLL generated clock.
(c) RO + flop triggered by oscillating signal. (Cyclone V/Arria 10 implementation.)
(d) RO + flop triggered by oscillating signal. (Stratix 10 implementation.)
(e) Multiple instances of n-bit shift registers.

Figure 8.1. Designs in (a), (b), (c), and (d) show the three RO-based wasters used
to dissipate dynamic power in Cyclone V / Arria 10 / Stratix 10 devices. Design (e)
shows the shift register-based waster.

be comparable to the oscillation frequency of a combinational RO. The subfigure
shows an adaptive logic module (ALM) implementing two power wasters of this type
clocked using an on-chip PLL.

The need for a high-speed input clock signal generated by a PLL in the power
wasting circuit can be eliminated by rearranging the design input connections (Figure 8.1(b)) to implement a transparent latch or flip-flop triggered by an oscillating
data signal [22,73] (Figure 8.1(c)). Since flip-flops in Cyclone V, Arria 10 and Stratix
10 devices cannot be converted to latches, a flip-flop based design was tested. A
D flip-flop with an active-low asynchronous clear control input (ACLRn) and D input
permanently connected to VCC is used. The Q output of the flip-flop loops back to
itself and drives its inverted clock and ACLRn inputs. Initially, both the clock input
and Q output are low. When the enable signal is asserted, the clock input transitions
from low to high and VCC is clocked to the output Q of the flip-flop. Then, the high
Q output is inverted at the ACLRn input of the flip-flop forcing it to transition to a
logic low, completing one oscillation.
A limitation of this approach is the need to utilize routing resources dedicated to
driving the control signals of the adaptive logic module (ALM). Although Cyclone
V and Arria 10 LABs contain 10 ALMs (20 look-up tables), only two unique clock
signals are supported per LAB [31]. The Stratix 10 LAB also contains 10 ALMs (20
look-up tables), but only a single clock signal is supported [33]. Since each waster of
this type requires a separate clock source, only two wasters can be instantiated in a
Cyclone V/Arria 10 LAB (Figure 8.1(c)) and only one waster in a Stratix 10 LAB
(Figure 8.1(d)). In addition, the wasters illustrated in Figures 8.1(b) and 8.1(c) can
be identified by diagnostic tools searching for short sequential paths [40,51], although
they do currently pass Amazon’s design rule checks (DRCs).
Design scanning for potential malicious circuits can become challenging when
standard circuits are employed for wasting power. Ziener et al. [94] deploy a number
of 16-bit shift registers (Figure 8.1(e)) to shape the power profile of the FPGA and
effectively use them for power watermarking an IP core. Although shift registers are
less effective in wasting power than the RO-based wasters, they are typically coupled


with the functional logic of the IP core which makes them practically indistinguishable
from the rest of the design. Therefore, a malicious user can hide a multitude of shift
registers in an IP core to cause voltage instability.

8.1.2

Exploiting Glitch Power

Signal glitching is known to consume significant power in FPGAs [16]. If not
properly managed, differences in signal arrival times at the inputs of logic gates due to
imbalanced path delays can cause unintentional and unnecessary output transitions.
Studies [42, 43] have shown that glitch power can consume up to 19 % of total power
consumption in some designs. Matas et al. [52] exploited glitch power to crash a
Xilinx UltraScale+ development board by instantiating XOR gates and meticulously
creating timing imbalances at their inputs. This approach uses a considerable portion
of the available FPGA routing resources attached to the outputs of the XOR gates
so that each glitching signal switches a large capacitive load.
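
As a rough illustration of why unequal arrival times at an XOR gate produce extra output transitions, the following minimal sketch counts output toggles under an idealized zero-delay gate model. The function names, the example arrival times, and the load capacitance are hypothetical and are not taken from [52]; the energy estimate uses the standard (1/2)CV^2 figure per transition.

```python
def xor_output_toggles(a_toggle_times, b_toggle_times, a0=0, b0=0):
    """Count output transitions of an ideal XOR gate.

    a_toggle_times / b_toggle_times: times (ns) at which each input flips.
    Inputs that flip at exactly the same time cancel at the output.
    """
    a_times, b_times = set(a_toggle_times), set(b_toggle_times)
    a, b = a0, b0
    out = a ^ b
    toggles = 0
    for t in sorted(a_times | b_times):
        if t in a_times:
            a ^= 1
        if t in b_times:
            b ^= 1
        new_out = a ^ b
        if new_out != out:
            toggles += 1
            out = new_out
    return toggles


def switching_energy_joules(toggles, c_load_farads, vdd_volts):
    """Dynamic energy of `toggles` transitions on a capacitive load."""
    return toggles * 0.5 * c_load_farads * vdd_volts ** 2


# Balanced paths: both inputs flip together, so the output never moves.
balanced = xor_output_toggles([1.0], [1.0])
# Imbalanced paths: a 0.4 ns skew creates a glitch pulse (two transitions).
imbalanced = xor_output_toggles([1.0], [1.4])
# Hypothetical 1 pF routing load at a 0.9 V core voltage.
glitch_energy = switching_energy_joules(imbalanced, 1e-12, 0.9)
```

A larger routed load raises `c_load_farads`, which is why the attack described above attaches long routing to each XOR output.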

8.1.3

AES-based Power Waster

Our new power wasting circuit is shown in Figure 8.2(a). This circuit has the basic
structure of a standard 128-bit advanced encryption standard (AES) circuit, although
it does not perform encryption or any other useful function beyond wasting power.
Unlike a standard 128-bit AES circuit that has 10 rounds, in our circuit rounds can
be replicated to form a chain of a user-selected number of rounds. The structure of a
round (Figure 8.2(b)) includes S-boxes (effectively 8-bit to 8-bit lookup tables, shown
as S in the figure), shift rows (wire shuffling with no logic needed), mix columns, and
XOR gates. Between rounds, an additional XOR gate has been added along with
feed-forward paths to enhance glitching through timing imbalance. Before enabling
the waster, an arbitrarily chosen 128-bit value (e.g., 44) is loaded into the input register
that feeds both the key and data inputs of the first round (Figure 8.2(a)). The design
feeds the output of the XOR gate of the first round to the input register to enhance
power consumption. Our unrolled version of the circuit causes increased glitching in
the later rounds.

(a) N chained 128-bit AES rounds.
(b) Structure of a single 128-bit AES round.

Figure 8.2. Design in (a) shows our unrolled waster based on glitching that uses
copies of AES encryption rounds. (b) shows the structure of a standard 128-bit AES
round used in our design.

Our new circuit can waste power effectively using a modest clock frequency of ≤
50 MHz and does not require extensive hand tuning of delay paths to operate. From
a structural standpoint, neither high-speed clocks, combinational loops, nor short
sequential feedback paths are needed. To avoid being flagged for timing violations,
the long combinational paths formed in the chained rounds can be marked as false
paths that should be ignored for timing closure. The additional XOR gates inserted
between rounds can be embedded in LUTs and masked with other logic. To locate
this circuit (or one of its many variants) in a user design, a DRC checker must now
consider the logic function of the circuit and not just its topographic structure to
identify malicious intent.


8.2
8.2.1

Evaluation of Power Wasting Circuits
Evaluation Methodology

To measure power consumption in the Cyclone V device, a modified DE1-SoC
board powered by a Keysight E36312A benchtop power supply was used. The on-board voltage regulator and inductor were desoldered from the board and the 1.1 V
FPGA core voltage input was connected to the power supply. Power consumption
in the Arria 10 device was measured using an unmodified DE5a-Net board via an
on-board Texas Instruments INA231 power monitor chip on the 12 V supply. The
chip measures the total power consumption of the board. Similarly, power consumption in the Stratix 10 device was measured using an unmodified DE10-Pro board via
an on-board Linear Technology LTC2945 power monitor chip on the 0.9 V FPGA
supply. The LTC2945 chip measures the total power consumption of the FPGA device. In each case, incremental changes in measured power are attributed to the power wasting
circuitry on the FPGA.

8.2.2

Power Waster Comparison

The FPGA resources used by the AES-based power wasters are shown in Table 8.1.
Clearly, the amount of logic needed for the circuits is more than a single RO (one
LUT). However, previous work [39, 59] has shown that, typically, thousands of ring
oscillators are needed to perform a voltage attack.
To assess the effectiveness of AES-based power wasters, we contrast the five
wasters discussed in this chapter in Table 8.2. One AES-based waster and 2,000
(Cyclone V), 6,000 (Arria 10), and 10,000 (Stratix 10) RO- and shift register-based
wasters were used to generate the entries in the table. The AES-based circuit used to
generate the results contained 10, 58, and 95 rounds in the Cyclone V, Arria 10, and
Stratix 10 devices, respectively. In Cyclone V and Arria 10 devices, the unclocked RO
circuits (Figures 8.1(a) and (c)) oscillate at frequencies greater than 700 MHz. The
RO+flop (Figure 8.1(b)), shift registers (Figure 8.1(e)), and AES-based circuits (Figure 8.2(a)) are clocked at 50 MHz and 700 MHz to generate comparative results. In
the Stratix 10 device, the unclocked RO circuits oscillate at frequencies greater than
990 MHz. Therefore, the Stratix 10 implementations of the RO+flop, shift registers,
and AES-based circuits are clocked at 50 MHz and 990 MHz to generate comparative
results. Results are represented in dynamic power dissipated per basic logic element,
BLE, that includes a LUT and two flip-flops.

Table 8.1. Resources used in AES-based waster and corresponding reported Fmax
in Cyclone V, Arria 10, and Stratix 10 devices.

(a) Intel Cyclone V SE (5CSEMA5F31C6) FPGA.

Chained Rounds    LUTs (Avail.: 64,140)    Fmax [MHz]
1                 1,032 (1.6%)             135
10                8,711 (13.6%)            15
15                14,926 (23.3%)           8
20                19,918 (31.1%)           5

(b) Intel Arria 10 GX (10AX115N2F45E1SG) FPGA.

Chained Rounds    LUTs (Avail.: 854,400)   Fmax [MHz]
1                 1,015 (0.1%)             226
20                19,876 (2.3%)            11
40                44,818 (5.2%)            4
58                67,938 (8.0%)            2

(c) Intel Stratix 10 SX (1SX280HU2F50E1VG) FPGA.

Chained Rounds    LUTs (Avail.: 1,866k)    Fmax [MHz]
1                 646 (<0.1%)              208
20                19,999 (1.1%)            12
40                39,847 (2.1%)            6
60                59,695 (3.2%)            4
80                79,543 (4.3%)            3
100               99,391 (5.3%)            2
120               119,239 (6.4%)           2
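
The per-BLE figure can be read as the incremental power (waster enabled minus waster disabled) divided by the number of BLEs the waster occupies. The following minimal sketch shows that arithmetic; the function name and the example measurements are hypothetical and do not correspond to any entry in the tables of this chapter.

```python
def power_per_ble(p_idle_w, p_active_w, num_bles):
    """Incremental dynamic power per BLE, in watts.

    p_idle_w:   measured power with the waster disabled
    p_active_w: measured power with the waster enabled
    num_bles:   number of BLEs occupied by the waster
    """
    if num_bles <= 0:
        raise ValueError("num_bles must be positive")
    return (p_active_w - p_idle_w) / num_bles


# Hypothetical example: 2.0 W idle, 5.0 W active, 10,000 BLEs in use.
example_w = power_per_ble(2.0, 5.0, 10_000)  # 3.0 W spread over 10,000 BLEs
```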

Table 8.2. Power increase per BLE for the five power wasting designs shown in
Figures 8.1 and 8.2, in Cyclone V, Arria 10, and Stratix 10 devices.

Device       PLL     Shift Reg.      RO+flop         AES-based       RO+flop                 RO
             [MHz]   (Fig. 8.1(e))   (Fig. 8.1(b))   (Fig. 8.2(a))   (Figs 8.1(c) and (d))   (Fig. 8.1(a))
Cyclone V    50      0.02            0.04            0.30            0.88                    0.65
             700     0.13            0.41            0.45
Arria 10     50      0.02            0.10            0.91            1.92                    1.82
             700     0.25            0.86            0.94
Stratix 10   50      0.01            0.06            0.79            2.04                    2.58
             990     0.22            1.08            0.83

As one might expect, the unclocked RO-based wasters (Figures 8.1(c), (d), and (a))
are more efficient in consuming power than the other three approaches. Although the
RO+flop design (RO clock, Figure 8.1(c) and (d)) consumes more power than the
AES-based waster, its implementation is restricted to two instances (Cyclone V and
Arria 10) and a single instance (Stratix 10) per LAB, leaving the remainder of the
LAB unusable. The AES-based waster (Figure 8.2(a)) outperforms its shift register-based counterpart (Figure 8.1(e)) and is competitive with the RO+flop design (PLL
clock, Figure 8.1(b)) at high frequency. At low frequency (e.g., 50 MHz), although the
AES-based waster operates at the same frequency as the two PLL clocked wasters,
it consumes considerably more power and it has a logic structure that is similar to
clocked user logic that requires a modest clock frequency. These characteristics make
it more difficult for a design rule checker looking for malicious circuits to locate it.
The power-wasting effects of glitching can be seen more clearly if the overall dynamic power increase is considered across a range of AES-based circuits with increasing round counts. Figures 8.3(a), (b), and (c) show the dynamic power consumed by
the circuits for the Cyclone V, Arria 10, and Stratix 10 devices. Measurements in
the Cyclone V device end at 14 rounds because the 5 A current limit is reached on
the benchtop power supply used to power the modified board. Power consumption
in Arria 10 and Stratix 10 devices was measured using the on-board power monitor
chips. The graphs show that the 50 MHz and 4 MHz curves converge as the number
of rounds increases. This suggests that the AES-based waster can reach a specific
power consumption independently of the frequency used to clock its input register.

(a) Cyclone V
(b) Arria 10
(c) Stratix 10 (annotated "Board Crashes")

Figure 8.3. Power consumption while increasing the number of chained 128-bit AES
rounds in Cyclone V, Arria 10, and Stratix 10 devices. Each panel plots Power Increase [W]
against Number of AES Rounds for fpw = 50 MHz and fpw = 4 MHz.

Similar to the activation of 30,000 RO-based wasters discussed in Section 5.2.1,
in the Stratix 10 device, activating more than 95 rounds results in total power consumption that exceeds 85 W and makes the device inaccessible after the completion
of the experiment. With 120 rounds, the DE10-Pro board crashes right after turning
on the waster and no power measurement data can be extracted. All the experiments
with configurations containing more than 100 rounds required a manual power cycle
of the board in order to restore operation.

8.2.3

Fault Generation with AES-based Power Wasters

In this section, we describe an attempt to induce delay faults in a ripple carry
adder located adjacent to the power-wasting circuit in Cyclone V and Arria 10 devices.
When the same experimental setup was deployed and tested on the Stratix 10 device,
no delay faults were observed. A script generates
vectors that sensitize paths with slack ranging from +3 ns to −3 ns in the Cyclone V
device and from +0.3 ns to −0.3 ns in the Arria 10 device. The timing slack of each
path in an adder instance is reported using the TimeQuest Timing Analyzer [29].
The slow 1100 mV 85 ◦C model is used for the Cyclone V implementation of the
adder and the slow 900 mV 100 ◦C model for the Arria 10. The vectors are repeatedly
applied for 200 µs during the power attacks and a log is kept with the faults and
their timestamps. For these experiments, both the Cyclone V and Arria 10 boards
were unmodified (e.g., the Cyclone V based DE1-SoC board was not powered by an
external power supply, but instead used the on-board regulator and inductor).
Figure 8.4 shows the faults that occur from the attack. The X and Y coordinates
of each point denote the time and reported slack of the path on which the fault
occurred. Every point on the plot depicts the capture of an incorrect result. Paths
with more slack are less susceptible to delay faults. Red points denote faults on paths
with positive slack, which are paths that meet timing constraints according to the
conservative timing model. Blue points originate from paths that have negative slack
according to the conservative timing model, but are error free in the absence of an
attack when repeatedly sensitized for 200 µs prior to the activation of the wasters.

(a) Cyclone V: 20 AES rounds
(b) Arria 10: 58 AES rounds

Figure 8.4. Causing delay faults on adder circuits placed outside the wasting area
when the adversary at time 0 turns on 20 and 58 128-bit unrolled, chained AES rounds
clocked at 50 MHz in Cyclone V and Arria 10 devices, respectively. X-coordinate
denotes the time the fault occurred during the attack. Y-coordinate is the reported
timing slack of the exercised path.

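
The red/blue split described above can be reproduced from a fault log with a few lines of post-processing. The following is a minimal sketch, assuming each log entry records a timestamp and the reported slack of the sensitized path; the `Fault` record, the field names, and the sample values are hypothetical and are not the format used in our experiments.

```python
from dataclasses import dataclass


@dataclass
class Fault:
    time_us: float   # time of the incorrect capture, relative to waster activation
    slack_ns: float  # reported timing slack of the exercised path


def summarize_faults(faults):
    """Split faults by slack sign and report the time window in which they occur."""
    positive = [f for f in faults if f.slack_ns >= 0]  # paths that met timing ("red")
    negative = [f for f in faults if f.slack_ns < 0]   # paths that missed timing ("blue")
    window = None
    if faults:
        window = (min(f.time_us for f in faults), max(f.time_us for f in faults))
    return {"positive_slack": len(positive),
            "negative_slack": len(negative),
            "window_us": window}


# Hypothetical log entries, for illustration only.
log = [Fault(0.4, 0.10), Fault(1.2, -0.05), Fault(12.5, -0.20), Fault(3.0, 0.02)]
summary = summarize_faults(log)
```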
The results indicate that in both devices there is a significant time period in
which faults occur (e.g., 0 µs to 30 µs for the Cyclone V and 0 µs to 15 µs for the
Arria 10 devices). The activation of the wasters creates a combined L(di/dt) and
iR voltage drop that causes the core voltage to fluctuate [59]. Then, the inductive
effect gradually diminishes allowing the core voltage to settle to a stable value. In
the Arria 10 experiment (Figure 8.4(b)), steady-state is reached approximately 20 µs
after activating the wasters beyond which point no faults are observed. The results
also show a peak of faults immediately following the enabling of the wasters. These


faults are attributed to the initial response of the PDN to the sudden activation of the
wasters. This activation led to a large but brief voltage drop, an effect that was also
observed by Zick et al. [93] during experimentation with a Xilinx Kintex-7 device.
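
To make the two contributions to the droop concrete, the following minimal sketch evaluates the combined L(di/dt) and iR drop for a lumped PDN model. The lumped model itself and every numeric value below are assumptions chosen for illustration; they are not measured parameters of any board used in this work.

```python
def pdn_voltage_drop(l_henry, r_ohm, delta_i_amp, ramp_time_s, steady_i_amp):
    """Return (inductive, resistive, total) voltage drop for a lumped PDN.

    The inductive term L*(di/dt) exists only while the current is ramping;
    the resistive term i*R persists in steady state.
    """
    inductive = l_henry * (delta_i_amp / ramp_time_s)
    resistive = steady_i_amp * r_ohm
    return inductive, resistive, inductive + resistive


# Hypothetical values: 50 pH effective inductance, 2 mOhm resistance,
# and a 20 A current step that ramps over 100 ns.
inductive_v, resistive_v, total_v = pdn_voltage_drop(50e-12, 2e-3, 20.0, 100e-9, 20.0)
```

In this simple model the transient term vanishes once the current stops changing, which matches the observation that faults cluster right after the wasters are enabled.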

8.3

Summary

In this chapter, we introduced a new power wasting circuit built from standard
AES encryption rounds. The circuit passes Amazon EC2 design rule checks and
operates effectively as a power waster, even at low clock speeds. We demonstrate
that the circuit can induce faults in ripple carry adders for FPGAs from two Intel
device families.


CHAPTER 9
MONITORING SYSTEM FOR PDN ATTACKS

This dissertation work and recent research [25, 91, 93] have highlighted the importance of developing active defense strategies to mitigate voltage attacks in FPGA
devices. Defense approaches include the off-chip tracking of FPGA power consumption and on-chip voltage monitoring [59, 93]. In this chapter, we examine the use of
a network of small on-chip voltage sensors to identify the location of an attack on
the FPGA die. Then, we advance the application of the sensor-based solution by
integrating a monitoring system onto the Stratix 10 device that utilizes sensor data
to enable the real-time detection and mitigation of voltage attacks.
Section 9.1 describes our methodology for localizing PDN disturbances using an
on-chip voltage sensor network. In Section 9.2, we describe the principles of our on-chip monitoring system, detail our Stratix 10 prototype, and evaluate the approach.
The chapter is summarized in Section 9.3.

9.1

Localizing Voltage Droops

PDN attacks require power consumption, transiently or in steady-state, beyond
what the power distribution network can handle. Our results have shown that the
power consumption of one adversarial block can cause a measurable and significant
difference in the voltage of other blocks. Circuits closest to the power consumption
experience the largest voltage drop, and the voltage drop becomes smaller moving
farther away (Figure 5.6). The voltage gradients effectively provide a map pointing
toward the center of the attack, which will have the lowest voltage. A spatially

distributed network of voltage sensors can enable a resource manager to monitor
voltage gradients and identify the source of any attacks that occur. The resource
manager can then prevent further instances of the same attack by taking the offending
application offline, or banning it from co-tenant settings.

9.1.1

Monitor Network

Networks of 46, 132, and 218 sensors in Cyclone V, Arria 10, and Stratix 10
devices, respectively, monitor voltage fluctuations and log the resulting data. The
area costs of the monitor networks are given in Table 9.1. In all three devices, each
sensor uses 20 ALMs and 20 flip-flops. In the Cyclone V device, the 46 sensors
collectively consume 2.9% of the ALMs. In Arria 10 and Stratix 10 devices, the
132 and 218 sensors, respectively, consume less than 1% of their ALMs. All three
implementations use less than 1% of the flip-flops on the chip. Table 9.1 also shows the
resources required for the controller logic that logs the sensor data to memory in
each device.
Figure 9.1 shows the voltage contours of the three devices based on sensor data
collected during two different power attacks. The data values used to generate each
plot include the minimum value observed by each sensor during the 500 µs period
of the attack. A cubic interpolation algorithm reconstructed the smoothed voltage
contours from the samples collected at the discrete sensor locations.
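
The two processing steps above (per-sensor minimum, then interpolation between sensor locations) can be sketched as follows. Note that this sketch substitutes inverse-distance weighting for the cubic interpolation used to produce the figures, purely to stay within the Python standard library; with SciPy available, `scipy.interpolate.griddata` with `method="cubic"` would be the closer match. Sensor names, coordinates, and samples are hypothetical.

```python
def min_per_sensor(traces):
    """Reduce each sensor's trace to the minimum voltage seen during the attack."""
    return {sensor: min(samples) for sensor, samples in traces.items()}


def idw_voltage(points, x, y, power=2.0):
    """Inverse-distance-weighted voltage estimate at (x, y).

    points: list of ((px, py), voltage) pairs at the sensor locations.
    This is a stand-in for cubic interpolation, not the method in the text.
    """
    num = 0.0
    den = 0.0
    for (px, py), v in points:
        d2 = (px - x) ** 2 + (py - y) ** 2
        if d2 == 0.0:
            return v  # query point coincides with a sensor
        w = 1.0 / d2 ** (power / 2.0)
        num += w * v
        den += w
    return num / den


# Hypothetical traces (volts) and sensor positions (LAB coordinates).
traces = {"s0": [0.90, 0.85, 0.88], "s1": [0.90, 0.89, 0.895]}
positions = {"s0": (0.0, 0.0), "s1": (10.0, 0.0)}
minima = min_per_sensor(traces)
points = [(positions[s], v) for s, v in minima.items()]
midpoint_v = idw_voltage(points, 5.0, 0.0)
```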
The two power attacks on each chip vary in the magnitude of the PDN disturbance
and the location of the attacker on the chip. The details of each attack are shown
in Table 9.2. For the Cyclone V device, as denoted on the voltage contour lines in
Figure 9.1(a), when the attacker turns on 12,000 RO-based power wasters, the voltage
at the center of the attack drops below 825 mV, and the voltage at the farthest corner
of the FPGA drops to 975 mV. In Figure 9.1(b), the use of 3,200 wasters drops the
voltage below 975 mV near the attack and the voltage at the farthest corner of the
FPGA remains above 1.050 V.

Table 9.1. Resources used in voltage monitoring network for various numbers of
sensors for the three selected devices (Cyclone V SE (5CSEMA5F31C6), Arria 10
GX (10AX115N2F45E1SG), and Stratix 10 SX (1SX280HU2F50E1VG)).

Device       Number of RO sensors   ALMs           Flip-flops
Cyclone V    20                     400 (1.2%)     400 (<1%)
             46                     920 (2.9%)     920 (<1%)
             Controller             430 (1.3%)     111 (<1%)
Arria 10     64                     1,280 (<1%)    1,280 (<1%)
             132                    2,640 (<1%)    2,640 (<1%)
             Controller             1,008 (<1%)    134 (<1%)
Stratix 10   128                    2,560 (<1%)    2,560 (<1%)
             218                    4,360 (<1%)    4,360 (<1%)
             Controller             1,548 (<1%)    154 (<1%)

Table 9.2. Attack scenarios used for evaluating the monitor network with the RO-based power waster.

Device       Allocated Area          Number of      Type of
             [rows by cols LABs]     RO instances   Attack
Cyclone V    32 x 44 (1,408)         12,000         Strong
             20 x 20 (400)           3,200          Weak
Arria 10     168 x 68 (11,424)       28,000         Strong
             56 x 64 (3,584)         8,000          Weak
Stratix 10   104 x 64 (6,656)        30,000         Strong
             56 x 64 (3,584)         8,000          Weak

In the Arria 10 device, the 28,000 RO-based power wasters drop the voltage to
767 mV at the center of the strong attack (Figure 9.1(c)), while the voltage in more
than half of the FPGA fabric is 100 mV below the nominal 0.9 V. In the weak attack
(Figure 9.1(d)), the 8,000 wasters have a lower impact, dropping the voltage below
862 mV only in the vicinity of the attacker. In the Stratix 10 device (Figure 9.1(e)),
the attacker turns on 30,000 power wasters within an area spanning 6,656 LABs (104
rows by 64 columns), shown in dark blue in the figure. The voltage at the center
of the attack drops below 800 mV while the voltage in almost half of the FPGA
fabric is 48 mV below the nominal 0.9 V. In the weak attack (Figure 9.1(f)), enabling
8,000 wasters drops the voltage below 881 mV near the attack and the voltage at the
farthest corner of the FPGA remains above 904 mV.

(a) Cycl.V: 12k PWs
(b) Cycl.V: 3.2k PWs
(c) Arria10: 28k PWs
(d) Arria10: 8k PWs
(e) Stratix10: 30k PWs
(f) Stratix10: 8k PWs

Figure 9.1. Map of voltage contours on chip during power attacks, reconstructed
from sensor data. Purple rectangle denotes location of the attacker’s power waster
circuits. Orange/red rectangles are the sensors.

9.1.2

Attack Attribution

A goal for the monitoring network is to determine the source of any attacks that
occur. In the case of PDN attacks, the attacker cannot easily mask their identity
because of the spatial extent of the voltage drops that they cause. Here, we evaluate
the number of sensors required to find an attacker based on voltage contours. For
selected Cyclone V and Arria 10 attack scenarios (Table 9.2) examined in
Section 9.1.1, we consider how precisely the attacker can be located using 10, 20, 30,
or 40 of the 46 sensors in the Cyclone V device and 32, 64, 96, or 128 of the 132
sensors in the Arria 10 device. Reducing the number of used sensors reduces the cost
of the monitoring network. In Section 9.1.3, we repeat the analysis for the Stratix
10 device using the AES-based waster. For each sensor count, we randomly choose
100 different subsets containing that number of sensors, and from each subset, try to
predict the location of the attacker.
Figure 9.2 shows the results of this analysis. The dots on each plot are 100
predictions of the attacker location. As one might expect, the chance of successfully
locating the attack increases with the number of sensors. In the Cyclone V device
(Figure 9.2(a) and (b)), predictions based on 20 or more sensors converge to a location
within the attacking circuit. Similarly, in the Arria 10 device (Figure 9.2(c) and (d)),
using 64 or more sensors causes predictions to converge to a location within the attack
area. These results show that a network of monitors can locate the attacker with fewer
than 46 sensors in the Cyclone V device and with fewer than 132 sensors in the Arria
10 device. The overall low hardware overhead of the monitoring system should not
interfere with the design of other circuits.
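
The subset experiment can be sketched in a few lines. The estimator below (a voltage-drop-weighted centroid) is an assumption made for illustration; the text does not prescribe it, and the synthetic droop model, grid size, and attack position are likewise hypothetical.

```python
import math
import random

NOMINAL_V = 0.9  # nominal core voltage quoted in the text for Arria 10 / Stratix 10


def predict_center(readings, nominal_v=NOMINAL_V):
    """Estimate the attack center as the centroid of sensors weighted by droop.

    readings: list of (x, y, min_voltage) tuples for the sensors in the subset.
    """
    weights = [max(nominal_v - v, 0.0) for (_, _, v) in readings]
    total = sum(weights)
    if total == 0.0:
        return None  # this subset observed no droop
    x = sum(w * r[0] for w, r in zip(weights, readings)) / total
    y = sum(w * r[1] for w, r in zip(weights, readings)) / total
    return (x, y)


def subset_predictions(readings, k, trials=100, seed=0):
    """Draw `trials` random k-sensor subsets and predict a center from each."""
    rng = random.Random(seed)
    return [predict_center(rng.sample(readings, k)) for _ in range(trials)]


def synthetic_readings(attack_xy=(30.0, 30.0), size=100, step=10):
    """Hypothetical sensor grid whose droop decays with distance from the attack."""
    readings = []
    for x in range(0, size + 1, step):
        for y in range(0, size + 1, step):
            d = math.hypot(x - attack_xy[0], y - attack_xy[1])
            readings.append((x, y, NOMINAL_V - 0.1 / (1.0 + d / 10.0)))
    return readings


readings = synthetic_readings()
predictions = subset_predictions(readings, k=20)
```

Plotting `predictions` for several values of `k` gives scatter plots of the same kind as the figure discussed above, although a centroid estimator is biased toward the middle of the sensor grid.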

(a) Cyclone V: locating attack that uses 12,000 power wasters (subplots for 10, 20, 30, and 40 sensors).
(b) Cyclone V: locating attack that uses 3,200 power wasters (subplots for 10, 20, 30, and 40 sensors).
(c) Arria 10: locating attack that uses 28,000 power wasters (subplots for 32, 64, 96, and 128 sensors).
(d) Arria 10: locating attack that uses 8,000 power wasters (subplots for 32, 64, 96, and 128 sensors).

Figure 9.2. Marks represent predicted center of attack based on a randomly selected subset of sensors. Each subplot contains
100 points.

9.1.3

Stratix 10 Evaluation using an AES Power Waster

In this section, we evaluate the sensor network using a 95-round AES-based waster
(Section 8.1.3). For experimentation, we allocate an area of 16,384 LABs (128 rows
by 128 columns) to the waster which corresponds to approximately 15% of the total
Stratix 10 FPGA area. The area difference for the two waster types (RO- and AES-based) is related to the relative difference in power-wasting efficiency (Table 8.2). The
AES-based waster requires more area to consume a similar amount of power.
Similar to Figure 9.1, Figure 9.3(a) shows the voltage contours of the chip based
on sensor data collected during the activation of the 95-round AES-based waster. The
voltage at the center of the attack drops below 827 mV while the voltage in almost
half of the FPGA fabric is 42 mV below the nominal 0.9 V. Figure 9.3(b) plots voltage
over time captured by the sensors located at LAB column 50 where the disturbance
is a maximum.
Figure 9.4 shows voltage plotted against distance from the center of the attack on
the Stratix 10 device. The red line in the figure was generated by plotting the minimum voltages captured by the sensors situated at the LAB row which experiences the
maximum drop in Figure 9.3(a). Similarly, the black line was constructed with sensor
data from the LAB row located nearest to the peak of the disturbance in Figure 9.1(e).
Although the enabling of the two wasters results in similar power consumption, the
RO-based waster leads to a steeper gradient because power consumption is denser in
the attack area.

9.1.3.1

Minimizing the Number of Sensors

For resource efficiency, it is desirable to determine the minimum number of sensors
required to localize an attacker occupying a certain LAB area. Assuming an AES-based attack scenario of 16,384 LABs (128 rows by 128 columns) used by waster
logic, we examine how precisely the attacker can be located using varying numbers of
sensors. As in Section 9.1.2, for each sensor count, we randomly select 100 different
subsets containing the selected number of sensors, and from each subset we attempt
to predict the location of the attack circuitry.

(a) Activating the AES-based waster containing 95 rounds in Stratix 10. (Voltage contours over LAB rows and LAB columns.)
(b) Voltage reported by the on-chip sensors situated at LAB row 50. (Voltage [V] and Normalized RO Counts against Time [µs], for sensors between 4 and 193 LAB columns away.)

Figure 9.3. Activation of the 95-round AES-based waster in Stratix 10.

The results of this analysis are shown in Figure 9.5. The black dots on each
plot are 100 different predictions of the attacker location. We found that when we
utilize data from all 218 sensors of the network, all predictions converge to a specific
location inside the attack area (the bottom, rightmost subplot of Figure 9.5). The
epicenter of the voltage disruption and the topological center of the attacker area
might not precisely coincide. Using this result as a best case, we evaluated results
for a sensor count of ten and then increased sensor counts in increments of ten. As
one might expect, utilizing data collected from more sensors increases the precision
of the predictions. Using only 10 sensors leads to predictions that point outside of
the attacking circuit. Predictions start to point to locations inside the attack area
when 50 or more sensors are used.

Figure 9.4. Voltage change across distance for the RO- and AES-based wasters.
(Voltage [V] against distance to center of PW [LAB columns] for 30k ROs (30k LUTs)
and 95 AES rounds (94k LUTs); the edge of each PW is marked.)

To quantify the error of each number of sensors, we use the Euclidean distance
between the mean of the 100 location predictions of a given subset of sensors and the
mean of the 100 location predictions that utilize all 218 sensors. The distance errors
for each configuration are expressed in LABs and shown in Table 9.3. The 70-sensor
predictions, for example, on average predict the center of the attack to be 14 LABs
away from the predicted center when all 218 sensors are used.
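
The error metric described above is a plain Euclidean distance between two mean points. A minimal sketch follows; the function names and the sample coordinates are hypothetical.

```python
import math


def mean_point(points):
    """Arithmetic mean of a list of (x, y) predictions."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def localization_error(subset_predictions, full_predictions):
    """Distance (in LABs) between the mean subset prediction and the mean
    prediction obtained with the full sensor network."""
    mx, my = mean_point(subset_predictions)
    fx, fy = mean_point(full_predictions)
    return math.hypot(mx - fx, my - fy)


# Hypothetical predictions in LAB coordinates.
error_labs = localization_error([(0.0, 0.0), (6.0, 8.0)], [(0.0, 0.0)])
```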

(Subplots for 10 to 150 sensors in increments of ten, and for all 218 sensors.)

Figure 9.5. Locating the attacker with the minimum number of sensors required.
Black marks represent the predicted center of attack area based on a randomly selected subset out of the total 218 sensors. Each subplot contains 100 points. Note
that although the predictions converge to a specific location when all the 218 sensors
are used (lower-right corner of figure) the center of the disruption and the center of
the attacker area may not coincide.

9.2

On-chip Monitoring and Attack Throttling

In this section, we enhance the voltage sensing network described in Section 9.1
to include direct remediation to prevent a voltage attack. The network is augmented
with processing capabilities and clock throttling circuitry for suspected power wasting
regions. Our solution is evaluated using an AES-based waster that attempts to crash
the board with the FPGA device. We show that our system is able to respond quickly
enough to prevent the FPGA-based board from going into reset by throttling the clock
attached to the power waster without affecting surrounding user logic.

Table 9.3. Absolute error expressed in LABs of the Euclidean distance between
the mean of the 100 location predictions of a given sensor subset and the mean of
the 100 location predictions obtained when all 218 sensors are used (see the bottom,
rightmost subplot of Figure 9.5).

Num. of sensors   10  20  30  40  50  60  70  80  90  100  110  120  130  140  150
Absolute error    32  26  24  23  21  20  14  14  11   11   10   10    7    5    5

Our monitoring and attack throttling approach consists of both hardware and
software components. The steps involved in this system include:
1. Our system periodically collects on-FPGA voltage values from the voltage sensor
network. Multiple sensors are assigned to each region in the FPGA. These values
are passed to a microprocessor.
2. The processor compares the values to a pre-determined threshold to determine
if an attack is potentially in progress.
3. If voltage drop in a region is more than an acceptable threshold, the clock buffer
to the associated region is deactivated, throttling the attack.
The details and effectiveness of our system are examined in the remainder of this
section.
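
The three steps above can be sketched as one iteration of a software monitoring loop. This is a minimal illustration under stated assumptions, not the firmware running on our processor: the function name, the region labels, the averaging of samples, and the threshold values are all hypothetical.

```python
def throttle_step(region_samples, nominal_v, max_drop_v, clock_enabled):
    """One monitoring iteration.

    region_samples: dict mapping region name -> list of sensor voltages (step 1)
    nominal_v:      expected voltage with no attack in progress
    max_drop_v:     acceptable voltage drop before throttling (step 2)
    clock_enabled:  dict mapping region name -> bool, updated in place (step 3)
    """
    for region, samples in region_samples.items():
        avg_v = sum(samples) / len(samples)
        if nominal_v - avg_v > max_drop_v:
            clock_enabled[region] = False  # deactivate the region's clock buffer
    return clock_enabled


# Hypothetical readings: region B shows a large droop, region A does not.
clocks = throttle_step(
    {"A": [0.89, 0.90, 0.88], "B": [0.80, 0.82, 0.81]},
    nominal_v=0.9,
    max_drop_v=0.05,
    clock_enabled={"A": True, "B": True},
)
```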

9.2.1

System Infrastructure

Figure 9.6 shows an overview of the monitoring system implemented on a Stratix
10 SX (1SX280) FPGA device. The sensor network, a NIOS II-based processor system, and interfacing logic implement the monitoring system. The system isolates
each tenant’s circuitry in a clock region. Each region’s clock buffers are controlled by
the NIOS II processor. This component periodically collects voltage information in
the form of RO counts from the voltage sensors. These values are analyzed to detect
incidents of aggressive power consumption. If the ratio of the average reported RO
counts to expected (nominal) RO counts in a clock region is below a predetermined
threshold, the NIOS II processor disables the clock buffers for that region, forcing all
sequential circuits to freeze.

Figure 9.6. On-chip monitoring system overview.

The RO counts from all sensors of each region are summed by a 13-bit accumulator
unit (i.e., an adder feeding a register whose output then drives one of the two adder
inputs) assigned to each region. The 13-bit sums of each of two regions (e.g., Tenants
A and B and Tenants C and D in Figure 9.6) are grouped together to form 32-bit
words. The grouped sums are then passed to the NIOS II processor via an Avalon
Memory-Mapped (AVMM) bridge. The processor reads and ungroups the 32-bit
words to calculate the averaged RO counts of each region using a hardware-based
floating-point division unit (FPDU). In the Stratix 10 device, both NIOS II processor
and AVMM bridge are clocked at 200 MHz.
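
The grouping of two 13-bit sums into a 32-bit word, and the ratio test applied after ungrouping, can be sketched as follows. The bit positions (one sum in bits 12:0, the other in bits 28:16), the function names, and the numeric values are assumptions made for illustration; the actual word layout is not specified in the text.

```python
MASK13 = (1 << 13) - 1  # a 13-bit field holds values 0..8191


def pack_sums(sum_a, sum_b):
    """Group two 13-bit region sums into one 32-bit word (assumed layout)."""
    return (sum_a & MASK13) | ((sum_b & MASK13) << 16)


def unpack_sums(word):
    """Ungroup a 32-bit word into its two 13-bit region sums."""
    return word & MASK13, (word >> 16) & MASK13


def should_throttle(region_sum, num_sensors, nominal_count, ratio_threshold):
    """True if the average RO count has fallen below the allowed ratio."""
    average = region_sum / num_sensors
    return (average / nominal_count) < ratio_threshold


# Hypothetical values: nine sensors per region, nominal count of 900 per sensor.
word = pack_sums(7200, 8100)
sum_a, sum_b = unpack_sums(word)
throttle_a = should_throttle(sum_a, 9, 900, 0.92)  # average 800, ratio about 0.89
throttle_b = should_throttle(sum_b, 9, 900, 0.92)  # average 900, ratio 1.0
```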


Figure 9.7. The figure illustrates a partitioning of an area spanning 65,536 LABs
(256 rows by 256 columns) into four clock regions that can be used by multi-tenant
applications. Each region contains nine RO-based sensors. Their relative locations
on the device are indicated by colored rhombuses. The four regions span roughly 60%
of the FPGA logic area.

9.2.1.1

Clock Regions and Sensors

The 1SX280 device contains 117,072 LABs (432 rows by 271 columns) of which
111,300 LABs can be used by the designer. The remaining LABs are reserved for
use by support circuitry for the on-chip ARM hard processor system, FPGA I/O
logic, and system peripherals. In our prototype, 65,536 LABs (256 rows by 256
columns), approximately 60% of the available logic, are allocated for use by multi-tenant applications. The 65,536 LABs span four clock regions that contain 16,384
LABs (128 rows by 128 columns) each. These regions can be allocated to four distinct
users. Figure 9.7 depicts the clock regions and their relative locations on the Stratix
10 1SX280 fabric with four rectangles of different colors (i.e., cyan, purple, green, and
red). A total of 36 uniformly placed sensors were used to localize an attacker within
one of the four 16,384-LAB regions. In Figure 9.7, the 36 sensors are shown as colored
rhombuses inside the regions they are assigned to. Each sensor consists of
a 19-stage RO triggering a 13-bit binary counter. The sensors are placed so that the
boundary between two regions is equidistant from sensors in both regions.
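
The arithmetic behind this partitioning can be checked with a small Python sketch. The LAB counts come from the text; the 3 x 3 sensor grid with sensors at the centers of equal cells is an assumption used only to show one placement that keeps a shared boundary equidistant from the nearest sensors on either side.

```python
TOTAL_ROWS, TOTAL_COLS = 432, 271   # LAB grid of the 1SX280 (from the text)
USABLE_LABS = 111_300               # LABs available to the designer
REGION_SIDE = 128                   # each clock region is 128 x 128 LABs
REGIONS = 4
SENSORS_PER_SIDE = 3                # assumed 3 x 3 grid = 9 sensors per region

total_labs = TOTAL_ROWS * TOTAL_COLS
tenant_labs = REGIONS * REGION_SIDE * REGION_SIDE
tenant_share = tenant_labs / USABLE_LABS   # fraction of usable logic


def sensor_offsets(side: int, per_side: int) -> list:
    """Centers of `per_side` equal cells along a region side of length `side`."""
    cell = side / per_side
    return [cell * (k + 0.5) for k in range(per_side)]


offsets = sensor_offsets(REGION_SIDE, SENSORS_PER_SIDE)
# Distance from the last sensor row of one region to the shared boundary,
# and from that boundary to the first sensor row of the neighboring region.
gap_to_boundary = REGION_SIDE - offsets[-1]
gap_from_boundary = offsets[0]
```

Under this assumed placement both gaps equal one sixth of the region side, so a disturbance at the boundary is seen equally by the sensors of both neighbors.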

Table 9.4. FPGA resources used for the NIOS II-based on-chip monitoring system.

Type                            Sensor network    Interfacing logic    NIOS II system
                                (36 sensors)
ALMs (Avail.: 933 k)            576 (<0.1%)       440 (<0.1%)          2,842 (0.3%)
Flip-flops (Avail.: >3.732 M)   576 (<0.1%)       910 (<0.1%)          3,103 (0.1%)
Memory (Avail.: 229 Mibit)      –                 –                    2.08 Mibit (<1%)

Table 9.4 shows the FPGA resources required by the monitoring system. The 36
sensors in the four clock regions collectively consume less than 0.1% of the available
adaptive logic modules (ALMs) and flip-flops in the 1SX280 device. The nine sensors
in each clock region occupy 18 out of the 16,384 LABs in each region. The monitoring system interface to the NIOS II processor (440 ALMs) has minimal hardware
requirements. The interface is located outside of the four clock regions. The NIOS
II system itself has negligible hardware requirements consuming less than 1% of the
available resources.

9.2.1.2

Threshold Calibration

A decision to throttle the clock of a specific clock region requires two pieces of
information: voltage sensor information indicating that an attack is in progress and
the likely location of the attack. As described in Section 5.1.2, the voltage at an
on-FPGA location is approximated by a sensor that counts RO oscillations over a
fixed period of time. A significant drop in RO count below a certain threshold value
indicates voltage instability that can potentially reset the FPGA board. In this work,
we identify a threshold value via calibration that helps identify a potential attack. RO
counts below this threshold are considered attacks requiring immediate remediation.
The calibration method used to determine the threshold for our experiments included
the following steps:

• Using the AES-based waster introduced in Section 8.1.3, we determined the
minimum number of AES rounds required to crash the board (see Figure 8.3(c)).
A waster that contains 91 chained rounds was found to crash the Stratix 10
FPGA-based board in <1% of trials. A waster with 95 or more rounds resulted
in a board crash in all trials.
• In the 1SX280 device under typical operating conditions, the sensors of the
attack region report that the average RO count drops to 92%-93% of the nominal
count when the waster is activated. The upper end of this range, 93% (0.93), was
set as the threshold ratio for further experimentation.
Our experimentation showed that the extracted threshold consistently identified
an attack and no recalibration was necessary during experimentation. However, the
threshold selection process should be adapted when operating conditions change (e.g.,
temperature).

9.2.1.3

Attack Detection and Remediation

Following the collection of voltage sensor RO counts every 5 µs, an algorithm is
executed in software to evaluate possible attacks. A sample period of 5 µs provides
sufficient time to obtain reliable RO counts from the voltage sensors. This sample
period is also brief enough to allow the NIOS II processor to respond to the attack
and suppress it. Pseudocode for the algorithm executed on the processor is shown
in Algorithm 1. The algorithm proceeds as follows. At the beginning (line 1), a
function determines the nominal RO counts of each clock region without activated
wasters by collecting and averaging 10,000 sensor samples over a time period of 50 ms.
The nominal RO counts are then (lines 2 to 4) multiplied by the number of sensors
the regions contain (e.g., 9). The resulting values are used in a subsequent step of the
algorithm to calculate the average reported-to-nominal RO count ratio of a region.
These two steps are executed only once, before the monitoring of the clock regions

Algorithm 1 Monitoring sensors and switching off clock regions upon detecting an attack.
 1: nom_counts[N] ← determine_nominals();          ▷ N is the number of clock regions
 2: for i ← 0 to N − 1 do
 3:     nom_counts[i] ← M · nom_counts[i];          ▷ M is the number of sensors per region
 4: end for
 5: start_sensors();
 6: while True do
 7:     x ← read_sample_when_is_ready();            ▷ it returns the sums of the N regions
 8:     for i ← 0 to N − 1 do
 9:         ▷ calculate the reported-to-nominal RO count ratio of region i
10:         y ← x[i] / nom_counts[i];
11:         if y ≤ threshold_ratio then
12:             disable_clock(i);
13:         end if
14:     end for
15: end while
begins. Next, monitoring begins (line 5) and the execution waits at line 7 for the
sensor data of the first sampling period to become available (e.g., after 5 µs). The RO
counts from each region i are summed to form a combined count x[i]. The algorithm
then calculates the ratio of the reported RO count x[i] to nominal RO count of
each region i (line 10). The resulting values are compared against the threshold ratio
(line 11) determined during calibration. If the reported-to-nominal RO count ratio
of a region is less than or equal to the threshold, the algorithm disables its clock buffer
(line 12). If it is greater than the threshold, no action is taken. When all regions have
been checked, the execution loops back to line 7 where it waits for the next sample.
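
The control flow of Algorithm 1 can be mirrored in a short Python sketch. This is not the software that runs on the NIOS II: the function names are hypothetical, the hardware accesses are replaced by plain Python arguments, the infinite loop is replaced by iteration over a finite list of samples, and the RO counts in the example are made-up values chosen only to exercise the comparison.

```python
from typing import Callable, Iterable, List, Sequence

THRESHOLD_RATIO = 0.93    # threshold ratio from Section 9.2.1.2
SENSORS_PER_REGION = 9    # M in Algorithm 1


def scale_nominals(nominal_per_sensor: Sequence[float],
                   m: int = SENSORS_PER_REGION) -> List[float]:
    """Lines 2-4: multiply each region's nominal count by its sensor count."""
    return [m * n for n in nominal_per_sensor]


def check_sample(sums: Sequence[float], scaled_nominals: Sequence[float],
                 threshold: float = THRESHOLD_RATIO) -> List[int]:
    """Lines 8-14: indices of regions whose ratio is at or below the threshold."""
    return [i for i, (x, nom) in enumerate(zip(sums, scaled_nominals))
            if x / nom <= threshold]


def monitor(samples: Iterable[Sequence[float]],
            scaled_nominals: Sequence[float],
            disable_clock: Callable[[int], None]) -> None:
    """Lines 6-15, driven by a finite iterable instead of `while True`."""
    for sums in samples:
        for region in check_sample(sums, scaled_nominals):
            disable_clock(region)


# Hypothetical data: four regions, nominal count of 500 per sensor.
scaled = scale_nominals([500.0] * 4)
disabled: List[int] = []
monitor([[4500, 4500, 4500, 4500],      # quiet sample, no action
         [4500, 4100, 4490, 4500]],     # region 1 drops to about 91%
        scaled, disabled.append)
# disabled == [1]
```

Scaling the nominal counts once, before the loop, means each iteration needs only one division per region, which matters because division is one of the two dominant costs of the loop.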
Floating-point division operations (line 10) and AVMM read transactions (line 7)
are the most time-consuming operations that take place during the execution of the
while-loop in Algorithm 1. Executing a single division requires 191.94 ns, on average,
while a single AVMM transaction takes 160.56 ns, on average. Without considering
line 12, which is executed only when an attack is detected, the loop within a single
iteration performs two AVMM reads at line 7 and N divisions because of line 10.

(a) Start of the attack

(b) End of the attack

Figure 9.8. Intel SignalTap waveforms capturing the start and end of a voltage
attack attempting to crash the FPGA board using the 95-round AES-based waster
and prevented by the on-chip monitoring system.

Neglecting the single-cycle operations of the loop (e.g., index increment, etc.), a
single iteration of the loop requires approximately 1.09 µs, on average, when N=4.
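
The 1.09 µs estimate follows directly from the two measured latencies. The sketch below reproduces the arithmetic in Python; the constant and function names are illustrative, and the model deliberately ignores the single-cycle operations, as the text does.

```python
AVMM_READ_NS = 160.56     # average AVMM read transaction on the NIOS II
FP_DIV_NS = 191.94        # average floating-point division on the NIOS II
READS_PER_ITERATION = 2   # two AVMM reads at line 7 of Algorithm 1


def iteration_time_ns(n_regions: int,
                      read_ns: float = AVMM_READ_NS,
                      div_ns: float = FP_DIV_NS,
                      reads: int = READS_PER_ITERATION) -> float:
    """Loop time: fixed number of reads plus one division per region."""
    return reads * read_ns + n_regions * div_ns


nios_ns = iteration_time_ns(4)   # 2 * 160.56 + 4 * 191.94 = 1088.88 ns
```

For N=4 the divisions account for roughly 70% of the iteration time, so on the NIOS II the loop is dominated by floating-point work rather than by bus transactions.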
9.2.2

Preventing Board Crashes

To evaluate the effectiveness of our remediation system, we performed a series of
experiments using the Stratix 10 device. The experimental setup included the system
shown in Figures 9.6 and 9.7. A 95-round AES-based waster was placed in the clock
region labeled Tenant B in Figure 9.7. After loading the monitoring application on
the NIOS II processor, Algorithm 1 is executed. Waveforms obtained in real-time
from the Intel SignalTap logic analyzer are shown in Figure 9.8. The clock enable
and clock signals of the four clock regions were controlled by the NIOS II processor
during the test, as shown in the waveforms in Figure 9.8.
Figure 9.8(a) captures the beginning of the attack. At time 0, the monitoring
application enables the four clock regions by asserting the enable signals of the clock
buffers (TenantA: clk_enable through TenantD: clk_enable). At time 10 ns, the clock
buffers are activated by the clk_enable signals and feed the clock signals (TenantA:
clock through TenantD: clock) to the four regions. To initiate the attack, the enable signal
of the waster (AES_waster_enable) is asserted at time 15 ns by user logic, which then
activates the AES-based waster one 50 MHz clock cycle later at time 50 ns. The signal
AES_waster_active is generated by the waster for reference only and is shown
on the waveforms.
Figure 9.8(b) captures the end of the attack and the successful prevention of
a board crash. At time 11.1 µs, the monitoring system responds to the aggressive
power consumption by Tenant B by switching off its clock buffer (TenantB:
clk_enable). The sensors of Tenant B reported an average RO count that is below the
predetermined 93% threshold (Section 9.2.1.2) of their nominal value. The clock
signal of Tenant B is then disabled and the waster is forced to a halt. The response
is fast enough to prevent a device crash in the 21 µs range. The clock signals of the
three remaining tenants continue oscillating. The attack shown in the waveforms
was performed 1,000 times and the system was able to stop the attack for all 1,000
attempts and prevent a board crash. The response time of the system to the attack
is 11.08 µs, on average.
9.2.3

Using the ARM-HPS for Processing Sensor Data

In this subsection, we examine the use of a hardened processor system (HPS) for
processing the sensor data instead of using a NIOS II system which is implemented
with FPGA logic. The Stratix 10 SX devices contain an ARM-based HPS that
consists of four A53 64-bit CPU cores with floating-point units, two levels of cache
memory, and a 256 kB on-chip memory. The monitoring system uses the same interfacing circuitry discussed in Section 9.2.1 and shown in Figure 9.6 except that data
are passed to the ARM-HPS. Specifically, the ARM-HPS accesses the grouped 13-bit
sums via the Avalon Memory-Mapped (AVMM) bridge and the AMBA Lightweight
Advanced eXtensible Interface (LWAXI4). The CPU cores are clocked at 1.2 GHz,
the LWAXI4 bridge is clocked at 400 MHz, and the FPGA logic of the sensor network is clocked at 200 MHz. The ARM-HPS-based system adopts all the other design
specifications detailed in the prior subsections for the NIOS II version. These specifications include the partitioning scheme of the FPGA logic to tenant/clock regions
(Section 9.2.1.1), threshold calibration procedure (Section 9.2.1.2), and remediation
algorithm (Section 9.2.1.3).

Table 9.5. FPGA resources used for the ARM-HPS-based on-chip monitoring system.

Type                            Sensor network    Interfacing logic    ARM-HPS
                                (36 sensors)                           (FPGA portion)
ALMs (Avail.: 933 k)            576 (<0.1%)       440 (<0.1%)          18,363 (2%)
Flip-flops (Avail.: >3.732 M)   576 (<0.1%)       910 (<0.1%)          26,046 (0.7%)
Memory (Avail.: 229 Mibit)      –                 –                    2.25 Mibit (1%)

Table 9.5 shows the FPGA resources required by the ARM-HPS-based monitoring
system. The sensor network and interfacing logic have the same hardware requirements as the NIOS II version (see Table 9.4). The FPGA portion of the ARM-HPS
includes logic required for implementing interfaces to memory and bridge interconnects. Compared to the total hardware resources the NIOS II system requires (2,842
ALMs and 3,103 flip-flops), the FPGA portion of the ARM-HPS consumes considerably more hardware (18,363 ALMs and 26,046 flip-flops). Both ARM-HPS and NIOS
II implementations have similar memory requirements, 2.25 Mibit and 2.08 Mibit, respectively.
As one might expect, the ARM-HPS outperforms the NIOS II in floating-point
operations by completing a single division within 17.8 ns, on average, compared to
the 191.94 ns required by the NIOS II. However, read/write transactions in the ARM-HPS (480 ns) are slower than in the NIOS II (160.56 ns) since they have to undergo
several data bus interface conversions (e.g., AVMM-LWAXI-cache interface). The
ARM-HPS executes a single iteration of the while-loop of Algorithm 1 in 1.03 µs, on
average, when N=4, compared to the 1.09 µs the NIOS II implementation requires.
Both implementations and their corresponding execution times are well below the
target sampling period of 5 µs, which leads to a CPU time usage of less than 22%.
Repeating the attack described in Section 9.2.2 showed that the ARM-HPS-based
monitoring system successfully prevents board reset in all 1,000 trials. The response
time of the system to the attack is 9.95 µs, on average.
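
The loop times and the CPU usage bound quoted above can be reproduced from the measured latencies. The Python sketch below assumes that both processors perform two bus reads and N=4 divisions per iteration, which the text states for the NIOS II and which is taken here to carry over to the ARM-HPS because it runs the same algorithm; the names are illustrative.

```python
SAMPLE_PERIOD_NS = 5_000.0   # 5 µs sampling period

# (bus read latency, floating-point division latency) in ns, from the text
LATENCIES_NS = {
    "NIOS II": (160.56, 191.94),
    "ARM-HPS": (480.0, 17.8),
}


def iteration_time_ns(read_ns: float, div_ns: float,
                      n_regions: int = 4, reads: int = 2) -> float:
    """Loop time: `reads` bus reads plus one division per region."""
    return reads * read_ns + n_regions * div_ns


def cpu_usage_percent(iteration_ns: float,
                      period_ns: float = SAMPLE_PERIOD_NS) -> float:
    """Share of each sampling period spent executing the loop body."""
    return 100.0 * iteration_ns / period_ns


for name, (read_ns, div_ns) in LATENCIES_NS.items():
    t = iteration_time_ns(read_ns, div_ns)
    print(f"{name}: {t / 1000:.2f} us per iteration, "
          f"{cpu_usage_percent(t):.1f}% of the sample period")
# NIOS II: 1.09 us per iteration, 21.8% of the sample period
# ARM-HPS: 1.03 us per iteration, 20.6% of the sample period
```

The near-identical totals hide opposite bottlenecks: the NIOS II spends most of the iteration on divisions, while the ARM-HPS spends most of it on the two bridge reads.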
9.2.4

Limitations

The following points summarize the limitations of our monitoring approach:
1. Clock gating-based mitigation. In the current prototype, clock gating is
used to suppress a potential attacker. This approach is only effective if the
attack circuitry depends on a clock signal for operation (e.g., the AES-based
waster used for this work). It would not be effective for wasters
without globally distributed clocks (e.g., Figure 8.1(c)), although our sensors
would detect the attack and flag the user, possibly preventing future attacks.
2. Local user clocks. Often, a designer uses clock resources (e.g., clock managers,
phase-locked loop modules, etc.) to generate custom clock signals for internal
use in an allocated region. For a secure cloud-based system, the reference clock
driving these clock resources could be provided by global clock buffers controlled
by the monitoring system so it can be disabled when needed. For example, in
AWS F1 instances, user clock signals are generated by a PLL in the shell [7]
and could be suppressed in case of attack.

9.3

Summary

In this chapter, we propose the use of a network of small voltage sensors that
collect voltage information and pass it to a central controller. We demonstrate that
the source of a voltage-altering attack can be easily identified by a small number
of sensors consuming less than 3% of FPGA logic. We have also described a new
on-FPGA remediation approach that collects voltage values from multiple tenants in
real time and throttles the clock to any region suspected of malicious behavior. Our
approach can respond to an attack within 18 µs and successfully prevented a series of
voltage attacks from causing board reset.

CHAPTER 10
CONCLUSION AND FUTURE WORK

This dissertation examined a new monitoring system for embedded microprocessors and multi-tenant FPGA attacks and remediations. The contributions of the work
are summarized below.

10.1

Summary of Contributions

Chapter 3 presented a defense mechanism that uses hardware to monitor the
operation of an embedded Linux operating system. It was shown that by selectively
monitoring only the vulnerable portions of the code (e.g., the system call interface)
the monitor can verify individual processor instructions and ensure correct program
execution. The approach was evaluated in hardware and demonstrated to successfully
prevent a control-flow hijack attack that exploits a vulnerability in Linux system call
code. Low hardware overhead and limited impact on system performance make the
proposed solution suitable for embedded systems.
As FPGAs grow in size and logic capacity, scenarios have emerged in which circuits from multiple designers are deployed on an FPGA at the same time (i.e., multi-tenancy). Recent research [21,67] has identified a crosstalk coupling effect that exists
between the long wires of the FPGA routing channel which can be exploited by a
malicious user to attack a co-located tenant. This type of attack can be performed
remotely as physical access to the FPGA is not necessary. Although the phenomenon
can be easily observed with the use of a simple RO-based sensor, the metric previously
used for quantifying the amount of coupling is sensitive to the design properties of
the sensor itself. Chapter 4 describes a new metric that eliminates sensor-design variabilities from RO-count measurements and accurately quantifies the delay effect. The
new metric was used to characterize coupling on three Intel FPGAs implemented at
different technology nodes. It was shown that the effect is present in all examined devices, although the susceptibility of different FPGA long wire types to leakage varies.
A methodology that constructively uses the crosstalk effect to enable the recovery
of the adjacency information from routing channels was also described. Knowledge
of the relative position of channel wires can help FPGA application designers isolate
and protect signals carrying sensitive data against information leakage attacks.
Unlike multi-core microprocessors, all core logic on contemporary commercial FPGAs shares the same PDN. As shown in Chapter 5, the shared use of the PDN in
a multi-tenant FPGA setting allows voltage attacks that require no physical device
access. Our work characterizes the FPGA PDN response during a voltage attack by measuring
the disturbance magnitude as a function of time, the power consumed by the attacker,
and the position of the victim relative to the attacker. Our analysis is performed for
three Intel FPGA families. In Chapter 6 it was shown that power wasting circuits
in one portion of an FPGA can lead to faults in distant FPGA locations, even for
circuit paths with significant slack. Chapter 7 demonstrates a power-based fault injection attack on an RSA cryptosystem using power wasters. The attack was shown
to successfully expose the private key of the circuit without requiring synchronization
with specific encryption rounds or physical access to the device.
One approach to avoiding multi-tenant voltage attacks is to scan a tenant’s netlist
and flag suspicious circuit structures (e.g., ROs) [40]. Chapter 8 introduces a new
power wasting circuit that is built from standard AES encryption rounds in an effort
to avoid scan detection. The AES-based waster is shown to consume significant power
and is capable of inducing timing faults in victim circuits.

To mitigate voltage attacks, the use of an on-chip voltage sensor network that
monitors the FPGA PDN was proposed in Chapter 9. In the first part of the chapter,
it was shown that the source of a voltage-altering attack can be easily identified
by a small number of sensors consuming less than 3% of the FPGA on-chip logic.
The ability to localize voltage disruptions allows for the selective clock disabling of a
suspicious FPGA logic region. The second part of the chapter describes an on-chip
monitoring system that collects and analyzes voltage information from the sensor
network and disables clocks in regions exhibiting malicious behavior in real time.
The approach was found to successfully prevent a voltage attack that attempts to
reset the FPGA board.

10.2

Future Directions

10.2.1

Microprocessor Monitoring

The selective monitoring technique introduced in Chapter 3 dramatically reduces
the memory overhead of per-instruction hardware monitoring. In the presented prototype, monitoring graphs are stored on chip so the hardware monitor can quickly
access them and avoid processor stalls. However, storing graphs in on-chip memory
bounds the code size that can be monitored and, consequently, the scalability of the
solution. Increasing the on-chip memory is significantly more expensive than adding
off-chip memory. In future work, the use of off-chip memory for graph storage can
be explored as a way to enhance selective monitoring scalability. In this context,
the fast on-chip memory would be used to cache only the portions of the graph that
the hardware monitor requires in the short term. The remaining portions along with
additional graphs would be stored in off-chip memory waiting to be transferred onto
the chip as the execution of the monitored code progresses.

10.2.2

Multi-tenant FPGA Security

Much of the work in this dissertation strives to make FPGA multi-tenancy safer.
The following paragraphs offer directions for future research in this area. Although
the monitoring system described in Section 9.2 can suppress board reset, it is not
fast enough to stop a fault injection attack. Section 9.2.2 showed that the response
time of the monitoring system ranges from 8.7 µs to 17.4 µs. Chapter 6 and Section 8.2.3, however, showed that the sudden activation of a waster causes a large but
brief voltage drop that occurs during the very first microsecond of the attack. These
transients are capable of inducing timing faults in a ripple-carry adder. The response
time of the monitoring system depends on two factors: a) the sample period of the
sensor and b) the time it takes to analyze sensor data. The sampling period in turn
depends on the type of sensor being used. The RO-based sensor used throughout
this dissertation operates best with sampling periods in the range of microseconds.
Although this particular design was found to be effective for localizing voltage disruptions, the detection of fluctuations that are less than a microsecond is out of its
range of capabilities.
To protect against fault injection, a sensor that can detect voltage changes in
the sub-microsecond range is necessary. The results in Chapter 6 and Section 8.2.3
indicate that delay line- or hard carry-chain-based sensors represent a more suitable
option for detecting fast voltage transients. The time it takes to analyze and utilize
the voltage information extracted from the sensors must also be drastically reduced.
Although the use of an on-chip processor for analyzing sensor data simplifies the
design of the monitoring system, it bounds its response time to the microsecond
range. The analysis shown in Section 9.2.1.3 suggests that performing the analysis of
the sensor data on the FPGA hardware directly is necessary to decrease the response
time of the system.

As discussed in Section 9.2.4, the current prototype uses clock throttling to suppress a potential attacker. As this approach is only effective for power wasters that
depend on a clock signal for operation, additional suppression techniques should be
considered. Although not evaluated in this work, partial reconfiguration could potentially be used to suppress voltage attacks by asynchronous circuits (e.g., Figure 8.1(c)). Upon detecting an attack, response logic could initiate the partial reconfiguration of an offending region in an attempt to wipe out the attack circuitry.
For those who assume the role of an attacker, the very nature of FPGAs offers
almost unlimited malicious potential. The work in this dissertation is confined to
power wasters made from LUTs and flip-flops. Any hardened function provided on
the FPGA can potentially be used for building a power waster. These resources
include, but are not limited to, latches, block RAMs, multiplier-accumulator circuitry,
phase-locked loops, and I/O buffers.

BIBLIOGRAPHY

[1] Abadi, Martı́n, Budiu, Mihai, Erlingsson, Úlfar, and Ligatti, Jay. Control-flow
integrity principles, implementations, and applications. ACM Transactions on
Information and System Security (TISSEC) 13, 1 (2009), 1–40.
[2] Ahmed, Ibrahim, Zhao, Shuze, Trescases, Olivier, and Betz, Vaughn. Measure
twice and cut once: Robust dynamic voltage scaling for FPGAs. In International
Conference on Field Programmable Logic and Applications (FPL) (2016), pp. 1–
11.
[3] Alam, Md Mahbub, Tajik, Shahin, Ganji, Fatemeh, Tehranipoor, Mark, and
Forte, Domenic. RAM-Jam: Remote temperature and voltage fault attack on
FPGAs using memory collisions. In Workshop on Fault Diagnosis and Tolerance
in Cryptography (FDTC) (2019), pp. 48–55.
[4] Alibaba Cloud. Deep dive into Alibaba Cloud F3 FPGA as a service instances.
[Online]. Available: https://www.alibabacloud.com/blog/deep-dive-into-alibaba-cloud-f3-fpga-as-a-service-instances_594057. Accessed: 2019-01-01.
[5] Altera Corporation. Enabling Design Separation for High-Reliability and
Information-Assurance Systems, June 2009.
[6] Amazon Web Services. Amazon EC2 F1 instances. [Online]. Available: https:
//aws.amazon.com/ec2/instance-types/f1/. Accessed: 2018-10-01.
[7] Amazon Web Services. Clocks and Reset. [Online]. Available:
https://github.com/aws/aws-fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md#ClocksNReset.
Accessed: 2018-10-01.
[8] Amouri, Abdulazim, Hepp, Jochen, and Tahoori, Mehdi. Built-in self-heating
thermal testing of FPGAs. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems (TCAD) 35, 9 (2015), 1546–1556.
[9] Arora, Divya, Ravi, Srivaths, Raghunathan, Anand, and Jha, Niraj K. Secure
embedded processing through hardware-assisted run-time monitoring. In Design,
Automation & Test in Europe Conference & Exhibition (DATE) (2005), pp. 178–
183.
[10] Barbareschi, Mario, Di Natale, Giorgio, and Torres, Lionel. Implementation and
analysis of ring oscillator circuits on Xilinx FPGAs. In Hardware Security and
Trust. Springer, 2017, ch. 12, pp. 237–251.
[11] Bhunia, Swarup, Hsiao, Michael S, Banga, Mainak, and Narasimhan, Seetharam.
Hardware trojan attacks: Threat analysis and countermeasures. Proceedings of
the IEEE 102, 8 (2014), 1229–1247.
[12] Boneh, Dan, DeMillo, Richard A, and Lipton, Richard J. On the importance
of eliminating errors in cryptographic computations. Journal of cryptology 14, 2
(2001), 101–119.
[13] Caulfield, Adrian M., Chung, Eric S., Putnam, Andrew, Angepat, Hari, Fowers, Jeremy, Haselman, Michael, Heil, Stephen, Humphrey, Matt, Kaur, Puneet,
Kim, Joo-Young, Lo, Daniel, Massengill, Todd, Ovtcharov, Kalin, Papamichael,
Michael, Woods, Lisa, Lanka, Sitaram, Chiou, Derek, and Burger, Doug. A
cloud-scale acceleration architecture. In IEEE/ACM International Symposium
on Microarchitecture (MICRO) (2016), pp. 1–13.
[14] Chandrikakutty, Harikrishnan, Unnikrishnan, Deepak, Tessier, Russell, and
Wolf, Tilman. High-performance hardware monitors to protect network processors from data plane attacks. In ACM/IEEE Design Automation Conference
(DAC) (2013), pp. 1–6.
[15] Chow, C., Tsui, L.S.M., Leong, P.H.W., Luk, W., and Wilton, S.J.E. Dynamic
voltage scaling for commercial FPGAs. In International Conference on Field
Programmable Technology (FPT) (2005), pp. 173–180.
[16] Dumpala, Naveen Kumar, Patil, Shivukumar B, Holcomb, Daniel, and Tessier,
Russell. Energy efficient loop unrolling for low-cost FPGAs. In IEEE International Symposium on Field-Programmable Custom Computing Machines
(FCCM) (2017), pp. 117–120.
[17] Forrest, Stephanie, Hofmeyr, Steven, and Somayaji, Anil. The evolution of
system-call monitoring. In Annual Computer Security Applications Conference
(ACSAC) (2008), pp. 418–430.
[18] Fry, John, and Langhammer, Martin. RSA & public key cryptography in FPGAs.
Altera document (2005), 1–8.
[19] Gaisler, Jiri, and Habinc, Sandi. Grlib IP library user’s manual. Tech. rep.,
Cobham, Nov. 2017.
[20] Gartner, Inc. Gartner forecasts worldwide public cloud revenue to grow 17.5
percent in 2019. [Online]. Available: https://www.gartner.com/en/newsroom/
press-releases/2019-11-13-gartner-forecasts-worldwide-public-cloud-revenue-to-grow-17-percent-in-2020. Accessed: 2020-10-17.
[21] Giechaskiel, Ilias, Rasmussen, Kasper B, and Eguro, Ken. Leaky wires: Information leakage and covert communication between FPGA long wires. In
ACM Asia Conference on Computer and Communications Security (ASIACCS)
(2018), pp. 15–27.
[22] Giechaskiel, Ilias, Rasmussen, Kasper Bonne, and Szefer, Jakub. Measuring long
wire leakage with ring oscillators in cloud FPGAs. In International Conference
on Field Programmable Logic and Applications (FPL) (2019), pp. 45–50.
[23] Glamocanin, Ognjen, Coulon, Louis, Regazzoni, Francesco, and Stojilović, Mirjana. Are cloud FPGAs really vulnerable to power analysis attacks? In Design, Automation & Test in Europe Conference & Exhibition (DATE) (2020),
pp. 1007–1010.
[24] Gnad, Dennis R. E., Oboril, Fabian, Kiamehr, Saman, and Tahoori, Mehdi B. An
experimental evaluation and analysis of transient voltage fluctuations in FPGAs.
IEEE Transactions on Very Large Scale Integration Systems (TVLSI) 26, 10
(2019), 1817–1830.
[25] Gnad, Dennis RE, Oboril, Fabian, Kiamehr, Saman, and Tahoori, Mehdi B.
Analysis of transient voltage fluctuations in FPGAs. In International Conference
on Field-Programmable Technology (FPT) (2016), pp. 12–19.
[26] Gnad, Dennis RE, Oboril, Fabian, and Tahoori, Mehdi B. Voltage drop-based
fault attacks on FPGAs using valid bitstreams. In International Conference on
Field Programmable Logic and Applications (FPL) (2017), pp. 1–7.
[27] Gojman, Benjamin, Nalmela, Sirisha, Mehta, Nikil, Howarth, Nicholas, and DeHon, André. GROK-LAB: Generating real on-chip knowledge for intra-cluster
delays using timing extraction. ACM Transactions on Reconfigurable Technology
and Systems (TRETS) 7, 4 (2014), 32:1–32:23.
[28] Hu, Kekai, Wolf, Tilman, Teixeira, Thiago, and Tessier, Russell. System-level
security for network processors with hardware monitors. In ACM/IEEE Design
Automation Conference (DAC) (2014), pp. 1–6.
[29] Intel Corporation. TimeQuest Timing Analyzer Quick Start Tutorial, Dec. 2009.
[30] Intel Corporation. Intel Arria 10 Core Fabric and General Purpose I/Os Handbook, May 2018.
[31] Intel Corporation. Cyclone V Device Handbook, May 2019.
[32] Intel Corporation. Intel Stratix 10 Analog to Digital Converter User Guide, Oct. 2019.
[33] Intel Corporation. Intel Stratix 10 Logic Array Blocks and Adaptive Logic
Modules User Guide, Mar. 2020. [Online]. Available: https://www.intel.com/
content/www/us/en/programmable/documentation/wtw1441782332101.html.
Accessed: 2020-05-01.
[34] Jin, Chenglu, Gohil, Vasudev, Karri, Ramesh, and Rajendran, Jeyavijayan. Security of cloud FPGAs: A survey. arXiv preprint arXiv:2005.04867 (2020).
[35] Khawaja, Ahmed, Landgraf, Joshua, Prakash, Rohith, Wei, Michael, Schkufza,
Eric, and Rossbach, Christopher J. Sharing, protection, and compatibility for
reconfigurable fabric with AmorphOS. In USENIX Symposium on Operating
Systems Design and Implementation (OSDI) (2018), pp. 107–127.
[36] Klokotov, Dmitry, Shi, Jin, and Wang, Yong. Distributed modeling and characterization of on chip/system level PDN and jitter impact. In DesignCon (2014),
pp. 1–22.
[37] Knodel, Oliver, Lehmann, Patrick, and Spallek, Rainer G. RC3E: Reconfigurable
accelerators in data centres and their provision by adapted service models. In
IEEE International Conference on Cloud Computing (CLOUD) (2016), pp. 19–
26.
[38] Krautter, Jonas, Gnad, Dennis R.E., Schellenberg, Falk, Moradi, Amir, and
Tahoori, Mehdi B. Active fences against voltage-based side channels in multi-tenant FPGAs. In IEEE/ACM International Conference on Computer-Aided
Design (ICCAD) (2019), pp. 1–8.
[39] Krautter, Jonas, Gnad, Dennis RE, and Tahoori, Mehdi B. FPGAhammer:
Remote voltage fault attacks on shared FPGAs, suitable for DFA on AES. IACR
Transactions on Cryptographic Hardware and Embedded Systems (TCHES) 2018,
3 (2018), 44–68.
[40] Krautter, Jonas, Gnad, Dennis RE, and Tahoori, Mehdi B. Mitigating electrical-level attacks towards secure multi-tenant FPGAs in the cloud. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 12, 3 (2019), 12:1–
12:26.
[41] Lenstra, Arjen K. Memo on RSA signature generation in the presence of faults.
Manuscript, 1996. Available from the author: arjen.lenstra@citicorp.com.
[42] Li, Fei, Chen, Deming, He, Lei, and Cong, Jason. Architecture evaluation for
power-efficient FPGAs. In ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA) (2003), pp. 175–184.
[43] Li, Fei, Lin, Yizhou, He, Lei, Chen, Deming, and Cong, Jason. Power modeling and characteristics of field programmable gate arrays. IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems (TCAD) 24, 11
(2005), 1712–1724.
[44] Linear Technology. Wide Range I2C Power Monitor, Apr. 2012.
[45] Linn, Cullen, Rajagopalan, Mohan, Baker, Scott, Collberg, Christian S., Debray,
Saumya K., and Hartman, John H. Protecting against unexpected system calls.
In USENIX Security Symposium (2005), pp. 239–254.
[46] Maes, Roel, Schellekens, Dries, and Verbauwhede, Ingrid. A pay-per-use licensing
scheme for hardware IP cores in recent SRAM-based FPGAs. IEEE Transactions
on Information Forensics and Security (TIFS) 7, 1 (2011), 98–108.
[47] Mahmoud, Dina G., Hu, Wei, and Stojilović, Mirjana. X-Attack: Remote activation of satisfiability don’t-care hardware Trojans on shared FPGAs. In International Conference on Field Programmable Logic and Applications (FPL) (2020),
pp. 1–8.
[48] Mahmoud, Dina G., and Stojilović, Mirjana. Timing violation induced faults in
multi-tenant FPGAs. In Design, Automation & Test in Europe Conference &
Exhibition (DATE) (2019), pp. 1745–1750.
[49] Mal-Sarkar, Sanchita, Krishna, Aswin, Ghosh, Anandaroop, and Bhunia,
Swarup. Hardware Trojan attacks in FPGA devices: Threat analysis and effective countermeasures. In ACM Great Lakes Symposium on VLSI (GLSVLSI)
(2014), pp. 287–292.
[50] Mao, Shufu, and Wolf, Tilman. Hardware support for secure processing in embedded systems. IEEE Transactions on Computers (TC) 59, 6 (2010), 847–854.
[51] Matas, Kaspar, La, Tuan, Grunchevski, Nikola, Pham, Khoa, and Koch, Dirk.
Invited tutorial: FPGA hardware security for datacenters and beyond. In
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
(FPGA) (2020), pp. 11–20.
[52] Matas, Kaspar, La, Tuan Minh, Pham, Khoa Dang, and Koch, Dirk. Power-hammering through glitch amplification – attacks and mitigation. In IEEE International Symposium on Field-Programmable Custom Computing Machines
(FCCM) (2020), pp. 65–69.
[53] Matousek, Petr. Linux Kernel - 'SCTP_GET_ASSOC_STATS()' Stack Buffer
Overflow (PoC). [Online]. Available: https://www.exploit-db.com/exploits/24747. Accessed: 2017-02-01.
[54] Mbongue, Joel Mandebi, Shuping, Alex, Bhowmik, Pankaj, and Bobda,
Christophe. Architecture support for FPGA multi-tenancy in the cloud. In
IEEE International Conference on Application-Specific Systems, Architectures
and Processors (ASAP) (2020), pp. 125–132.
[55] Microsoft Azure. Deploy ML models to field-programmable gate arrays
(FPGAs) with Azure Machine Learning. [Online]. Available: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-fpga-web-service. Accessed: 2019-01-01.
[56] Paar, Christof, and Pelzl, Jan. The RSA cryptosystem. In Understanding cryptography: a textbook for students and practitioners. Springer Science & Business
Media, 2009, ch. 7, pp. 173–204.
[57] Pellegrini, Andrea, Bertacco, Valeria, and Austin, Todd. Fault-based attack of
RSA authentication. In Design, Automation & Test in Europe Conference &
Exhibition (DATE) (2010), pp. 855–860.
[58] Pouraghily, Arman, Wolf, Tilman, and Tessier, Russell. Hardware support
for embedded operating system security. In IEEE International Conference
on Application-specific Systems, Architectures and Processors (ASAP) (2017),
pp. 61–66.
[59] Provelengios, George, Holcomb, Daniel, and Tessier, Russell. Characterizing power distribution attacks in multi-user FPGA environments. In International Conference on Field Programmable Logic and Applications (FPL) (2019),
pp. 194–201. Best Paper Award.
[60] Provelengios, George, Holcomb, Daniel, and Tessier, Russell. Mitigating voltage
attacks in multi-tenant FPGAs. ACM Transactions on Reconfigurable Technology
and Systems (TRETS) (2020). (under review).
[61] Provelengios, George, Holcomb, Daniel, and Tessier, Russell. Power distribution attacks in multitenant FPGAs. IEEE Transactions on Very Large Scale
Integration Systems (TVLSI) 28, 12 (2020), 2685–2698.
[62] Provelengios, George, Holcomb, Daniel, and Tessier, Russell. Power wasting
circuits for cloud FPGA attacks. In International Conference on Field Programmable Logic and Applications (FPL) (2020), pp. 231–235.
[63] Provelengios, George, Pouraghily, Arman, Tessier, Russell, and Wolf, Tilman.
A hardware monitor to protect Linux system calls. In IEEE Computer Society
Annual Symposium on VLSI (ISVLSI) (2018), pp. 551–556.
[64] Provelengios, George, Ramesh, Chethan, Patil, Shivukumar B, Eguro, Ken,
Tessier, Russell, and Holcomb, Daniel. Characterization of long wire data leakage in deep submicron FPGAs. In ACM/SIGDA International Symposium on
Field Programmable Gate Arrays (FPGA) (2019), pp. 292–297.
[65] Putnam, Andrew, Caulfield, Adrian M, Chung, Eric S, Chiou, Derek, Constantinides, Kypros, Demme, John, Esmaeilzadeh, Hadi, Fowers, Jeremy, Gopal,
Gopi Prashanth, Gray, Jan, et al. A reconfigurable fabric for accelerating large-scale datacenter services. In International Symposium on Computer Architecture
(ISCA) (2014), pp. 13–24.
[66] Ragel, Roshan G, and Parameswaran, Sri. IMPRES: integrated monitoring for
processor reliability and security. In ACM/IEEE Design Automation Conference
(DAC) (2006), pp. 502–505.
[67] Ramesh, Chethan, Patil, Shivukumar B, Dhanuskodi, Siva Nishok, Provelengios,
George, Pillement, Sébastien, Holcomb, Daniel, and Tessier, Russell. FPGA side
channel attacks without physical access. In IEEE International Symposium on
Field-Programmable Custom Computing Machines (FCCM) (2018), pp. 45–52.
[68] Ravi, Srivaths, Raghunathan, Anand, and Chakradhar, Srimat. Tamper resistance mechanisms for secure embedded systems. In International Conference on
VLSI Design (VLSID) (2004), pp. 605–611.
[69] Ravi, Srivaths, Raghunathan, Anand, Kocher, Paul, and Hattangady, Sunil. Security in embedded systems: Design challenges. ACM Transactions on Embedded
Computing Systems (TECS) 3, 3 (2004), 461–491.
[70] Rivest, Ronald L, Shamir, Adi, and Adleman, Leonard. A method for obtaining
digital signatures and public-key cryptosystems. Communications of the ACM
21, 2 (1978), 120–126.
[71] Schellenberg, Falk, Gnad, Dennis RE, Moradi, Amir, and Tahoori, Mehdi B. An
inside job: Remote power analysis attacks on FPGAs. In Design, Automation &
Test in Europe Conference & Exhibition (DATE) (2018), pp. 1111–1116.
[72] Shen, Linda L, Ahmed, Ibrahim, and Betz, Vaughn. Fast voltage transients on
FPGAs: Impact and mitigation strategies. In IEEE International Symposium on
Field-Programmable Custom Computing Machines (FCCM) (2019), pp. 271–279.
[73] Sugawara, Takeshi, Sakiyama, Kazuo, Nashimoto, Shoei, Suzuki, Daisuke, and
Nagatsuka, Tomoyuki. Oscillator without a combinatorial loop and its threat to
FPGA in data centre. Electronics Letters 55, 11 (2019), 640–642.
[74] Takahashi, Tomoyuki, Uezono, Takumi, Shintani, Michihiro, Masu, Kazuya, and
Sato, Takashi. On-die parameter extraction from path-delay measurements. In
IEEE Asian Solid-State Circuits Conference (ASSCC) (2009), pp. 101–104.
[75] Tehranipoor, Mohammad, and Koushanfar, Farinaz. A survey of hardware trojan
taxonomy and detection. IEEE Design & Test of Computers 27, 1 (2010), 10–25.
[76] Terasic Technologies. DE4 User Manual, Mar. 2012.
[77] Terasic Technologies. DE1-SoC User Manual, Feb. 2014.
[78] Terasic Technologies. DE5a-Net User Manual, Aug. 2018.
[79] Terasic Technologies. DE10-Pro User Manual, Nov. 2019.
[80] Texas Instruments. INA231 High- or Low-Side Measurement, Bidirectional Current and Power Monitor With 1.8-V I2C Interface, Mar. 2018.
[81] Thomas, Tedy, Pouraghily, Arman, Hu, Kekai, Tessier, Russell, and Wolf,
Tilman. Multi-task support for security-enabled embedded processors. In IEEE
International Conference on Application-specific Systems, Architectures and Processors (ASAP) (2015), pp. 136–143.

[82] Ueno, Miho, Hashimoto, Masanori, and Onoye, Takao. Real-time supply voltage
sensor for detecting/debugging electrical timing failures. In IEEE International
Symposium on Parallel & Distributed Processing, Workshops and PhD Forum
(IPDPS) (2013), pp. 301–305.
[83] Wolf, Tilman, Chandrikakutty, Harikrishnan Kumarapillai, Hu, Kekai, Unnikrishnan, Deepak, and Tessier, Russell. Securing network processors with high-performance hardware monitors. IEEE Transactions on Dependable and Secure
Computing (TDSC) 12, 6 (2014), 652–664.
[84] Xie, Shuang, and Ng, Wai Tung. Delay-line temperature sensors and VLSI
thermal management demonstrated on a 60nm FPGA. In IEEE International
Symposium on Circuits and Systems (ISCAS) (2014), pp. 2571–2574.
[85] Xilinx Corporation. Vivado Isolation Design Flow, Sept. 2016.
[86] Yazdanshenas, Sadegh, and Betz, Vaughn. Interconnect solutions for virtualized
field-programmable gate arrays. IEEE Access 6 (2018), 10497–10507.
[87] Yu, Haile, Xu, Qiang, and Leong, Philip HW. Fine-grained characterization of
process variation in FPGAs. In International Conference on Field-Programmable
Technology (FPT) (2010), pp. 138–145.
[88] Zhang, Xuzhi, Shao, Xiaozhe, Provelengios, George, Dumpala, Naveen Kumar, Gao, Lixin, and Tessier, Russell. Scalable network function virtualization
for heterogeneous middleboxes. In IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM) (2017), pp. 219–226.
[89] Zhao, Mark, and Suh, G Edward. FPGA-based remote power side-channel attacks. In IEEE Symposium on Security and Privacy (S&P) (2018), pp. 229–244.
[90] Zhao, Shuze, Ahmed, Ibrahim, Betz, Vaughn, Lotfi, Ashraf, and Trescases,
Olivier. Frequency-domain power delivery network self-characterization in FPGAs for improved system reliability. IEEE Transactions on Industrial Electronics
(TIE) 65, 11 (2019), 8915–8924.
[91] Zick, Kenneth M, and Hayes, John P. On-line sensing for healthier FPGA systems. In ACM/SIGDA International Symposium on Field Programmable Gate
Arrays (FPGA) (2010), pp. 239–248.
[92] Zick, Kenneth M, and Hayes, John P. Low-cost sensing with ring oscillator
arrays for healthier reconfigurable systems. ACM Transactions on Reconfigurable
Technology and Systems (TRETS) 5, 1 (2012), 1:1–1:26.
[93] Zick, Kenneth M, Srivastav, Meeta, Zhang, Wei, and French, Matthew. Sensing nanosecond-scale voltage attacks and natural transients in FPGAs. In
ACM/SIGDA International Symposium on Field Programmable Gate Arrays
(FPGA) (2013), pp. 101–104.

[94] Ziener, Daniel, Baueregger, Florian, and Teich, Jürgen. Using the power side
channel of FPGAs for communication. In IEEE International Symposium on
Field-Programmable Custom Computing Machines (FCCM) (2010), pp. 237–244.
