Low-overhead fault-tolerant logic for field-programmable gate arrays by Davis, James
Imperial College London
Department of Electrical and Electronic Engineering
Low-overhead Fault-tolerant Logic for
Field-programmable Gate Arrays
James J. Davis
November 2015
Supervised by Professor Peter Y. K. Cheung
Submitted in part fulfilment of the requirements for the degree of
Doctor of Philosophy in Electrical and Electronic Engineering of Imperial College London
and the Diploma of Imperial College London
Licence
The copyright of this thesis rests with the author and is made available under a Creative
Commons Attribution Non-commercial No Derivatives licence. Researchers are free to
copy, distribute or transmit the thesis on the condition that they attribute it, that they
do not use it for commercial purposes and that they do not alter, transform or build upon
it. For any reuse or redistribution, researchers must make clear to others the licence terms
of this work.
Statement of Originality
The work in this thesis is my own, except where it has been appropriately referenced and
attributed.
Abstract
While allowing for the fabrication of increasingly complex and efficient circuitry, tran-
sistor shrinkage and count-per-device expansion have major downsides: chiefly increased
variation, degradation and fault susceptibility. For this reason, design-time considera-
tion of faults will have to be given to increasing numbers of electronic systems in the
future to ensure yields, reliabilities and lifetimes remain acceptably high. Many mathe-
matical operators commonly accelerated in hardware are suited to modification resulting
in datapath error detection and correction capabilities with far lower area, performance
and/or power consumption overheads than those incurred through the utilisation of more
established, general-purpose fault tolerance methods such as modular redundancy. Field-
programmable gate arrays are uniquely placed to allow further area savings to be made
thanks to their dynamic reconfigurability.
The majority of the technical work presented within this thesis is based upon a bench-
mark hardware accelerator—a matrix multiplier—that underwent several evolutions in
order to detect and correct faults manifesting along its datapath at runtime. In the first
instance, fault detectability in excess of 99% was achieved in return for 7.87% additional
area and 45.5% extra latency. In the second, the ability to correct errors caused by those
faults was added at the cost of 4.20% more area, while 50.7% of this—and 46.2% of the
previously incurred latency overhead—was removed through the introduction of partial
reconfiguration in the third. The fourth demonstrates further reductions in both area
and performance overheads—of 16.7% and 8.27%, respectively—through systematic data
width reduction by allowing errors of less than ±0.5% of the maximum output value to
propagate.
Acknowledgements
My thanks, firstly, go to my supervisor, Peter Cheung. I could not have asked for a better
supervisory experience: Peter has allowed me the freedom to explore and develop largely
autonomously while always remaining able to provide fresh insight and offer support
when it was needed. As the lead academic of the weekly ‘Reliability Club’ meetings
and supervisor for my current research assistantship, George Constantinides has played a
pivotal role in my previous, and ongoing, research, including much of the content of this
thesis. Both David Thomas and Christos Bouganis have also provided perspective for the
work, for which I am grateful.
To all other members of the Reliability Club, both past and present—Rosella Arcucci,
Sam Bayliss, David Boland, Sumanta Chaudhuri, Rui Duarte, Zhenyu Guan, Eddie Hung,
Josh Levine, Karthick Parashar, Kan Shi, Ed Stott, Michail Vavouras, Justin Wong and
Rong Ye—I owe thanks for guidance and suggestions, without which much of the work
in this thesis would not have been possible. Peter Ogden’s advice and assistance have also
been invaluable.
Members of the Circuits and Systems group, many of whom have already been men-
tioned, have provided help and relief from the stresses of work over the past few years,
for both of which I am thankful. In particular, Wiesia Hsissen has been instrumental in
allowing me to complete my research while never knowingly missing a deadline.
Special thanks go to my friends Ed Stott and Cate Slade for their hospitality over the
recent months. Their kindness and generosity have allowed me to complete this thesis safe
in the knowledge that I have a warm and happy home to return to each evening.
Many thanks to lifelong friends Pete & Carol Miller for their boundless encouragement.
My friends across the Atlantic—particularly Dan & Lisa Albright, Clark & Kelly Grace,
Jo Dee & Carl Schultz and Joe Smith—as well as my sister, Louise, have done a fantastic
job of keeping me entertained during periods of—and please pardon the pun—downtime.
I am of course grateful to my parents, Chris and Mary, for everything that they have
done, and continue to do, for me. I would not be who or where I am today without their
encouragement and support.
Last but by no means least, my thanks go to my partner and best friend, Melanie
Albright, for her unwavering support and love.
Thank you.
Contents
1 Introduction
1.1 Contributions
1.2 Publications
1.3 Outline
2 Background
2.1 Introduction
2.1.1 Outline
2.2 Faults & Errors
2.3 Degradation
2.4 Fault Detection
2.4.1 Offline
2.4.2 Online (Roving)
2.4.3 Online (Health Monitoring)
2.5 Fault Mitigation
2.6 Error Correction
2.6.1 Compile-time Provisioning
2.6.2 Runtime Provisioning
2.7 Algorithm-based Fault Tolerance
2.7.1 Matrix Encoding & Decoding
2.7.2 Application to Arithmetic Operations
2.7.3 Result Classification
2.8 Conclusion
3 Algorithm-tailored Low-overhead Online Error Detection
3.1 Introduction
3.1.1 Contributions
3.1.2 Publications
3.1.3 Outline
3.2 Implementation
3.2.1 Hardware-software Platform
3.2.2 Baseline Architecture
3.2.3 Checksum Generation & Verification
3.2.4 Fault Location
3.2.5 Error Injection
3.3 Overheads
3.3.1 Area
3.3.2 Performance
3.4 Fault Observability
3.5 Conclusion
4 Error Correction via Runtime Resource Reallocation
4.1 Introduction
4.1.1 Contributions
4.1.2 Publications
4.1.3 Outline
4.2 Implementation
4.2.1 Additional Logic
4.2.2 Partial Routing Reconfiguration
4.3 Overheads
4.3.1 Area
4.3.2 Performance
4.3.3 Memory
4.4 Conclusion
5 Fault Observability for Matrix & DSP Operations
5.1 Introduction
5.1.1 Contributions
5.1.2 Outline
5.2 Method
5.3 Matrix-matrix Multiplication
5.4 Matrix Addition
5.5 Matrix-vector Multiplication
5.6 Conclusion
6 Reduced-precision Algorithm-based Fault Tolerance
6.1 Introduction
6.1.1 Contributions
6.1.2 Publications
6.1.3 Outline
6.2 Principles of RP-ABFT
6.2.1 MSB-first Truncation
6.2.2 LSB-first Truncation
6.3 Implementation
6.3.1 Baseline Architecture
6.3.2 MSB-first Truncation
6.3.3 LSB-first Truncation
6.4 Overheads
6.4.1 Area
6.4.2 Performance
6.5 Fault Observability
6.6 Conclusion
6.6.1 Future Work
7 Conclusion
7.1 Summary
7.2 Future Work
List of Tables
2.1 Comparison of fault detection methods
3.1 Baseline & ABFT-protected accelerator resource usage
3.2 Baseline & ABFT-protected accelerator performance
4.1 Baseline & ABFT-protected accelerator with additional logic & DPR error correction resource usage
4.2 ABFT-protected accelerator with additional logic & DPR error correction performance
4.3 ABFT-protected accelerator with DPR error correction bitstream storage requirements
6.1 Baseline, MSB-first & LSB-first truncated accelerator LUT usage
6.2 Baseline, MSB-first & LSB-first truncated accelerator FF usage
6.3 Baseline, MSB-first & LSB-first truncated accelerator BRAM & DSP usage
6.4 Baseline, MSB-first & LSB-first truncated accelerator total resource usage
6.5 Baseline, MSB-first & LSB-first truncated accelerator fmax
6.6 RP-ABFT input & output checksum widths
List of Figures
2.1 Test arrangement used by Stroud et al. [1]
2.2 Delay measurement method proposed by Wong et al. [2]
2.3 Visualisation of the roving test scheme proposed by Abramovici et al. [3]
2.4 Timing slack measurement method proposed by Levine et al. [4]
2.5 Wear-levelling strategies proposed by Stott et al. [5]
2.6 Visualisation of the repair scheme proposed by Lach et al. [6]
2.7 Visualisation of the repair scheme proposed by Hanchek et al. [7]
2.8 Visualisation of the repair scheme proposed by Emmert et al. [8]
2.9 Evolutionary repair scheme proposed by DeMara et al. [9]
3.1 Top-level system block diagram
3.2 Baseline datapath for matrix multiplication
3.3 Pipelined accumulator
3.4 ABFT-protected matrix multiplication datapath
3.5 Checksum generation logic for matrix multiplication accelerator
3.6 Checksum verification logic for matrix multiplication accelerator
3.7 ABFT-protected accelerator resource usage overhead versus baseline
3.8 ABFT-protected accelerator latency overhead versus baseline
3.9 Fault proportions for ABFT-protected matrix multiplication
3.10 Fault proportions for ABFT-protected matrix multiplication scaled by area
4.1 ABFT-enabled datapath with circular shifters
4.2 Resource reallocation for s = 2 with single fault using circular shifters
4.3 Circular shifter
4.4 System block diagram with DPR
4.5 ABFT-enabled datapath with DPR
4.6 Routing configurations available for s = 2
4.7 Resource reallocation for s = 2 with single fault using DPR
4.8 Example double-fault routing reconfigurations when s = 4
4.9 Example quadruple-fault routing reconfigurations when s = 8
4.10 ABFT-protected accelerator with additional logic & DPR error correction resource usage overhead versus baseline
4.11 ABFT-protected accelerator with additional logic & DPR error correction combined resource usage overhead versus baseline
4.12 ABFT-protected accelerator with additional logic & DPR error correction latency overhead versus baseline
5.1 Matrix-matrix multiplication permanent fault observability
5.2 Matrix-matrix multiplication permanent fault locatability
5.3 Matrix-matrix multiplication transient fault observability
5.4 Matrix-matrix multiplication transient fault locatability
5.5 Matrix addition permanent fault observability
5.6 Matrix addition permanent fault locatability
5.7 Matrix addition transient fault observability
5.8 Matrix addition transient fault locatability
5.9 Matrix-vector multiplication permanent fault observability
5.10 Matrix-vector multiplication transient fault observability
6.1 Datapath with zero truncation
6.2 Checksum generation logic with zero truncation
6.3 Checksum verification logic with zero truncation
6.4 Datapath with MSB-first truncation
6.5 Checksum generation logic with MSB-first truncation
6.6 Checksum verification logic with MSB-first truncation
6.7 Datapath with LSB-first truncation
6.8 Checksum generation logic with LSB-first truncation
6.9 Checksum verification logic with LSB-first truncation
6.10 RP-ABFT with MSB-first truncation-protected accelerator resource usage overhead versus baseline
6.11 RP-ABFT with MSB-first truncation-protected accelerator combined resource usage overhead versus baseline
6.12 RP-ABFT with LSB-first truncation-protected accelerator resource usage overhead versus baseline
6.13 RP-ABFT with LSB-first truncation-protected accelerator combined resource usage overhead versus baseline
6.14 RP-ABFT with MSB-first truncation-protected accelerator fmax versus baseline
6.15 RP-ABFT with LSB-first truncation-protected accelerator fmax versus baseline
6.16 Detected fault proportions for RP-ABFT-protected accelerator
6.17 False positive fault proportions for RP-ABFT-protected accelerator
6.18 False negative fault proportions for RP-ABFT-protected accelerator
6.19 Masked fault proportions for RP-ABFT-protected accelerator
6.20 Located fault proportions for RP-ABFT-protected accelerator
6.21 Means of maximum absolute errors for RP-ABFT-protected accelerator
6.22 Area-scaled detected fault proportions for RP-ABFT-protected accelerator
6.23 Area-scaled false positive fault proportions for RP-ABFT-protected accelerator
6.24 Area-scaled false negative fault proportions for RP-ABFT-protected accelerator
6.25 Area-scaled masked fault proportions for RP-ABFT-protected accelerator
6.26 Area-scaled located fault proportions for RP-ABFT-protected accelerator
List of Symbols & Abbreviations
a accumulator latency
⌈ ⌉ ceiling
csc column-wise checksum vector
csr row-wise checksum vector
csin input checksum element
csout output checksum element
csout, c corner output checksum element
d distance
∆ change in
din input data element
dout output data element
ε maximum absolute error
f number of simultaneous faults
⌊ ⌋ floor
fmax timing model-inferred maximum operating frequency
m multiplier latency
µ mean
n data width
r truncation width
s square matrix size
θ output checksum element error threshold
θc corner output checksum element error threshold
∨ maximum absolute value
ABFT algorithm-based fault tolerance
AMD Advanced Micro Devices
ARM Acorn reduced instruction set computer machine
ASIC application-specific integrated circuit
AXI Advanced eXtensible Interface
BIST built-in self-test
BRAM block random-access memory
BUT block-under-test
CAD computer-aided design
CLK clock
CPU central processing unit
CTR counter
CUT circuit-under-test
DFT discrete Fourier transform
DMA direct memory access
DPR dynamic partial reconfiguration
DRAM dynamic random-access memory
DSP digital signal processing block
EEPROM electrically erasable programmable read-only memory
EM electromigration
FF flip-flop
FFT fast Fourier transform
FIR finite impulse response
FPGA field-programmable gate array
FSM finite state machine
GPU graphics processing unit
HCI hot carrier injection
HDL hardware description language
I/O input/output
IIR infinite impulse response
ISE Integrated Synthesis Environment
LB logic block
LR launch register
LSB least-significant bit
LU lower- and upper-triangular
LUT look-up table
MAC multiply-accumulator
MOS metal oxide semiconductor
MSB most-significant bit
MUX multiplexer
NBTI negative-bias temperature instability
NMOS n-type metal oxide semiconductor
OpenCL Open Computing Language
ORA output response analyser
PBTI positive-bias temperature instability
PCAP processor configuration access port
PL programmable logic
PLB programmable logic block
PMOS p-type metal oxide semiconductor
PS processor subsystem
PUT paths-under-test
RAM random-access memory
RP-ABFT reduced-precision algorithm-based fault tolerance
SA set of healthy spare logic cells
SA0 stuck-at-zero
SA1 stuck-at-one
SG sample register
SM set of logical functions assigned to faulty logic cells
SoC system-on-chip
SP signal path
SRAM static random-access memory
STAR self-testing area
TAC transition activity counter
TCG test clock generator
TDDB time-dependent dielectric breakdown
TMR triple modular redundancy
TPA transition probability analyser
TPG test pattern generator
TVG test vector generator
VHDL very high-speed integrated circuit hardware description language
XOR exclusive-OR
1 Introduction
Aggressive process scaling leads to increasing uncertainty in the behaviour of metal oxide
semiconductor (MOS) transistors, in turn decreasing their reliability. Above transistor
level, such phenomena and the faults they induce can lead to reduced yield, decreased
system reliability and, in extreme cases, total failure after a period of successful operation.
Although error detection and correction are almost always considered for highly sensitive
and susceptible applications such as those to be deployed in space or the battlefield, they
are often overlooked for other, more general-purpose applications; this is likely to have to
change in the future as the effects caused by such scaling continue to worsen. Self-testing
and -repairing reconfigurable circuits may well present themselves to be viable facilitators
for overcoming the reliability problems that transistor scaling causes.
Modern field-programmable gate arrays (FPGAs) are often exploited for their ability to
realise high-performance hardware, typically through parallelisation and pipelining, with-
out the high setup costs associated with application-specific integrated circuits (ASICs).
As devices built from many millions or even billions of transistors, FPGAs are at least as
susceptible as ASICs to the mechanisms that degrade the switching performance of those
transistors and eventually render them inoperable. Furthermore, FPGAs’ reliance upon
random-access memory (RAM) for configuration storage intensifies the potential for damage
from radiation-induced upsets, since such occurrences can alter circuit-defining configuration
in addition to corrupting data. FPGAs do, however, make ideal platforms upon which
to develop and verify fault-tolerant techniques: their resource abundance, hierarchical
structure and runtime reconfigurability present many exciting possibilities not only for
the creation of elaborate configurations, but also for preventing, detecting, analysing
and/or correcting faults that occur during their lifetimes.
While common fault tolerance strategies frequently incur significant overheads in area,
performance and/or power consumption, options exist that buck these trends. In particular,
algorithm-based fault tolerance (ABFT) embodies a proven family of low-overhead error
mitigation techniques that can be built upon to create self-verifying circuitry. ABFT
protection can be applied to a wide range of linear algebraic operations commonly
accelerated in hardware, making it a suitable basis for the hardening of custom logic.
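As a rough illustration of the checksum idea behind ABFT, the following Python sketch assumes the classical encoding in which a row of column sums is appended to one operand and a column of row sums to the other, so that their product carries checksums that can be verified afterwards. The function names and the 2 × 2 operands are illustrative assumptions rather than anything taken from this thesis' hardware; the encoding actually used is detailed in Section 2.7.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add_column_checksums(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksums(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def checksums_consistent(c_full):
    """True if every row and column of c_full sums to its checksum."""
    rows_ok = all(sum(row[:-1]) == row[-1] for row in c_full)
    cols_ok = all(sum(col[:-1]) == col[-1] for col in zip(*c_full))
    return rows_ok and cols_ok


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# The product of the encoded operands carries its own checksums.
c_full = matmul(add_column_checksums(a), add_row_checksums(b))
fault_free = checksums_consistent(c_full)

# Emulate a datapath error by corrupting one data element.
c_faulty = [row[:] for row in c_full]
c_faulty[0][1] += 4
error_detected = not checksums_consistent(c_faulty)
```

In this sketch the check costs one extra row and one extra column of arithmetic rather than a full duplicate of the multiplier, which is the intuition behind the low overheads claimed for ABFT relative to modular redundancy.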
This thesis is representative of several years’ research into the design, implementation
and verification of ‘bolt-on’ error detection logic used to facilitate runtime fault tolerance
of FPGA-implemented hardware accelerators. ABFT is shown to be an effective error
detection tool through a case study, which is expanded upon to allow for the reduction
of algorithmic parallelisation in order to maintain accurate operation under faulty con-
ditions. Further case studies are considered when analysing the fault observability of
ABFT-protected operators. The sacrifice of some detectability, particularly of faults
associated with low-magnitude errors, is explored in order to achieve additional gains in
area and power efficiency.
1.1 Contributions
The original contributions of the work presented in this thesis are:
• The implementation of a complete hardware-software platform for the verification
of ABFT protection of a benchmark matrix multiplication accelerator.
• A quantitative analysis of the overheads—in terms of area and performance—
incurred through the incorporation of ABFT protection within that benchmark
circuit.
• Insight into the hardware fault tolerance of ABFT upon that benchmark.
• The first implementation of custom logic for error correction in the presence of faulty
resources guided by an ABFT error detection mechanism.
• The first implementation of ABFT-protected hardware using dynamic partial recon-
figuration (DPR) for recovery.
• A quantitative analysis of the overheads—of resources, performance and memory—
incurred through the incorporation of those error correction strategies into a bench-
mark hardware accelerator.
• A software framework for fault simulation in hardware-accelerated linear algebra
operators protected with ABFT.
• A thorough analysis of the fault tolerance of three benchmark ABFT-protected op-
erators.
• The first consideration of distance-x, for x > 2, ABFT application in custom logic.
• The first consideration of distinct data and checksum bit-widths within
ABFT-protected operations: reduced-precision algorithm-based fault toler-
ance (RP-ABFT).
• The first implementation of circuitry incorporating RP-ABFT for resilience against
hardware faults.
• Analysis of the costs and benefits of applying two forms of RP-ABFT to various
precisions.
• Insight into the hardware fault tolerance of RP-ABFT.
1.2 Publications
The original contributions made in this thesis and related work have been published as
the following peer-reviewed conference papers:
• J. J. Davis and P. Y. K. Cheung, “Datapath Fault Tolerance for Parallel Accelera-
tors,” in International Conference on Field-programmable Technology (FPT), 2013.
• J. J. Davis and P. Y. K. Cheung, “Reducing Overheads for Fault-tolerant Datapaths
with Dynamic Partial Reconfiguration,” in IEEE International Symposium on Field-
programmable Custom Computing Machines (FCCM), 2014.
• J. J. Davis and P. Y. K. Cheung, “Achieving Low-overhead Fault Tolerance for
Parallel Accelerators with Dynamic Partial Reconfiguration,” in International Con-
ference on Field-programmable Logic and Applications (FPL), 2014.
• J. J. Davis and P. Y. K. Cheung, “Reduced-precision Algorithm-based Fault Toler-
ance for FPGA-implemented Custom Accelerators,” in International Workshop on
Applied Reconfigurable Computing (ARC), 2016.
1.3 Outline
The remainder of this thesis is organised as follows. Chapter 2 primarily consists of a
thorough literature review, with research having been focussed upon schemes proposed
to achieve fault tolerance—testing methods, techniques designed to mitigate degradation
and fault-correction schemes—with particular attention paid to those for achieving runtime
fault tolerance in FPGAs. Chapter 2 also introduces ABFT, the error detection mechanism
of which represents the foundation for much of this thesis’ technical content. In Chapter 3,
an implementation for the detection of runtime data errors occurring within an ABFT-
protected hardware accelerator is presented. This work is expanded upon in Chapter 4
in order to create a fault-tolerant hardware platform: logic capable of detecting faults
within its datapath, and autonomously acting to rearrange itself in order to route around
them, at runtime. Chapter 5 forms an in-depth investigation into the observability of faults
within ABFT-protected datapaths, a software simulation framework having been designed
to ascertain fault observabilities for three hardened mathematical operations. Chapter 6,
the final technical chapter, presents research into the application of ABFT at lower levels
of precision than in prior work, introducing a previously unexplored area-to-allowed error
tradeoff. Concluding remarks are presented in Chapter 7.
2 Background
2.1 Introduction
In order to establish the current state of the art, and to identify gaps and promising direc-
tions in the literature to date, a thorough review was completed. Research focussed upon
the three aspects vital to the exploitation of runtime reconfiguration for fault tolerance in
field-programmable gate arrays (FPGAs): testing methods, techniques designed to mitigate
degradation and fault correction schemes.
2.1.1 Outline
The remainder of this chapter is organised as follows. Section 2.2 contains discussion of
the various types of fault that affect metal oxide semiconductor (MOS) transistors, while
Section 2.3 gives an overview of the mechanisms that contribute to their degradation over
time. Section 2.4 contains details of methods that can be employed for the detection of
faults and monitoring of degradation effects, grouped into either offline (Section 2.4.1),
online via roving scan (Section 2.4.2) or online via health monitoring (Section 2.4.3) cate-
gories, as appropriate. Fault mitigation—pre-emptive action—is discussed in Section 2.5,
while error correction—reactive action—is considered in Section 2.6. Error correction
methods are classified depending on whether they require compile-time provisioning (Sec-
tion 2.6.1) or are free to provision fully at runtime (Section 2.6.2). Section 2.7 introduces
algorithm-based fault tolerance (ABFT); the subsections within it detail the ABFT
checksumming procedures (Section 2.7.1) and demonstrate their application to suitable
operators (Section 2.7.2) by numerical example. A method for classifying results based upon
the locations of incorrectly computed elements is presented in Section 2.7.3. Concluding
remarks are given in Section 2.8.
2.2 Faults & Errors
The distinction between various types of faults and the errors they cause is central to the
concepts presented in this thesis.
Broadly speaking, faults fall into one of two categories: permanent and transient. Per-
manent faults are those present as a result of manufacturing defects or, potentially, that
manifest after a period of time due to the effects of degradation. Stuck-at-zero (SA0) and
stuck-at-one (SA1) faults, respectively describing nets that become stuck at low and high
logic levels, along with open and short circuits, are the permanent faults most often modelled. Timing
faults—slow-switching transistors being the obvious example—represent a subcategory of
permanent fault whereby errors appear only under certain timing-related conditions, such
as at certain operating frequencies. Transient faults are those usually caused by radiation,
generally by the ‘flipping’ of bits within memories. Such faults may lead to knock-on er-
rors occurring until they are corrected, for example by overwriting values stored in affected
memory locations.
The nature of FPGAs blurs the line between permanent and transient faults, since con-
figuration memory upsets—traditionally considered to be transient—can cause continuing
incorrect circuit functionality; classically a symptom of permanent faults. For this reason,
the concept of hard and soft faults is introduced: hard faults are those that change system
behaviour via physical circuit alteration, while soft faults are those that do so through
configuration changes.
In the work presented in the technical chapters of this thesis, data errors (i.e. incorrectly
computed values) are used to infer the presence and location of the faults that cause them,
thus allowing for corrective action to be taken to prevent those faults causing errors in the
future. In terms of error correction, the focus is generally on permanent hard faults since,
as outlined in Sections 2.5 and 2.6, fault tolerance strategies exist that are considered to
combat other types more adequately.
2.3 Degradation
Five main mechanisms contribute to the physical degradation of transistors in FPGAs
and other MOS-based devices [10] [11]: negative-bias temperature instability (NBTI) and
positive-bias temperature instability (PBTI), hot carrier injection (HCI), electromigration
(EM) and time-dependent dielectric breakdown (TDDB). All are predicted to result in
worsening effects, shortening device lifetime, as feature sizes decrease due to corresponding
increases in gate field strengths and current densities [12].
NBTI and PBTI, respectively applicable to p-type metal oxide semiconductor (PMOS)
and n-type metal oxide semiconductor (NMOS) transistors, along with HCI, cause charges
to become trapped in their gate-channel interfaces, increasing threshold voltages and re-
ducing channel mobilities [13]. These effects result in decreased switching speeds, leading
to timing faults. EM can lead to bridging faults (short circuits) caused by the movement
of metal ions within and between transistor interconnects, while TDDB gives rise to the
creation of shorts across transistors via a trapping of charges in gate oxides, resulting
in increased gate leakage that promotes further charge trapping, and so on in a cyclical
fashion [10].
NBTI and PBTI are considered to be the dominant mechanisms causing degradation in
modern FPGAs [13] [14]; this is particularly true of those with high-K gate dielectrics and
metal gates. NBTI promoted by static-zero logic inputs has been shown to cause the most
rapid degradation in the look-up tables (LUTs) of Altera Cyclone III [15] FPGAs [5].
2.4 Fault Detection
Before any type of error correction can be applied, the faults causing those errors must
be detected and the offending resources located. The following three subsections describe
testing methods suitable for the detection and location of all fault types, both with regard
to logical function, i.e. correct output, and timing. Schemes can be considered to be either
offline or online depending on whether they are carried out independently of an FPGA’s
application configuration (offline) or not (online). Online methods can be further classified
as operating either in tandem with, but distinct from, the application (roving scans), or
as part of the application configuration itself (health monitoring). Note that many of
the methods presented as offline could be adapted to become online roving schemes and
vice-versa.
Although academic research into the detection of hard fault-causing defects is mature
(focus has now shifted towards the detection of timing faults and process variation, while
some defect-tolerant schemes have been adopted commercially), methods for detecting such
defects are discussed here since they remain relevant to mitigation and correction schemes
that require hard fault detection and location.
2.4.1 Offline
During system start-up, or in applications where periodic downtime of an FPGA is possi-
ble, one or more dedicated testing configurations can be loaded onto a device in order to
test its resources. Self-contained offline testing methods, i.e. ones which operate without
external test hardware, are known as built-in self-test (BIST) schemes. Typically, logic
is configured as test pattern generators (TPGs), circuits-under-test (CUTs) and output
response analysers (ORAs), with placements of these often exchanged across test phases
to verify all resources. Offline testing methods are able to achieve high fault coverage due
to the complete flexibility presented by having an unused array to test, impose no
restrictions upon the application configurations that can be implemented on the FPGAs they
are designed to test and can locate dormant faults, i.e. those that exist in resources not
used by the application configuration. They do, however, require application circuitry to
be taken entirely offline and, where periodic testing is employed, fault detection latencies
will be limited by the frequency of that testing.
Programmable logic block (PLB) functional testing is a highly researched
area [1] [16] [17]. Groups of logic are often cascaded to form paths-under-test (PUTs) in
order to lower testing times by reducing the numbers of reconfigurations required [16] [17].
Figure 2.1 shows one such test arrangement [1]: C groups of m-bit TPGs, where m is the
number of inputs to each block-under-test (BUT), are used to drive C groups of n BUTs
connected in parallel. The O outputs of each BUT are then fed into the corresponding O
groups of n ORAs in order to compare behaviours across the C groups of BUTs. Testing
covering entire FPGAs has been achieved with detection granularities down to one in five
LUTs [17].
Functional testing specific to the location of interconnect faults has also been dis-
cussed [18] [19] [20], with some originally presented methods [18] extended [21] to allow
fault locations to be established once they have been detected. Hierarchical techniques
suitable for testing cluster-based FPGAs have also been proposed [19]. Recent work [20]
has resulted in a testing scheme for both interconnect and PLBs that obtains 100% fault
coverage.
Figure 2.1: Test arrangement used by Stroud et al. [1], exemplifying the functional testing
of PLBs with TPGs, CUTs and ORAs.
Timing fault BIST in PLBs has also been considered [22], and it has been shown
that LUT propagation delays are dependent upon both the functions they implement
and the input patterns applied to them. The delay measurement method presented by
Wong et al. [2] allows the maximum operating frequency of an arbitrary combinatorial
and/or sequential circuit to be established by recording the transition probabilities, i.e.
likelihoods that logic levels change between cycles, at its output as the clock frequency
is swept. Steps in plots of transition probability versus frequency allow rising and falling
edge propagation delays to be estimated. Figure 2.2 shows the required arrangement
of measurement circuitry around the arbitrary logic being tested: an N-bit test vector
generator (TVG) is used to supply inputs to the CUT, sandwiched between registers
clocked at varying frequencies by a test clock generator (TCG). The M outputs of the
CUT are then analysed by the transition activity counter (TAC) and transition probability
analyser (TPA) that follow. Frequency estimates have been shown to be within 12% of
those obtainable using an alternative, exhaustive testing procedure [23].
Timing fault testing of FPGA interconnect has also been explored [24] [25] [26]. Two
test configurations are presented by Wang et al. [25]: the first using feedback to alleviate
clock skew and the second for validating that skew across PLBs, with a combination of
the two reported to achieve defect coverage of over 98%. A more recent method [26] was
shown to achieve 100% diagnostic resolution in the presence of individual defects, and
almost 100% in the presence of pairs of defects.
Offline testing methods have also been proposed for testing less frequently considered
FPGA hardware. BIST techniques for embedded multipliers have been presented [27] [23],
while work addressing testing of digital signal processing blocks [28] and embedded memo-
ries [29] has also been completed. Recent publications have addressed timing fault testing
in clock networks [30] [31] and input/output hardware [32].
Figure 2.2: Delay measurement method proposed by Wong et al. [2] that allows the maxi-
mum operating frequency of an arbitrary combinatorial and/or sequential cir-
cuit to be established by recording the transition probabilities at its output as
the clock frequency is swept.
2.4.2 Online (Roving)
Dynamic partial reconfiguration (DPR) can be exploited to allow portions of an FPGA to
be tested while the remainder continues to perform its normal functions. Once verification
of particular areas has been completed, test hardware is moved to allow different areas to be
verified. While an entire array does not need to be reconfigured to facilitate testing, as in
offline testing methods, the requirement to introduce temporary testing areas into designs
limits maximum resource usage and forces clock slowdown due to path-lengthening effects:
required slowdowns of around 2.5–15.1% have been reported [3]. Roving scans present
opportunities to detect dormant faults, but often introduce significant fault detection
latency: chip-wise scans have been reported to take 850 ms [3], implying worst-case fault
detection latencies of the same duration.
Abramovici et al. presented a roving scheme [3] that tests PLB and interconnect func-
tionality by configuring a chip-wide row and column of PLBs as self-testing areas (STARs).
Complete rows and columns are used to allow testing of horizontal and vertical global
routing resources, respectively. Each PLB within each STAR is used as part of either a
TPG, CUT or ORA, with these roles rotated regularly to allow each PLB to be tested in
all modes. Over time, the STARs are moved in order to test all PLB and interconnect
resources. Figure 2.3 shows the roving principle for a column of resources: application
functionality is shifted to a neighbouring column such that the column-under-test is moved
by one place. Improvements to the diagnosis techniques originally used have been pro-
posed [33] [34] [35]. Changes have been suggested that allow individual or pairs of faulty
PLBs to be identified without reconfiguration, reducing the overall diagnosis time [33].
A new testing architecture has been presented [34], reportedly able to identify 96% of
faulty PLBs in an FPGA with up to 10% of randomly distributed faulty logic resources:
a 38% improvement over the original diagnosis scheme. Focussed upon the location of
interconnect faults, another modified testing scheme [35] reduces detection latency by an
order of magnitude over its predecessor by greatly reducing the number of reconfigurations
required. A divide-and-conquer approach is used to identify points of failure and achieve
high diagnosability, including in the presence of multiple faults: 99.3% fault coverage was
reported with a fault density of 10%.
Figure 2.3: Visualisation of the roving test scheme proposed by Abramovici et al. [3], show-
ing the roving principle for a column of resources: application functionality is
shifted to a neighbouring column such that the column-under-test is moved by
one place.
Architectural modifications to facilitate online PLB functional fault detection have been
proposed [36]: a roving scan scheme that reserves a column of resources at design-time to
facilitate testing was suggested. Detection latencies are low—hundreds or thousands of
chip-wise scans per second were reported to be possible—but an additional memory and
multiplexer must be added to each PLB.
The roving STAR scheme [3] was used as an online testing framework for a timing fault
identification method [37] which proposed configuring logic and routing resources to allow
propagation delays between pairs of PUTs to be compared.
2.4.3 Online (Health Monitoring)
While offline and roving methods configure FPGA resources with temporary testing
circuitry to exercise them, health monitoring schemes use permanent, additional hardware
to monitor the state of the application. Faults of all types can be detected in monitored
hardware, and such techniques offer the lowest possible detection latencies at the expense
of consuming varying amounts of additional area and/or incurring varying levels of
throughput reduction.
Modular redundancy methods involve the duplication of logic for specific operations,
allowing them to be performed, in parallel, more than once. Voting circuitry is used
at the output of the processing logic to detect discrepancies indicative of faults. Triple
modular redundancy (TMR) [38], in which operations are each performed three times, is
particularly popular. Employing modular redundancy allows faults to be detected almost
immediately, and delays added to signal paths by the voting logic are small. Diagnostic
resolution is, however, limited by the scale of the functional block being replicated, since a
fault can only be attributed to a particular instance of it. Resolution can be improved by
employing redundancy multiple times on subsections of a block, however area overheads
increase rapidly: TMR requires over 200% extra area compared to a single instance of
the same functional logic. Tradeoffs between resolution and area overhead are therefore
necessary.
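The voting and diagnosis behaviour described above can be illustrated with a minimal software sketch. The function name `tmr_vote` and the integer outputs are assumptions made purely for illustration; real voters are implemented in logic, not software.

```python
from collections import Counter

def tmr_vote(a, b, c):
    """Return (majority value, index of the dissenting module or None)."""
    outputs = (a, b, c)
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, None  # all three modules agree: no fault observed
    if count == 2:
        # Exactly one module disagrees; the fault can only be attributed
        # to that whole module instance, not to a resource within it.
        dissenter = next(i for i, out in enumerate(outputs) if out != value)
        return value, dissenter
    raise ValueError("no majority: more than one module disagrees")

print(tmr_vote(42, 42, 42))  # (42, None)
print(tmr_vote(42, 17, 42))  # (42, 1)
```

The second call shows the resolution limit noted in the text: the voter identifies module 1 as faulty but gives no information about where within that module the fault lies.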
Recent works sought to lower the area penalties of modular redundancy through the use
of reduced-precision replicated modules [39] [40] and the application of differing numbers
of them—either none, one (for duplicate-with-compare) or two (for TMR)—depending on
real-time upset rates [41].
Work by de Lima et al. [42] aimed to reduce hardware redundancy by introducing time
redundancy: operations are each carried out twice, serially, by the same logic, with the
inputs for every second computation encoded in order to utilise the hardware differently.
Outputs are buffered, decoded and checked for discrepancies. In the sample application
presented, the technique was reported to consume 2.3% less area than an equivalent TMR
approach while introducing 8% extra latency per replicated computation.
Concurrent error detection techniques generally require the addition of less logic than
redundancy methods. Rather than having operations performed multiple times, error cod-
ing information, e.g. parity, is added to data buses, memories, etc., which can then be
verified by testing hardware to detect errors. Such schemes often suffer from confounding
problems: multiple results may have the same error code values, which can mask faults.
A two-rail scheme for combinational logic was introduced [43] to facilitate error detection.
Boolean functions are split into expressions with no more than four inputs each, with these
then mapped to predefined product and/or sum logic cells, each with normal and comple-
mented outputs, implemented in PLBs. Arbitrarily complex designs can be constructed
using such cells, and a two-rail checker cell is provided for testing individual or multiple
logic cells’ outputs: matching logic levels on normal and complementary output pairs in-
dicate faults. Area overheads for this method are high: 76% more area was consumed, on
average, than with direct implementation in PLBs.
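Both the masking problem and the two-rail check described in this paragraph can be sketched in a few lines. The example below assumes a single even-parity bit over a 4-bit word and a hypothetical helper `two_rail_fault`; neither is taken from the cited works.

```python
def parity(word):
    """Even-parity bit of a non-negative integer."""
    return bin(word).count("1") % 2

def two_rail_fault(normal, complemented):
    """A two-rail pair is healthy only when the two rails differ."""
    return normal == complemented

stored = 0b1011
check = parity(stored)             # parity recorded alongside the data

single_flip = stored ^ 0b0100      # one bit corrupted
double_flip = stored ^ 0b0110      # two bits corrupted

print(parity(single_flip) != check)  # True: the error is detected
print(parity(double_flip) != check)  # False: same code value, error masked
print(two_rail_fault(1, 0), two_rail_fault(1, 1))  # False True
```

The double-bit case is an instance of confounding: the corrupted word shares its parity with the original, so the checker cannot distinguish them.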
Methods normally used in offline testing were employed for online testing in work by
Karri et al. [44], in which TPGs and ORAs are added to functional units to be tested.
Clock cycles known to be otherwise unused by those units are then used to apply test
inputs to them, with their outputs checked for invalid results. Detection latencies of a
few milliseconds were reported, but overheads were high: around 25–35% extra area was
consumed, including 50% more registers.
Levine et al. presented a method for the measurement of timing slack in circuit paths
between registers [4], represented diagrammatically in Figure 2.4. For each path under
monitoring (PUM) required, an additional, ‘shadow’ register (S) is added to its terminus in
parallel with the existing register (P), thereby creating two signal paths (SP1 and SP2).
The shadow register is clocked by a phase-shifted version (S CLK) of the system clock
(M CLK), with the phase swept over time: discrepancies between the registers’ outputs
indicate timing faults. Pass/fail data recorded with varying amounts of phase shift can be
used to infer the path’s maximum frequency and, if it exists, the current margin between
its actual and minimal propagation delays: the timing slack. Both an error counter and
first-fail recorder, which latches once the first timing error has been encountered, are
provided. A maximum error in delay measurement of 1.2% was reported, with a 0.28%
negative speed impact incurred by the addition of the monitoring circuitry. An average
PLB overhead of 2.7% was given.
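The way pass/fail data from a phase sweep can bracket timing slack may be easier to see in a toy model. The sketch below is not the circuit from [4]: it assumes the shadow clock leads the main clock, ignores setup and hold times and metastability, and uses hypothetical delay values in picoseconds.

```python
def first_failing_shift(path_delay_ps, clock_period_ps, step_ps):
    """Sweep the shadow clock's lead in fixed steps; return the first shift that fails."""
    for shift in range(0, clock_period_ps + 1, step_ps):
        shadow_sample_time = clock_period_ps - shift  # shadow register samples early
        if shadow_sample_time < path_delay_ps:        # path output not yet settled
            return shift                              # first-fail point
    return None

period, delay, step = 10_000, 7_300, 500  # hypothetical values
fail = first_failing_shift(delay, period, step)
print(fail)                 # 3000
print((fail - step, fail))  # slack lies between these bounds: (2500, 3000)
```

In this model the true slack is 2700 ps, which falls inside the reported bracket; a finer phase step would narrow the bracket at the cost of a longer sweep.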
Current monitoring circuits [45] can be added to designs as suggested by Nicolaidis [46]
in order to detect power usage anomalies indicative of hard faults such as stuck-ats. Usage
of such methods may necessitate periodic slowdown of the application hardware in order
to acquire accurate current readings, however.
2.5 Fault Mitigation
The ability to reconfigure FPGAs at runtime presents opportunities to slow degradation
and/or conceal its effects. The wear-levelling approaches detailed by Stott et al. [5] aim
to improve FPGA reliability by periodically loading new configurations onto the target
device. At design-time, alternative configurations for the application are computed that,
while exhibiting the same functionality, exercise resources differently such that, when
applied alternately at runtime, they aim to mitigate degradation by minimising electrical
hotspots. Examples of the three classes of wear-levelling presented are shown in Figure 2.5.
Alternative mapping involves the inversion of nets within the application, spare resources
involves the swapping of functionality between used and unused PLBs and alternative
placement involves the shifting of functionality between used PLBs. Reductions in timing
degradation of over 20% were reported over the course of accelerated-life hardware tests
equivalent to five years of normal usage.
Figure 2.4: Timing slack measurement method proposed by Levine et al. [4]. Discrepancies
between the registers’ outputs, the secondary register being clocked by a
phase-shifted version of the primary’s, indicate timing faults.
Figure 2.5: Wear-levelling strategies proposed by Stott et al. [5]. Alternative configura-
tions for the application are computed that, while exhibiting the same func-
tionality, exercise resources differently such that, when applied alternately at
runtime, they aim to mitigate degradation by minimising electrical hotspots.
2.6 Error Correction
2.6.1 Compile-time Provisioning
It is possible for systems to respond to fault detection and location information with-
out resorting to often expensive, both in terms of time and computational resources, re-
placement and -routing. Schemes that avoid such computation at runtime generally offer
lower fault tolerance than their fully dynamic counterparts, described in Section 2.6.2.
A repair strategy involving the application of precompiled alternative configurations
for particular sections of an FPGA, referred to as tiles, was presented by Lach et al. [6].
Alternatives are intended to be computed such that they utilise different resources within
the tiles in order that, when a faulty resource is identified, an alternative that does not
rely upon its functionality can be selected for that tile. To achieve this, at least one
resource within each tile must be reserved as a spare, and configurations must be computed
that each avoid the use of at least one of the resources within it. The principle of
operation is exemplified in Figure 2.6: four alternative configurations, each exhibiting
identical functionality and featuring the same and consistently placed inputs (A to D)
and output (Y), are shown, with each of the alternatives designed such that a different
resource within the tile is left unused. Multiple failures within a single tile cannot be
tolerated without yet more spare resource reservation and the design-time computation
and runtime storage of more configurations.
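The selection step of such a scheme can be sketched as a simple lookup. The tile below, with four resources named `r0` to `r3` and four precompiled alternatives, is hypothetical and chosen only to mirror the one-spare-per-alternative arrangement described above.

```python
# Hypothetical tile: each precompiled alternative leaves a different resource unused.
ALTERNATIVES = {
    "config_0": {"r1", "r2", "r3"},  # r0 spare
    "config_1": {"r0", "r2", "r3"},  # r1 spare
    "config_2": {"r0", "r1", "r3"},  # r2 spare
    "config_3": {"r0", "r1", "r2"},  # r3 spare
}

def select_configuration(faulty):
    """Return the first precompiled alternative that avoids every faulty resource."""
    for name, used in ALTERNATIVES.items():
        if used.isdisjoint(faulty):
            return name
    return None  # more faults than the reserved spare can absorb

print(select_configuration({"r2"}))        # config_2
print(select_configuration({"r0", "r3"}))  # None
```

The second call reflects the limitation stated in the text: with one spare per tile, two faults in the same tile cannot be tolerated.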
A fault-tolerant technique based upon cluster shifting in ‘chains’ has also been pre-
sented [7]. Here, functionality in faulty logic is shifted to neighbouring fault-free clusters,
some of which are reserved at design-time, and the strategy for reserved interconnect en-
sures that rerouting is not required and extra delay to data paths is not added following
reconfiguration. An example of this is shown in Figure 2.7: if the cells containing functions
A and E are found to be faulty, all cell functionality is shifted right by one place, with
routing reconfigured to suit the shift, in order to prevent the use of the non-operational
resources. Once the shift has taken place, the rightmost cells—originally reserved as
spares—are then occupied by functions D and H.
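The shift described in this example can be sketched as a list operation on one chain. The representation is an assumption for illustration: a chain is a list of function names with a single spare cell, marked `None`, at its right-hand end.

```python
def shift_chain(chain, faulty_index):
    """Shift functions at and right of the faulty cell one place right into the end spare."""
    if chain[-1] is not None:
        raise ValueError("no spare cell left at the end of the chain")
    repaired = list(chain)
    repaired[faulty_index + 1:] = chain[faulty_index:-1]
    repaired[faulty_index] = None  # the faulty cell is left unused
    return repaired

print(shift_chain(["A", "B", "C", "D", None], 0))  # [None, 'A', 'B', 'C', 'D']
print(shift_chain(["E", "F", "G", "H", None], 0))  # [None, 'E', 'F', 'G', 'H']
```

As in the example above, functions D and H end up in the cells originally reserved as spares. The sketch does not model the reserved interconnect that makes rerouting unnecessary.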
An FPGA architecture drawing inspiration from biology, reported to exhibit self-repair
and -healing properties, has been described [47], with analysis of its reliability presented in
later work [48]. Array elements are analogous to biological cells, each storing the functional
configuration for the entire device, and perform certain binary functions dependent upon
their position. It was proposed that fault combating should be attempted via shifting
of functionality between cells, with circuitry added to allow those found to be faulty to
be bypassed: a minimal level of reconfiguration would therefore be required to overcome
faults. Significant area overhead is introduced by the requirement to store a functional
description for the entire array within each of its elements, however.
Figure 2.6: Visualisation of the repair scheme proposed by Lach et al. [6], in which four
alternative configurations, each exhibiting identical functionality and featuring
the same and consistently placed inputs (A to D) and output (Y), are shown,
with each of the alternatives designed such that a different resource within the
tile is left unused.
2.6.2 Runtime Provisioning
Repair schemes that require placement and/or routing to be performed at runtime, while
dependent upon the availability of often significantly powerful reconfiguration controllers,
offer the most potential for repair since they are amongst the least restricted in terms of
resource utilisation.
Figure 2.7: Visualisation of the repair scheme proposed by Hanchek et al. [7]. Functionality
in faulty logic is shifted to neighbouring fault-free clusters, some of which are
reserved at design-time, and the strategy for reserved interconnect ensures that
rerouting is not required and extra delay to data paths is not added following
reconfiguration.
Computationally expensive and time-consuming chip-wide placement and routing can
be avoided in cases where faults are contained, and repair can be successfully completed,
within a single logic cluster [49]. Although the application of such schemes does not have
drastic effects upon system timing, many faults cannot be tolerated and they are reliant
upon spare resources being available within each cluster for use during repair.
The concept of pebble-shifting [50] was introduced as a means to bypass faulty clusters
while minimising timing degradation by shifting functionality between them, consuming
fault-free spares where necessary, while attempting to keep interconnect lengths between
clusters as short as possible. Average timing degradation of 0.13% after such shifting had
taken place was reported [51].
The roving STAR testing architecture has also been used as the basis for repair [3]. A
combination of in-cluster reconfiguration and pebble shifting, with preference of applica-
tion in that order, is used, and the proposed fault-tolerant techniques also allow for, in
some cases, the use of partially defective resources. A worst-case figure of 6.25% was given
for the proportion of time the system clock must be frozen, halting the application, in
order to test for, diagnose and correct repairable faults.
Faults in PLBs can also be dealt with at the cluster level [8]: functionality in faulty
clusters is moved to fault-free resources, while interconnect faults are corrected via rip-up
and rerouting. A representation of such a repair is presented in Figure 2.8: if the leftmost
cluster is found to be faulty, functionality is moved in order to restore correct operation
while keeping the wire lengths between clusters as short as possible. This is achieved,
in this case, by shifting each of the three functions one place to the right. Configuration
memory readback is exploited to avoid having to store a netlist for the application, however
significant computational resources are required for the compilation of new configurations.
Figure 2.8: Visualisation of the repair scheme proposed by Emmert et al. [8]. Functionality
in faulty clusters is moved to fault-free resources, while interconnect faults are
corrected via rip-up and rerouting.
An evolutionary approach to repair has also been taken [9]: here, pairs of alternative
configurations of the same functional module, taken from a group of competing candidate
configurations, are tested against each other. When particular instances of a module
are found to produce incorrect output, they are randomly mutated and readded to the
group for analysis. Figure 2.9 represents the proposed layout: the positions of the pair
of active configurations are shown, along with the surrounding control logic. Note that
each competing configuration contains circuitry—a discrepancy checker—to compare its
output to its neighbour’s; this is done such that errors within the testing circuitry itself
can be detected and potentially repaired. Although such systems may be able to ‘discover’
novel repair strategies, their random nature makes them non-deterministic: there
is no guarantee of finding suitable repairs within a particular time nor, indeed, that such
repairs even exist.
Figure 2.9: Evolutionary repair scheme proposed by DeMara et al. [9]. When particular
instances of a module are found to produce incorrect output, they are randomly
mutated and readded to a group of competing candidate configurations for
analysis through being tested against each other.
Fault-tolerant architectures based around microprocessor cores residing on FPGAs have
been analysed [52] [53]. In work by Girau et al. [52], an array of 156 simple, identical
cores was proposed, with each core able to detect faults within itself and its neighbours;
functionality within faulty cores is simply moved to fault-free spares. A software framework
allowing faulty hardware peripherals to be replaced with soft-core equivalents has also been
presented [53].
2.7 Algorithm-based Fault Tolerance
By applying fault-tolerant techniques at a level above transistors, gates or small circuits—
at the algorithmic layer—it is possible to produce robust designs capable of detecting the
presence of faults during their normal operation with low impacts upon both area and
performance. A wide range of linear algebra operators, including matrix operations and
Fourier transformations, can be protected with ABFT techniques [54] [55]. ABFT was re-
cently applied to matrix multiplication in FPGAs [56] with promising results: the authors
reported a 99% decrease in design vulnerability at the expense of 25% area overhead.
While ABFT has traditionally been used to protect fixed-point operations, the methods
are compatible with floating-point arithmetic as well. Of relevance to Chapter 6 are the
numerical errors introduced by floating-point operations, which necessitate error bounding
to distinguish them from errors caused by other mechanisms [57]. Recent work [58]
sought to lower the required bounds in a graphics processing unit (GPU)-accelerated
floating-point benchmark by analysing input data prior to each computation.
2.7.1 Matrix Encoding & Decoding
Any m × n data matrix, D, can be augmented with distance (d) additional rows of
column-wise checksums to produce an (m + d) × n column checksum-encoded matrix, Dc. This is
achieved by performing Dc = GcD, where generator matrix Gc is constructed as shown
in Equation 2.1. I is the identity matrix.
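\[
G_c =
\begin{bmatrix}
I_{m \times m} \\[4pt]
\begin{matrix}
2^0 & 2^0 & \cdots & 2^0 \\
2^0 & 2^1 & \cdots & 2^{m-1} \\
\vdots & \vdots & \ddots & \vdots \\
2^0 & 2^{d-1} & \cdots & 2^{(d-1)(m-1)}
\end{matrix}
\end{bmatrix}
\tag{2.1}
\]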
The final d rows of Gc are linearly independent [59]; thus, they represent a distance-
(d+ 1) code and are consequently capable of facilitating the detection of at least d errors
per column. An example of Gc’s application, in which m = n = d = 2, is given in
Equation 2.2.
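\[
D_c = G_c D =
\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \\ 1 & 2 \end{bmatrix}
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
=
\begin{bmatrix}
1 & 2 \\
3 & 4 \\
\textcolor{red}{4} & \textcolor{red}{6} \\
\textcolor{red}{7} & \textcolor{red}{10}
\end{bmatrix}
\tag{2.2}
\]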
Note that D itself is a sub-matrix within Dc, occupying the uppermost m× n elements.
The added column-wise checksums are shown in red.
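The construction of Gc and the encoding in Equation 2.2 can be reproduced with a short sketch. It uses plain Python lists of rows; the helper names `matmul` and `column_generator` are illustrative assumptions, not part of the original work.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def column_generator(m, d):
    """Build the (m + d) x m matrix Gc: identity on top, d weighted checksum rows below."""
    identity = [[int(i == j) for j in range(m)] for i in range(m)]
    checksum_rows = [[2 ** (k * j) for j in range(m)] for k in range(d)]
    return identity + checksum_rows

D = [[1, 2], [3, 4]]
Gc = column_generator(m=2, d=2)
Dc = matmul(Gc, D)
print(Gc)  # [[1, 0], [0, 1], [1, 1], [1, 2]]
print(Dc)  # [[1, 2], [3, 4], [4, 6], [7, 10]]
```

The first checksum row is an unweighted column sum, while each subsequent row weights the data rows by increasing powers of two.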
Row- rather than column-wise checksums can be added to a data matrix by performing
the complementary operation, Dr = DGr, where generator matrix Gr is as shown in
Equation 2.3.
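\[
G_r =
\begin{bmatrix}
I_{n \times n} &
\begin{matrix}
2^0 & 2^0 & \cdots & 2^0 \\
2^0 & 2^1 & \cdots & 2^{d-1} \\
\vdots & \vdots & \ddots & \vdots \\
2^0 & 2^{n-1} & \cdots & 2^{(n-1)(d-1)}
\end{matrix}
\end{bmatrix}
\tag{2.3}
\]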
Gr’s application to data matrix D will lead to the addition of d columns of row-wise
checksums, producing an m × (n + d) row checksum-encoded matrix Dr. The linear
independence property of the final d rows of Gc also applies to the final d columns of
Gr. Equation 2.4 contains an example of Gr’s application in which m = n = d = 2.
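\[
D_r = D G_r =
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 2 \end{bmatrix}
=
\begin{bmatrix}
1 & 2 & \textcolor{green}{3} & \textcolor{green}{5} \\
3 & 4 & \textcolor{green}{7} & \textcolor{green}{11}
\end{bmatrix}
\tag{2.4}
\]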
Row-wise checksums are shown in green. In parallel to the application of Gc, note that
D itself occupies the leftmost m× n elements of Dr.
An (m + d) × (n + d) full checksum-encoded matrix Df can be formed by performing
column and row checksum generation simultaneously: Df = GcDGr. For the same data
matrix D used in Equations 2.2 and 2.4 with d = 2, the corresponding Df is as shown in
Equation 2.5.
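\[
D_f = G_c D G_r =
\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \\ 1 & 2 \end{bmatrix}
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 2 \end{bmatrix}
=
\begin{bmatrix}
1 & 2 & \textcolor{green}{3} & \textcolor{green}{5} \\
3 & 4 & \textcolor{green}{7} & \textcolor{green}{11} \\
\textcolor{red}{4} & \textcolor{red}{6} & \textcolor{yellow}{10} & \textcolor{yellow}{16} \\
\textcolor{red}{7} & \textcolor{red}{10} & \textcolor{yellow}{17} & \textcolor{yellow}{27}
\end{bmatrix}
\tag{2.5}
\]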
Note that the elements of D appear as before, and that column- and row-wise checksums
present in Df are identical to those obtained in Equations 2.2 and 2.4, respectively. The
additional elements, shown in yellow, are both column- and row-wise checksums since they
can be formed from elements in either dimension.
Decoding a checksum-encoded matrix of any type is a trivial process, requiring simply
stripping off the final d rows (for a column checksum-encoded matrix), columns (for row)
or both (for full) to leave the data sub-matrix only.
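Full encoding followed by decoding can be sketched as a round trip. The helper names below (`matmul`, `identity`, `weights`, `encode_full`, `decode`) are illustrative assumptions; the matrices follow the definitions of Gc and Gr given in Equations 2.1 and 2.3.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def identity(size):
    return [[int(i == j) for j in range(size)] for i in range(size)]

def weights(size, d):
    """d rows of checksum weights, with entry [k][j] equal to 2**(k*j)."""
    return [[2 ** (k * j) for j in range(size)] for k in range(d)]

def encode_full(D, d):
    """Form Df = Gc D Gr for an m x n data matrix D."""
    m, n = len(D), len(D[0])
    Gc = identity(m) + weights(m, d)                             # (m + d) x m
    Gr = [list(col) for col in zip(*(identity(n) + weights(n, d)))]  # n x (n + d)
    return matmul(matmul(Gc, D), Gr)

def decode(E, d_rows, d_cols):
    """Strip the final d_rows rows and d_cols columns to recover the data sub-matrix."""
    return [row[:len(row) - d_cols] for row in E[:len(E) - d_rows]]

D = [[1, 2], [3, 4]]
Df = encode_full(D, d=2)
print(Df)                # [[1, 2, 3, 5], [3, 4, 7, 11], [4, 6, 10, 16], [7, 10, 17, 27]]
print(decode(Df, 2, 2))  # [[1, 2], [3, 4]]
```

Passing `d_cols=0` or `d_rows=0` to `decode` handles column and row checksum-encoded matrices respectively.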
Following storage, transmission or computation that preserves the form of checksum-
encoded matrices, integrity can be verified by comparing the checksum elements within a
checksum-encoded matrix to those produced from the data elements. Non-zero differences
are indicative of the presence, locations and magnitudes of errors within checksum-encoded
matrices. Column and row checksum-encoded matrices, respectively, can be multiplied by
verification matrices Vc and Vr, shown in Equations 2.6 and 2.7, to produce discrepancy
matrices ∆c and ∆r for this purpose.
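\[
V^{\mathrm{c}} =
\left(\begin{array}{ccccc}
2^0 & 2^0 & \cdots & 2^0 & \\
2^0 & 2^1 & \cdots & 2^{m-1} & \\
\vdots & \vdots & \ddots & \vdots & -I_{d \times d} \\
2^0 & 2^{d-1} & \cdots & 2^{(d-1)(m-1)} &
\end{array}\right)
\tag{2.6}
\]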
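\[
V^{\mathrm{r}} =
\left(\begin{array}{cccc}
2^0 & 2^0 & \cdots & 2^0 \\
2^0 & 2^1 & \cdots & 2^{d-1} \\
\vdots & \vdots & \ddots & \vdots \\
2^0 & 2^{n-1} & \cdots & 2^{(n-1)(d-1)} \\
\multicolumn{4}{c}{-I_{d \times d}}
\end{array}\right)
\tag{2.7}
\]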
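∆c is produced by performing ∆c = VcDc, yielding a d × n discrepancy matrix, while
performing ∆r = DrVr produces an m × d discrepancy matrix ∆r. Equations 2.8 and 2.9
respectively show the verification process for the checksum-encoded matrices Dc and Dr
obtained in Equations 2.2 and 2.4.
\[
\Delta^{\mathrm{c}} = V^{\mathrm{c}} D^{\mathrm{c}} =
\begin{pmatrix} 1 & 1 & -1 & 0 \\ 1 & 2 & 0 & -1 \end{pmatrix}
\begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 4 & 6 \\ 7 & 10 \end{pmatrix}
=
\begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}
\tag{2.8}
\]
\[
\Delta^{\mathrm{r}} = D^{\mathrm{r}} V^{\mathrm{r}} =
\begin{pmatrix} 1 & 2 & 3 & 5 \\ 3 & 4 & 7 & 11 \end{pmatrix}
\begin{pmatrix} 1 & 1 \\ 1 & 2 \\ -1 & 0 \\ 0 & -1 \end{pmatrix}
=
\begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}
\tag{2.9}
\]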
Verification of a full checksum-encoded matrix is done by considering column- and row-
wise checksums independently, producing two discrepancy matrices: column-wise ∆f, c
and row-wise ∆f, r. For the full checksum-encoded matrix obtained in Equation 2.5, ∆f, c
and ∆f, r are calculated as shown in Equations 2.10 and 2.11, respectively.
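\[
\Delta^{\mathrm{f,c}} = V^{\mathrm{c}} D^{\mathrm{f}} =
\begin{pmatrix} 1 & 1 & -1 & 0 \\ 1 & 2 & 0 & -1 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 & 5 \\ 3 & 4 & 7 & 11 \\ 4 & 6 & 10 & 16 \\ 7 & 10 & 17 & 27 \end{pmatrix}
=
\begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}
\tag{2.10}
\]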
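\[
\Delta^{\mathrm{f,r}} = D^{\mathrm{f}} V^{\mathrm{r}} =
\begin{pmatrix} 1 & 2 & 3 & 5 \\ 3 & 4 & 7 & 11 \\ 4 & 6 & 10 & 16 \\ 7 & 10 & 17 & 27 \end{pmatrix}
\begin{pmatrix} 1 & 1 \\ 1 & 2 \\ -1 & 0 \\ 0 & -1 \end{pmatrix}
=
\begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}
\tag{2.11}
\]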
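The verification step can be sketched in the same style. The code below is illustrative rather than the thesis's implementation: the helper names are hypothetical, and the injected error (data element in row 0, column 1, increased by one) is an assumed example chosen to show how non-zero discrepancies arise.

```python
def matmul(a, b):
    """Plain integer matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_verifier(m, d):
    """d x (m + d) matrix [W | -I] with W[k][i] = 2**(k * i)."""
    return [[2 ** (k * i) for i in range(m)] + [-int(k == j) for j in range(d)]
            for k in range(d)]


def row_verifier(n, d):
    """(n + d) x d matrix formed by stacking W (W[i][k] = 2**(i * k)) on -I."""
    top = [[2 ** (i * k) for k in range(d)] for i in range(n)]
    bottom = [[-int(k == j) for j in range(d)] for k in range(d)]
    return top + bottom


D_f = [[1, 2, 3, 5], [3, 4, 7, 11], [4, 6, 10, 16], [7, 10, 17, 27]]
V_c, V_r = column_verifier(2, 2), row_verifier(2, 2)

# Error-free case: both discrepancy matrices are all zero.
clean_c = matmul(V_c, D_f)
clean_r = matmul(D_f, V_r)

# Assumed fault: data element (0, 1) is one higher than it should be.
faulty = [row[:] for row in D_f]
faulty[0][1] += 1
delta_c = matmul(V_c, faulty)
delta_r = matmul(faulty, V_r)
print(delta_c)  # [[0, 1, 0, 0], [0, 1, 0, 0]]
print(delta_r)  # [[1, 2], [0, 0], [0, 0], [0, 0]]
```

In this example the non-zero column of `delta_c` and the non-zero row of `delta_r` intersect at the corrupted element, and the first-row entries give the error magnitude.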
2.7.2 Application to Arithmetic Operations
Matrix-matrix Multiplication
Assuming dimensional compatibility, the multiplication of a column checksum-encoded
matrix Ac by a row checksum-encoded matrix Br will yield a full checksum-encoded
matrix Cf. As an example, consider the matrix-matrix multiplication in Equation 2.12.


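\[
\begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 4 & 6 \\ 7 & 10 \end{pmatrix}
\begin{pmatrix} 5 & 6 & 11 & 17 \\ 7 & 8 & 15 & 23 \end{pmatrix}
=
\begin{pmatrix} 19 & 22 & 41 & 63 \\ 43 & 50 & 93 & 143 \\ 62 & 72 & 134 & 206 \\ 105 & 122 & 227 & 349 \end{pmatrix}
\tag{2.12}
\]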
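The property can be checked numerically. The sketch below, with hypothetical helper names, encodes A and B separately, multiplies the encoded matrices and compares the product with the result of encoding C = AB directly; the weights 2**(i * k) follow the pattern of the generator matrices in Section 2.7.1.

```python
def matmul(a, b):
    """Plain integer matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_column_checksums(a, d):
    """Append d weighted column-checksum rows; row k uses weights 2**(k * i)."""
    extra = [[sum(2 ** (k * i) * a[i][j] for i in range(len(a)))
              for j in range(len(a[0]))] for k in range(d)]
    return [row[:] for row in a] + extra


def add_row_checksums(b, d):
    """Append d weighted row-checksum columns; column k uses weights 2**(j * k)."""
    return [row[:] + [sum(2 ** (j * k) * row[j] for j in range(len(row)))
                      for k in range(d)] for row in b]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
A_c = add_column_checksums(A, 2)
B_r = add_row_checksums(B, 2)

C_f = matmul(A_c, B_r)
expected = add_row_checksums(add_column_checksums(matmul(A, B), 2), 2)
print(C_f == expected)  # True
```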
Matrix Addition
The addition of two encoded matrices of identical type will produce a result of the same
form. For example, consider the matrix addition shown in Equation 2.13.


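\[
\begin{pmatrix} 1 & 2 & 3 & 5 \\ 3 & 4 & 7 & 11 \\ 4 & 6 & 10 & 16 \\ 7 & 10 & 17 & 27 \end{pmatrix}
+
\begin{pmatrix} 5 & 6 & 11 & 17 \\ 7 & 8 & 15 & 23 \\ 12 & 14 & 26 & 40 \\ 19 & 22 & 41 & 63 \end{pmatrix}
=
\begin{pmatrix} 6 & 8 & 14 & 22 \\ 10 & 12 & 22 & 34 \\ 16 & 20 & 36 & 56 \\ 26 & 32 & 58 & 90 \end{pmatrix}
\tag{2.13}
\]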
Matrix-vector Multiplication
The multiplication of a column checksum-encoded matrix Ac by a column vector b will
produce a checksum-encoded column vector cc. Note that the multiplication of a row
vector by a row checksum-encoded matrix will produce a checksum-encoded row vector.
For example, consider the matrix-vector multiplication shown in Equation 2.14.


\[
\begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 4 & 6 \\ 7 & 10 \end{pmatrix}
\begin{pmatrix} 5 \\ 6 \end{pmatrix}
=
\begin{pmatrix} 17 \\ 39 \\ 56 \\ 95 \end{pmatrix}
\tag{2.14}
\]
Matrix-scalar Multiplication
The scalar multiplication of any type of encoded matrix will yield a result of identical
form. Consider Equation 2.15 as an example.
\[
5
\begin{pmatrix} 1 & 2 & 3 & 5 \\ 3 & 4 & 7 & 11 \\ 4 & 6 & 10 & 16 \\ 7 & 10 & 17 & 27 \end{pmatrix}
=
\begin{pmatrix} 5 & 10 & 15 & 25 \\ 15 & 20 & 35 & 55 \\ 20 & 30 & 50 & 80 \\ 35 & 50 & 85 & 135 \end{pmatrix}
\tag{2.15}
\]
LU Decomposition
Any matrix A decomposable into lower- and upper-triangular (LU) matrices L and U
can be decomposed into checksum-encoded matrices Lc (with column-wise checksums)
and U r (row-wise) with identical information content if A is first transformed into its full
checksum-encoded equivalent Af. As an example, consider the LU decomposition shown
in Equation 2.16.


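\[
\begin{pmatrix} 4 & 5 & 9 & 14 \\ 8 & 28 & 36 & 64 \\ 12 & 33 & 45 & 78 \\ 20 & 61 & 81 & 142 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 2 & 3 \\ 3 & 3 \\ 5 & 6 \end{pmatrix}
\begin{pmatrix} 4 & 5 & 9 & 14 \\ 0 & 6 & 6 & 12 \end{pmatrix}
\tag{2.16}
\]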
Transposition
Checksums of all types are preserved through the transposition of a matrix. A full
checksum-encoded matrix will remain as such, while a row or column checksum-encoded
matrix will become the opposite type. For example, consider the transposition shown in
Equation 2.17.


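\[
\begin{pmatrix} 1 & 2 & 3 & 5 \\ 3 & 4 & 7 & 11 \\ 4 & 6 & 10 & 16 \\ 7 & 10 & 17 & 27 \end{pmatrix}^{\mathrm{T}}
=
\begin{pmatrix} 1 & 3 & 4 & 7 \\ 2 & 4 & 6 & 10 \\ 3 & 7 & 10 & 17 \\ 5 & 11 & 16 & 27 \end{pmatrix}
\tag{2.17}
\]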
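The closure properties stated for addition, scalar multiplication and transposition can be checked with a short sketch that uses the matrices from Equations 2.13, 2.15 and 2.17. The checker function is a hypothetical helper written for this illustration; it assumes the weighted checksums of Section 2.7.1.

```python
def is_full_checksum_encoded(mat, d):
    """Check every weighted row and column checksum of a full checksum-encoded matrix."""
    rows, cols = len(mat) - d, len(mat[0]) - d
    rows_ok = all(sum(2 ** (j * k) * r[j] for j in range(cols)) == r[cols + k]
                  for r in mat for k in range(d))
    cols_ok = all(sum(2 ** (i * k) * mat[i][j] for i in range(rows)) == mat[rows + k][j]
                  for j in range(len(mat[0])) for k in range(d))
    return rows_ok and cols_ok


X = [[1, 2, 3, 5], [3, 4, 7, 11], [4, 6, 10, 16], [7, 10, 17, 27]]
Y = [[5, 6, 11, 17], [7, 8, 15, 23], [12, 14, 26, 40], [19, 22, 41, 63]]

total = [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]   # Equation 2.13
scaled = [[5 * x for x in row] for row in X]                        # Equation 2.15
transposed = [list(col) for col in zip(*X)]                         # Equation 2.17

print(all(is_full_checksum_encoded(m, 2) for m in (X, Y, total, scaled, transposed)))  # True
```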
Linear Filtering
The matrix-vector multiplication result shown in Section 2.7.2 is of particular significance since any one-dimensional
linear filter—finite impulse response (FIR), infinite impulse response (IIR), discrete Fourier
transform (DFT) (including fast Fourier transform (FFT)), etc.—can be represented in
state-space form as a matrix-vector multiplication [54].
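As a simple illustration of a filter expressed as a matrix-vector multiplication, the sketch below uses a convolution (Toeplitz) matrix for a causal FIR filter. This is a simpler representation than the state-space form cited above and is swapped in purely for brevity; the tap values, input samples and function names are assumptions made for this example. A single column-checksum row (distance 2) is appended to the filter matrix.

```python
def convolution_matrix(taps, length):
    """Matrix H such that y = H x applies a causal FIR filter to `length` samples."""
    return [[taps[i - j] if 0 <= i - j < len(taps) else 0 for j in range(length)]
            for i in range(length)]


def matvec(mat, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


taps = [1, 2, 1]       # hypothetical filter coefficients
x = [3, 0, -1, 4]      # hypothetical input samples

H = convolution_matrix(taps, len(x))
H_c = H + [[sum(col) for col in zip(*H)]]   # append one column-checksum row

y_c = matvec(H_c, x)
y, checksum = y_c[:-1], y_c[-1]
print(y, checksum)  # [3, 6, 2, 2] 13
```

The final element of the encoded output equals the sum of the filter outputs, so a mismatch between the two would indicate an error in the computation.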
2.7.3 Result Classification
Once operations have been performed upon checksum-encoded matrices, the positions of
incorrectly computed elements allow the classification of each result into one of a number
of categories. Consider the outcomes for operations performed that result in the generation
of distance-2 full checksum-encoded 2 × 2 matrices shown in Equation 2.18.


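\[
\begin{pmatrix} \checkmark & \checkmark & \checkmark \\ \checkmark & \checkmark & \checkmark \\ \checkmark & \checkmark & \checkmark \end{pmatrix},
\begin{pmatrix} \checkmark & \checkmark & \checkmark \\ \checkmark & \checkmark & \checkmark \\ \times & \checkmark & \checkmark \end{pmatrix},
\begin{pmatrix} \times & \checkmark & \checkmark \\ \checkmark & \checkmark & \checkmark \\ \checkmark & \checkmark & \checkmark \end{pmatrix},
\begin{pmatrix} \times & \checkmark & \times \\ \checkmark & \checkmark & \checkmark \\ \times & \checkmark & \times \end{pmatrix}
\tag{2.18}
\]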
Assuming that one or more faults occurred along the datapath that generated each result,
they respectively represent examples of masked, false positive, false negative and detected
faults. Note that the same classification holds for all other checksumming types and
distances discussed in Section 2.7.1.
Masked faults are those that have no observable external effect. The addition of two
even integers with an adder whose least-significant output bit is stuck at zero represents
a trivial example of such a result. False positives are those that affect only checksum
elements, leaving information unaffected. Faults that manifest within circuitry used solely
for checksumming, not within the associated datapath, are responsible for their
occurrence. While undesirable since they will force unnecessary corrective action, such results
are safe since their information elements remain valid. False negatives, on the other hand,
are unsafe since they represent results in which incorrectly computed information elements
are undetectable due to checksums being valid. Faults occurring along datapath logic are
always responsible for false negative result occurrence. In the case of a single distance-2
checksum, a single information element of value one higher than expected, along with
equal checksum discrepancies, represents a simple numerical example of a false negative
result. Faults are considered to be successfully detected when at least one information
element, and at least one checksum, are invalid.
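The four categories can be expressed as a small decision function. This is a sketch under stated assumptions: each result is represented as a grid of booleans mirroring the marks of Equation 2.18 (True for a correctly computed element), a fault is assumed to have occurred in every case, and the function name is hypothetical.

```python
def classify(marks, d=1):
    """Classify a result; the final d rows and columns of `marks` hold checksums."""
    rows, cols = len(marks) - d, len(marks[0]) - d
    info_bad = any(not marks[i][j] for i in range(rows) for j in range(cols))
    check_bad = any(not marks[i][j]
                    for i in range(len(marks)) for j in range(len(marks[0]))
                    if i >= rows or j >= cols)
    if info_bad and check_bad:
        return "detected"
    if info_bad:
        return "false negative"
    if check_bad:
        return "false positive"
    return "masked"


T, F = True, False
results = [
    [[T, T, T], [T, T, T], [T, T, T]],
    [[T, T, T], [T, T, T], [F, T, T]],
    [[F, T, T], [T, T, T], [T, T, T]],
    [[F, T, F], [T, T, T], [F, T, F]],
]
print([classify(r) for r in results])
# ['masked', 'false positive', 'false negative', 'detected']
```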
2.8 Conclusion
This chapter presented an overview of the current state of the art in relation to fault toler-
ance, with discussion focussing upon techniques applicable to, and developed for, FPGAs.
Distinctions between various types of faults and the errors they cause were presented first,
with discussion of degradation following. Methods for fault detection (both offline and
online) were presented, with fault mitigation and correction (methods both involving and
avoiding re-placement and re-routing at runtime) discussed afterwards. The final section
focussed upon ABFT, an online fault tolerance technique tailored to particular algorithms.
The mathematical prerequisites for ABFT were given, followed by examples of its
application to various linear algebraic operators. Finally, a method of classifying results obtained
from ABFT-protected operations was presented.
As a result of this background research, ABFT was identified as a promising area for
further exploration, with the high-level aim of keeping overheads (in terms of area,
performance and power) low while keeping sensitivity to faults, and the ability to react to them,
high. To validate this decision, Table 2.1 places ABFT within a side-by-side comparison
of competing families of fault detection strategies, comparing their respective abilities to
detect certain types and proportions of potential faults. ABFT can be seen to compare
favourably to its competitors by striking a balance between these factors that the others
cannot: mixing high fault coverage with low detection latency while keeping application
overheads—both in terms of area and latency—relatively low at the cost of being tailored
to a limited number of specific algorithms.
Method              Fault type(s)  Fault          Detection      Application overheads        Other
                    targeted       coverage       latency        Area           Latency       limitations
BIST                Permanent      Moderate–high  High           None           None          Requires downtime
Roving scans        Permanent      High           High           Low–moderate   Low
Modular redundancy  Both           High           Low            High           Low
Re-execution        Transient      Low–moderate   Moderate       Moderate       High
Concurrency         Both           Low            Low            Low            Low           Datapath-unsuitable
Cycle-stealing      Permanent      Moderate–high  Low–moderate   Moderate–high  Low
ABFT                Both           High           Low–moderate   Low–moderate   Low           Algorithm-specific
Table 2.1: Comparison of fault detection methods, showing ABFT to compare favourably
to its competitors by striking a balance between factors that the others cannot:
mixing high fault coverage with low detection latency while keeping application
overheads—both in terms of area and latency—relatively low at the cost of
being tailored to a limited number of specific algorithms.
3 Algorithm-tailored Low-overhead
Online Error Detection
3.1 Introduction
In order to assess the impacts of applying algorithm-based fault tolerance (ABFT) to
a benchmark accelerator, an application circuit to which ABFT could be applied was
required. Matrix multiplication was chosen as the case study for the work described
in this thesis since the operator is used in many hardware-accelerated applications and
because the adaptation of its operation for ABFT is straightforward. This chapter details
the design, implementation and evaluation of an ABFT-protected matrix multiplication
accelerator running in hardware on a field-programmable gate array (FPGA).
The findings herein include that, for the largest-implemented accelerator tested, datap-
ath fault detectability in excess of 99% was achieved in return for area and performance
overheads of 7.87% and 45.5%, respectively, demonstrating that the achievement of high
fault observability does not necessitate the incursion of, in particular, huge area overheads.
Note that the focus of this chapter is intended to be upon design and implementation.
With regard to observability testing, in particular, analysis is relatively brief and focusses
upon permanent faults only; a far more detailed analysis of this operator’s robustness,
including against transient faults, is presented in Chapter 5.
3.1.1 Contributions
The original contributions of the work presented in this chapter are:
• The implementation of a complete hardware-software platform for the verification
of ABFT protection of a benchmark matrix multiplication accelerator.
• A quantitative analysis of the overheads—in terms of area and performance—
incurred through the incorporation of ABFT protection within that benchmark
circuit.
• Insight into the hardware fault tolerance of ABFT upon that benchmark.
3.1.2 Publications
The work presented in this chapter has been peer-reviewed and appeared in the 2013 pro-
ceedings of the International Conference on Field-programmable Technology (FPT) [60].
3.1.3 Outline
The remainder of this chapter is organised as follows. Section 3.2 details the development
of, primarily, hardware needed to evaluate the fault-hardening of an application circuit.
Section 3.2.1 describes the platform, while Section 3.2.2 focusses upon the design and
operation of the chosen matrix multiplication benchmark. Section 3.2.3 explains the steps
taken to modify the benchmark to achieve fault tolerance, with Sections 3.2.4 and 3.2.5
giving implementational details concerning fault location inference and fault injection,
respectively. Section 3.3 deals with the encountered overheads for the fault-tolerant design
over its unprotected equivalent, with Section 3.3.1 focussing upon area and Section 3.3.2
upon performance. An analysis of the hardened circuit’s fault observability is presented
in Section 3.4, and concluding remarks are given in Section 3.5.
3.2 Implementation
A platform was required upon which custom hardware could be implemented and tested
quickly. Particularly for the testability requirement, as well as for the potential of realising
hybrid hardware-software solutions to fault tolerance in the future due to the availability
of hard central processing unit (CPU) cores, a relatively new system-on-chip (SoC) was
used: a member of the Xilinx Zynq [61] family. The baseline or reference accelerator was
first to be designed, with hardware operation confirmed using one of the available CPUs
for oversight. Following this, ABFT logic was added to the baseline design, and the CPU
was further used for targeted error injection to exercise the error detection circuitry and
ensure accurate fault location.
3.2.1 Hardware-software Platform
Figure 3.1 gives a high-level overview of the developed platform. All boxed components
shown are contained within a Xilinx Zynq-7000 XC7Z020 SoC [62].
[Block diagram with labels: DRAM, ARM core, DRAM controller, PS, PL, interrupt controller, AXI4-Lite interface, AXI4 interface, DMA controller, memory controller (two), accelerator.]
Figure 3.1: Top-level system block diagram. All boxed components shown are contained
within a Xilinx Zynq-7000 XC7Z020 SoC [62], with a custom-designed matrix
multiplication accelerator wrapped by several Xilinx IP blocks on the PL side
of the device.
The Zynq SoC is split into two distinct halves: the processor subsystem (PS), housing
a pair of hard ARM Cortex-A9 CPU cores, and programmable logic (PL): a modestly sized
FPGA with 53,200 look-up tables (LUTs). Many available hard peripherals on the device,
such as memory and high-speed
input/output (I/O) controllers, are multiplexed so as to be accessible to either the ARM
cores or custom logic, while others are tied directly to the PS or PL. High-speed config-
urable interconnect is provided for facilitating PS-PL communication.
Throughout the hardware development and testing described in this thesis, a single CPU
core was used in ‘bare-metal’ (without operating system) fashion, primarily as a controller
for the FPGA-implemented logic but also for test vector generation, result verification and
latency measurement. The hard dynamic random-access memory (DRAM) controller on
the PS side was configured in order to allow fast shared memory access by both the ARM
core and soft logic.
Several Xilinx IP blocks—a direct memory access (DMA) controller [63], two mem-
ory controllers [64] and interfacing logic to service data transfers [65] and interrupts [66]
across the PS-PL boundary—sit to control and feed data into and out of a custom-designed
matrix multiplication accelerator, described in Section 3.2.2. Software-accessible control
registers within the accelerator are connected to the CPU with a low-bandwidth Ad-
vanced eXtensible Interface (AXI)-Lite bus, while data transfers are achieved via a high-
bandwidth AXI bus operated in burst mode. Interrupts are triggered by status register
changes within the accelerator.
3.2.2 Baseline Architecture
The architecture developed for accelerating matrix multiplication is shown in Figure 3.2.
It, along with all other hardware developed as part of the work presented in this thesis, was
written in platform-independent very high-speed integrated circuit hardware description
language (VHDL). The following parameters are customisable:
• Square matrix size (s).
• Data width (n) (bits).
• Data memory resource type.
• Multiplier resource type.
• Multiplier latency (m) (cycles).
• Accumulator latency (a) (cycles).
Throughout the hardware development conducted as described in this thesis, matrices are
always square with dimensions s × s; however, this is not a requirement imposed by the
fault tolerance methods presented.
[Diagram: input RAM (2s × ns), an ns-bit buffer and s : 1 multiplexer selecting n-bit elements of A, s multiply-accumulators fed by ns-bit rows of B, and an output RAM (s × ns) storing rows of C.]
Figure 3.2: Baseline datapath for matrix multiplication. A full matrix row’s contingent of
MACs are used to create an efficient inner loop-unrolled architecture.
Multiplier pipelining is achieved automatically via retiming of register chains instan-
tiated on the multipliers’ outputs by the computer-aided design (CAD) tools used for
compilation; registers are ‘pushed backwards’ through each multiplier to balance combina-
torial logic latency between stages, increasing timing model-inferred maximum operating
frequency (fmax), without affecting cycle latency. Accumulator pipelining must be imple-
mented more explicitly since feedback is required. To allow accumulator pipelining, the
architecture shown in Figure 3.3 was developed, allowing latency to be increased from 1
to a and maximum adder widths to be reduced from n to ⌈n/a⌉. Adder widths are chosen
to be as close to optimal (i.e. n/a) as possible, with wider adders used first, if necessary.
From all but the final stage, an overflow (bit ⌈n/a⌉) signal is tapped off the adder to feed
into the subsequent stage. A shift register—not shown in Figure 3.3—bubbles reset signals
from input to output, allowing consecutive accumulations to occur with only a single cycle
used for reset, regardless of the value of a. The finite state machine (FSM) controlling the
datapath (not shown in Figure 3.2) adjusts itself to accommodate changes in m and a.
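The arithmetic performed by the pipelined accumulator can be modelled in software. The sketch below is a functional model only, not a cycle-accurate one: it assumes the n bits are divided as evenly as possible with wider adders used first, as described above, and the function names are hypothetical. It shows that chaining narrower adders through their overflow bits gives the same result as a single n-bit addition modulo 2^n.

```python
def segment_widths(n, a):
    """Split n bits over a adders as evenly as possible, wider adders first."""
    base, extra = divmod(n, a)
    return [base + 1] * extra + [base] * (a - extra)


def segmented_add(x, y, n, a):
    """Add two n-bit values stage by stage, passing a carry between stages."""
    result, carry, shift = 0, 0, 0
    for width in segment_widths(n, a):
        mask = (1 << width) - 1
        partial = ((x >> shift) & mask) + ((y >> shift) & mask) + carry
        result |= (partial & mask) << shift
        carry = partial >> width   # overflow bit tapped off for the next stage
        shift += width
    return result                  # the final carry is dropped: result is modulo 2**n


print(segment_widths(32, 3))                   # [11, 11, 10]
print(segmented_add(0xFFFFFFFF, 1, 32, 4))     # 0
```

In hardware each stage would complete in a different cycle; here the stages are simply evaluated in order.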
Before the accelerator runs, input data is transferred, triggered by a command from the
CPU, from DRAM to FPGA fabric random-access memory (RAM). FPGA RAM words
represent full matrix rows (ns bits each) to increase data parallelism: for the largest-
implemented matrix size (s = 32), with n = 32 as mentioned previously, 1024-bit words
were used. A single RAM was used on the input side rather than two (one per input matrix)
since the increased memory transfer times were found to outweigh potential speedup. The
input RAM therefore consists of 2s ns-bit words, while the output RAM has s ns-bit words.
To allow connection to the memory controllers for PS access, the RAMs are asymmetric:
regardless of n or s, memory access ports presented to the external memory controllers
are always 64-bit to match the maximum width of the high-performance AXI interconnect
available on the target device [67]. To achieve this, address decoding drives either byte
write enables (for the write port of the dual-ported input RAM) or read data multiplexing
(for the read port of the dual-ported output RAM).
With reference to Figure 3.2 and the accompanying pseudocode, Algorithm 1, computation
proceeds as follows once an input data transfer has completed. Once the first row of
matrix A is fetched (Line 2) and buffered into a register (Line 3) and the
multiply-accumulators (MACs) are reset (Line 4), each row of matrix B is fetched from RAM in
turn (Line 7) and presented to a full row’s contingent of MACs (Lines 8 to 11), along with
the corresponding element of A (Line 6), for computation. Once the final row of B has
been consumed, the computation of C’s first row is complete: it is stored into the output
RAM (Line 13) and the process is repeated for the remaining rows of A and C. Once a
multiplication has been completed in its entirety, an interrupt occurs, triggering a second
DMA transfer to copy data back from the FPGA’s RAM to DRAM.
[Diagram: a chain of a adder stages with registers between them. The n-bit input is divided into slices of up to ⌈n/a⌉ bits (bits ⌈n/a⌉ − 1 : 0, 2⌈n/a⌉ − 1 : ⌈n/a⌉, ..., n − 1 : (a − 1)⌈n/a⌉), and overflow bit ⌈n/a⌉ of each stage but the last feeds the next stage to form the n-bit output.]
Figure 3.3: Pipelined accumulator, in which a adders each up to ⌈n/a⌉ bits wide are used to achieve pipelining at the expense of increased
latency.
Algorithm 1 Baseline matrix multiplication
 1: for i = 0 to s − 1 do
 2:   fetch A[i]
 3:   buffer A[i]
 4:   C[i] ← (0 0 · · · 0)
 5:   for j = 0 to s − 1 do
 6:     select A[i][j]
 7:     fetch B[j]
 8:     C[i][0] ← (C[i][0] + A[i][j] × B[j][0]) mod 2^n
 9:     C[i][1] ← (C[i][1] + A[i][j] × B[j][1]) mod 2^n
10:     · · ·
11:     C[i][s − 1] ← (C[i][s − 1] + A[i][j] × B[j][s − 1]) mod 2^n
12:   end for
13:   store C[i]
14: end for
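Algorithm 1 translates directly into a software model. The following is a sketch for illustration, not the VHDL implementation: the function name is hypothetical, matrices are Python lists of rows, and the innermost loop stands in for the s MACs that operate in parallel in hardware.

```python
def baseline_matmul(a, b, n=32):
    """Row-by-row s x s product with every accumulation reduced modulo 2**n."""
    s = len(a)
    mask = (1 << n) - 1
    c = []
    for i in range(s):
        row_a = a[i]                 # fetch and buffer A[i]
        acc = [0] * s                # reset the s multiply-accumulators
        for j in range(s):
            a_ij = row_a[j]          # select A[i][j]
            row_b = b[j]             # fetch B[j]
            for k in range(s):       # one iteration per MAC
                acc[k] = (acc[k] + a_ij * row_b[k]) & mask
        c.append(acc)                # store C[i]
    return c


print(baseline_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```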
3.2.3 Checksum Generation & Verification
To support error detection, several additions and modifications were made to the acceler-
ator presented in Section 3.2.2; these are shown in Figures 3.4, 3.5 and 3.6 and described
algorithmically in Algorithms 2, 3 and 4. Aside from lengthening accelerator latency, the
changes made have no effect upon normal operation: data transferred between DRAM
and FPGA RAM, and vice-versa, is identical in both quantity and form to that moved
previously.
The checksum generation logic shown sandwiched between the input RAM and MACs
in Figure 3.4 is detailed in Figure 3.5. The input buffer and multiplexer (MUX) serve
the same purpose as those shown in Figure 3.2 and described in Section 3.2.2, while
the adder, register, RAMs and two additional MUXes form the logic responsible for the
transformation of A and B into checksum-encoded matrices Ac and Br. Within the
datapath itself, an extra MAC is added to mirror the expansion of the matrices being
multiplied: what was an s × s multiplication is now effectively (s + 1) × (s + 1). On the
output side, a buffering register is added to hold rows of Cf as they are computed; this
is shown in Figure 3.4. The checksum verification logic shown in Figure 3.6, consisting
of a MUX, two adders, a register, RAM and two comparators, is added following the
buffer. The results generated by this logic are used to inform the accelerator’s controller
of any checksum discrepancies that occur during computation. The resource used for the
implementation of small RAMs needed for checksum generation and verification is also
parameterisable.
[Diagram: input RAM (2s × ns), checksum generation logic producing Ac and Br, s + 1 multiply-accumulators, an n(s + 1)-bit buffer for rows of Cf, an output RAM (s × ns) and checksum verification logic.]
Figure 3.4: ABFT-protected matrix multiplication datapath. Aside from lengthening
accelerator latency, the changes made have no effect upon normal operation:
data transferred between DRAM and FPGA RAM, and vice-versa, is identical
in both quantity and form to that moved previously. An extra MAC is added
to mirror the expansion of the matrices being multiplied: what was an s × s
multiplication is now effectively (s + 1) × (s + 1).
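[Diagram: input RAM feeding an ns-bit buffer and s : 1 multiplexer, an n-bit adder with register, a csc RAM (s × n) and a csr RAM (s × n), with outputs Ac (n bits) and Br (n(s + 1) bits).]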
Figure 3.5: Checksum generation logic for matrix multiplication accelerator, responsible
for the transformation of A and B into checksum-encoded matrices Ac and
Br.
The operations performed during each multiplication are largely unchanged following
the addition of this ABFT logic: results are computed row-by-row, as before, with input
checksums generated as input data is accessed and output checksums verified as output
data is computed. The exception to this is that, due to the need to access complete rows
of Br on a one-per-cycle basis, its checksums are computed before multiplication begins.
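[Diagram: an (s + 1) : 1 multiplexer selects n-bit elements of Cf; one adder with a register accumulates row-wise sums and a second adder with a csc RAM ((s + 1) × n) accumulates column-wise sums; two comparators produce s + 1 "csrs OK" flags and s + 1 "cscs OK" flags.]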
Figure 3.6: Checksum verification logic for matrix multiplication accelerator, responsible
for checking checksums contained within Cf. The results generated by this
logic are used to inform the accelerator’s controller of any checksum discrep-
ancies that occur during computation.
Algorithm 2 details the operation of Figure 3.5’s logic during this precomputation. Rows
of B are fetched from RAM (Line 2) and buffered in turn, with their elements selected
sequentially by a MUX (Line 6) in order to be accumulated (Line 7). As each row’s check-
sum is computed, it is stored (Line 9) in the row-wise checksum vector (csr) RAM. While
some performance penalty results from the decision to have row checksums precomputed,
the area overhead saving is significant since an s-input adder would otherwise be required
to complete those computations at speed.
Algorithm 2 ABFT-protected matrix multiplication Br[0 ·· s − 1][s] precomputation
 1: for i = 0 to s − 1 do
 2:   fetch B[i] as Br[i][0 ·· s − 1]
 3:   buffer Br[i][0 ·· s − 1]
 4:   Br[i][s] ← 0
 5:   for j = 0 to s − 1 do
 6:     select Br[i][j]
 7:     Br[i][s] ← (Br[i][s] + Br[i][j]) mod 2^n
 8:   end for
 9:   store Br[i][s]
10: end for
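A software model of Algorithm 2 makes the cost of the sequential precomputation visible. The sketch below uses hypothetical names and also counts the additions performed: one per element, so s × s in total, which is the source of the performance penalty mentioned above.

```python
def precompute_row_checksums(b, n=32):
    """Return (csr, additions): one modulo-2**n checksum per row of B."""
    mask = (1 << n) - 1
    csr, additions = [], 0
    for row in b:                 # fetch and buffer B[i]
        acc = 0                   # Br[i][s] <- 0
        for element in row:       # select each element in turn
            acc = (acc + element) & mask
            additions += 1
        csr.append(acc)           # store Br[i][s] in the csr RAM
    return csr, additions


print(precompute_row_checksums([[5, 6], [7, 8]]))  # ([11, 15], 4)
```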
Only one adder is needed in the checksum generation logic since its function switches
to Ac column checksum generation once Br’s row checksums have been generated. Algo-
rithm 3 details the operation of both the checksum generation logic and main datapath,
shown in Figure 3.4, following Br precomputation. Here, the MAC component of the
algorithm (Lines 15 to 18) covers the ‘additional’ Br and Cf column, s, not present in
either B or C. As rows of A are fetched from RAM (Line 3) then buffered and their
elements selected (Line 9), checksums are computed using Figure 3.5’s adder and column-
wise checksum vector (csc) RAM as an accumulator (Lines 20 to 26). A RAM is required
here rather than a register since the particular column’s checksum being calculated changes
every cycle. Rows of Br must be sourced from both the input RAM (Line 13) and csr
RAM (Line 14) simultaneously. The final row of Ac is treated differently (Line 11) since
it originates from the csc RAM rather than the input RAM. Sections of Cf that form
part of C, i.e. all but the final row and column, are stored in the output RAM (Line 31)
after buffering.
Algorithm 3 ABFT-protected matrix multiplication main computation
 1: for i = 0 to s do
 2:   if i < s then
 3:     fetch A[i] as Ac[i]
 4:     buffer Ac[i]
 5:   end if
 6:   Cf[i] ← (0 0 · · · 0)
 7:   for j = 0 to s − 1 do
 8:     if i < s then
 9:       select Ac[i][j]
10:     else
11:       fetch Ac[s][j]
12:     end if
13:     fetch B[j] as Br[j][0 ·· s − 1]
14:     fetch Br[j][s]
15:     Cf[i][0] ← (Cf[i][0] + Ac[i][j] × Br[j][0]) mod 2^n
16:     Cf[i][1] ← (Cf[i][1] + Ac[i][j] × Br[j][1]) mod 2^n
17:     · · ·
18:     Cf[i][s] ← (Cf[i][s] + Ac[i][j] × Br[j][s]) mod 2^n
19:     if i < s then
20:       if i = 0 then
21:         Ac[s][j] ← 0
22:       else
23:         fetch Ac[s][j]
24:       end if
25:       Ac[s][j] ← (Ac[s][j] + Ac[i][j]) mod 2^n
26:       store Ac[s][j]
27:     end if
28:   end for
29:   buffer Cf[i]
30:   if i < s then
31:     store Cf[i][0 ·· s − 1] as C[i]
32:   end if
33: end for
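Algorithm 3 can likewise be modelled in software. The sketch below is illustrative and uses hypothetical names; the row checksums of B are computed inline as a stand-in for Algorithm 2, and the list `csc` plays the role of the column-wise checksum RAM that accumulates the final row of Ac while the first s rows are processed.

```python
def abft_matmul(a, b, n=32):
    """Return (C, C_f): the s x s product and its (s + 1) x (s + 1) encoded form."""
    s = len(a)
    mask = (1 << n) - 1
    csr = [sum(row) & mask for row in b]        # stand-in for Algorithm 2
    csc = [0] * s                               # column-checksum RAM for A
    c_f = []
    for i in range(s + 1):
        row_a = a[i] if i < s else list(csc)    # final row of A_c comes from csc
        acc = [0] * (s + 1)                     # s + 1 multiply-accumulators
        for j in range(s):
            row_b = b[j] + [csr[j]]             # row j of B_r
            for k in range(s + 1):
                acc[k] = (acc[k] + row_a[j] * row_b[k]) & mask
            if i < s:
                csc[j] = (csc[j] + row_a[j]) & mask
        c_f.append(acc)                         # buffer C_f[i]
    c = [row[:s] for row in c_f[:s]]            # only C is stored in the output RAM
    return c, c_f


C, C_f = abft_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C)    # [[19, 22], [43, 50]]
print(C_f)  # [[19, 22, 41], [43, 50, 93], [62, 72, 134]]
```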
Checksum verification occurs in parallel with Ac checksum generation and normal com-
putation. With reference to Figure 3.6 and Algorithm 4, verification is completed as
follows. As rows of Cf become available (Line 2), their elements are selected in turn
(Line 5) and accumulated in both row- (Line 7) and column-wise (Line 17) fashions simul-
taneously. Once the final column (for row-wise checksums, Line 9) or row (for column-wise
checksums, Line 21) of Cf is reached, the appropriate checksum is compared with its corre-
sponding element. These results are stored in software-accessible registers for analysis by
the accelerator’s driver. Note that the verification of Cf’s column-wise checksums requires
an accumulator constructed using a RAM, rather than a register, since the column index
changes once per cycle.
Algorithm 4 ABFT-protected matrix multiplication Cf verification
 1: for i = 0 to s do
 2:   wait until Cf[i] available
 3:   csr[i] ← 0
 4:   for j = 0 to s do
 5:     select Cf[i][j]
 6:     if j < s then
 7:       csr[i] ← (csr[i] + Cf[i][j]) mod 2^n
 8:     else
 9:       csrs OK[i] ← csr[i] = Cf[i][s]
10:     end if
11:     if i < s then
12:       if i = 0 then
13:         csc[j] ← 0
14:       else
15:         fetch csc[j]
16:       end if
17:       csc[j] ← (csc[j] + Cf[i][j]) mod 2^n
18:       store csc[j]
19:     else
20:       fetch csc[j]
21:       cscs OK[j] ← csc[j] = Cf[s][j]
22:     end if
23:   end for
24: end for
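The verification procedure of Algorithm 4 can be sketched in the same way. The function name is hypothetical; the two returned lists correspond to the row-wise and column-wise comparison flags that the hardware exposes through software-accessible registers.

```python
def verify(c_f, n=32):
    """Return (csrs_ok, cscs_ok): per-row and per-column checksum comparison flags."""
    s = len(c_f) - 1
    mask = (1 << n) - 1
    csc = [0] * (s + 1)                  # column-wise accumulator (a RAM in hardware)
    csrs_ok, cscs_ok = [], [None] * (s + 1)
    for i in range(s + 1):
        csr = 0                          # row-wise accumulator (a register in hardware)
        for j in range(s + 1):
            element = c_f[i][j]
            if j < s:
                csr = (csr + element) & mask
            else:
                csrs_ok.append(csr == element)
            if i < s:
                csc[j] = (csc[j] + element) & mask
            else:
                cscs_ok[j] = (csc[j] == element)
    return csrs_ok, cscs_ok


print(verify([[19, 22, 41], [43, 50, 93], [62, 72, 134]]))
# ([True, True, True], [True, True, True])
```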
3.2.4 Fault Location
Consider the simple ABFT-protected matrix multiplication shown in Equation 3.1. Had
the multiplication resulted in, for example, one of the alternative outputs shown in Equa-
tion 3.2 instead, the positioning of incorrect checksum values would have revealed location
information regarding the MACs that caused the errors. Each of these three cases is syn-
onymous with a single MAC register’s least-significant bit (LSB) experiencing a stuck-at-
one (SA1) fault. Elements that have been calculated incorrectly are shown in bold, while
italics mark error-indicating checksum values. Note that column checksum mismatches
relate one-to-one with faulty MACs, since each MAC is responsible for computing exactly
one output column’s elements. Simultaneous faults occurring both within an individual
MAC and across multiple MACs would yield equally informative results: a single column
checksum mismatch in the former case and multiple in the latter.


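\[
\begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 4 & 6 \end{pmatrix}
\begin{pmatrix} 5 & 6 & 11 \\ 7 & 8 & 15 \end{pmatrix}
=
\begin{pmatrix} 19 & 22 & 41 \\ 43 & 50 & 93 \\ 62 & 72 & 134 \end{pmatrix}
\tag{3.1}
\]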


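\[
\begin{pmatrix} 21 & 22 & 41 \\ 45 & 50 & 93 \\ 63 & 72 & 134 \end{pmatrix},
\begin{pmatrix} 19 & 23 & 41 \\ 43 & 51 & 93 \\ 62 & 73 & 134 \end{pmatrix},
\begin{pmatrix} 19 & 22 & 43 \\ 43 & 50 & 95 \\ 62 & 72 & 135 \end{pmatrix}
\tag{3.2}
\]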
3.2.5 Error Injection
Rather than opting to directly cause faults, whether by purposefully upsetting configura-
tion bitstreams or otherwise, it was instead chosen to emulate datapath faults by causing
data errors at MAC outputs. This is accomplished by issuing error injection instructions
on the controlling CPU, which cause one or more specified bits of one or more particular
MAC outputs to be flipped using an array of exclusive-OR (XOR) gates in hardware.
While not particularly representative of any real-world fault type, this simple scheme was
chosen for hardware testing since it allows different errors to be injected tens of thousands
of times per second without causing confounding, or masking, issues to occur which would
skew performance results by allowing errors to go undetected.
It should be emphasised that, as is also the case for the remaining hardware and software
(fault observability) testing described later in this thesis, errors emulated at MAC outputs
are not indicative of faults occurring merely at those locations. Occurrences of faults at
any point along the datapath, whether in logic or routing, are liable to produce errors
at MAC outputs since those are the points through which all data must travel prior to
storage.
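The bit-flipping mechanism can be modelled as an XOR of each MAC output with a mask. This is a minimal sketch with hypothetical names and example values; in the hardware described, the masks would be set by error injection instructions issued on the CPU.

```python
def inject_errors(mac_outputs, masks, n=32):
    """XOR each MAC output with its mask; a set mask bit flips that output bit."""
    word = (1 << n) - 1
    return [(value ^ mask) & word for value, mask in zip(mac_outputs, masks)]


# Flip bit 0 of the first output and bit 2 of the third; leave the second untouched.
print(inject_errors([19, 22, 41], [0b001, 0b000, 0b100]))  # [18, 22, 45]
```

Applying the same masks a second time restores the original values, since XOR is its own inverse.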
3.3 Overheads
Experiments were performed to assess the impacts of adding ABFT protection to the
baseline accelerator in terms of area and performance. All designs were compiled using
version 14.7 of Xilinx’s Integrated Synthesis Environment (ISE) toolchain. The following
range of implementation variables was used:
• Target device: Xilinx Zynq-7000 XC7Z020.
• s: {2, 4, 8, 16, 32}.
• n (bits): 32 (signed, fixed-point).
• Data memory resource type: block random-access memory (BRAM).
• Checksum memory resource type: distributed RAM.
• Multiplier resource type: digital signal processing block (DSP).
• m (cycles): 15.
• a (cycles): 1.
Since triplets of DSPs—each optimally pipelined when they absorb five register
stages [68]—were required to implement each two-input 32-bit multiplier, 15-stage
multipliers were used consistently. Achieved fmax were found to be highest when
accumulators were not pipelined, allowing DSP block absorption, so single-stage
accumulators were used throughout.
3.3.1 Area
Table 3.1 contains the raw resource usage figures obtained for all implementations. Per-
centages of the total number of each of these resources for the target device are also in-
cluded, along with means of the individual proportions—each calculated as the mean (µ)
of LUT (%), flip-flop (FF) (%), BRAM (%) and DSP (%)—to give an indication of the
overall resource utilisation. Figure 3.7 presents a visual summary of the combined resource
usage data. Trendlines, shown dashed, have been included to counter the effects of CAD
tool noise.
It can be seen from Table 3.1 that BRAM and DSP overheads are fixed while register
and LUT overheads increase with s. Fixed BRAM overheads are due to the requirement
for three small (s, s and s + 1 32-bit words), separate dual-port RAMs for checksum
storage, while the same number of extra DSPs is required for any s due to the need for a
single additional MAC in all cases. While additional register and LUT requirements both
increase with s, proportionately they decrease.
Matrix size  ABFT     Resource type
s            enabled  LUT            FF             BRAM         DSP          Total
2            ✗        239 (0.449%)   210 (0.197%)   2 (0.714%)   6 (2.73%)    1.02%
2            ✓        618 (1.16%)    476 (0.447%)   5 (1.79%)    9 (4.09%)    1.69%
4            ✗        441 (0.829%)   406 (0.382%)   8 (2.86%)    12 (5.45%)   2.38%
4            ✓        757 (1.42%)    749 (0.704%)   11 (3.93%)   15 (6.82%)   3.22%
8            ✗        604 (1.14%)    794 (0.746%)   16 (5.71%)   24 (10.9%)   4.63%
8            ✓        877 (1.65%)    1286 (1.21%)   19 (6.79%)   27 (12.3%)   5.49%
16           ✗        613 (1.15%)    1566 (1.47%)   30 (10.7%)   48 (21.8%)   8.79%
16           ✓        1159 (2.18%)   2352 (2.21%)   33 (11.8%)   51 (23.2%)   9.85%
32           ✗        2115 (3.98%)   3105 (2.92%)   58 (20.7%)   96 (43.6%)   17.8%
32           ✓        3072 (5.77%)   4472 (4.20%)   61 (21.8%)   99 (45.0%)   19.2%
Table 3.1: Baseline & ABFT-protected accelerator resource usage, containing the raw
resource usage figures obtained for all implementations. Percentages of the total
number of each of these resources for the target device are also included, along
with means of those proportions to give an indication of the overall resource
utilisation.
The low area cost of introducing checksum generation and verification circuitry is one
of the most attractive properties of ABFT. For the largest design implemented, capable
of multiplying pairs of 32 × 32 matrices with 32-bit data elements, the overall area
overhead was just 7.87%. The majority of this overhead lies in the least-used resources,
LUTs and FFs: smaller overheads of 5.17% and 3.13% were encountered for the scarcer
BRAM and DSP resources, respectively. Applications involving linear algebraic operations
efficiently implemented on modern FPGAs tend to be DSP-heavy, so the minimal DSP
overhead is particularly beneficial.
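The overhead figures quoted above follow directly from the s = 32 rows of Table 3.1. The
short Python sketch below reproduces that arithmetic; the dictionary and function names
are illustrative only and do not correspond to any tooling used in this work.

```python
# Resource counts for s = 32, copied from Table 3.1.
baseline = {"LUT": 2115, "FF": 3105, "BRAM": 58, "DSP": 96}
protected = {"LUT": 3072, "FF": 4472, "BRAM": 61, "DSP": 99}

# Mean utilisation (%) across the four resource types, from the Total column.
baseline_mean, protected_mean = 17.8, 19.2


def overhead_percent(base, prot):
    """Relative increase of prot over base, expressed as a percentage."""
    return (prot - base) / base * 100.0


# Per-resource overheads and the overall overhead based on mean utilisation.
per_resource = {r: overhead_percent(baseline[r], protected[r]) for r in baseline}
overall = overhead_percent(baseline_mean, protected_mean)

for resource, value in per_resource.items():
    print(f"{resource}: {value:.3f}%")
print(f"Overall: {overall:.3f}%")
```

The same function can be applied to any other pair of rows in Table 3.1 to obtain the
per-resource curves summarised in Figure 3.7.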
3.3.2 Performance
Impacts in terms of performance are presented in Table 3.2. 10,000 tests were performed
for each value of s, with operating frequency kept at 50 MHz throughout, and results
were averaged across all tests completed. Figure 3.8 presents a visual summary of the
encountered latency overheads.

[Figure 3.7 plot not reproduced: ∆ area (%) on a logarithmic axis (1 to 1000) against
matrix size s, with series for LUT, FF, BRAM, DSP and All.]
Figure 3.7: ABFT-protected accelerator resource usage overhead versus baseline, showing
the changes in resource utilisation for each design versus its equivalently sized
unprotected, baseline implementation.
Slowdown over the baseline design is caused by the requirement to precompute row
checksums before multiplication itself takes place, while the column checksum generation
and checksum validation have no impact upon performance since they operate in parallel
with the MACs. The trend reversal seen after s = 16 can be attributed to data transfer
throttling: once s passes 16, memory copies begin to account for a greater proportion of
runtime than accelerator execution. Changes in fmax are largely not meaningful and can
likely be attributed to the stochastic nature of the CAD tools used for compilation.
The latency overheads encountered, while not huge, were not insignificant either. For
the largest design implemented, with s = 32 and n = 32, a latency penalty of 45.5%
was measured. A clear tradeoff exists across the spectrum of hardware fault tolerance
techniques: for example, while triple modular redundancy (TMR) necessitates very large
(> 200%) area overhead in comparison to ABFT’s small figures (< 10% for reasonably
sized matrices), its application incurs virtually no latency penalty.
Matrix size   ABFT      Execution time   fmax
s             enabled   (µs)             (MHz)
2             ✗         254              90.2
2             ✓         272              77.9
4             ✗         314              80.1
4             ✓         339              81.7
8             ✗         348              76.9
8             ✓         546              77.1
16            ✗         497              50.5
16            ✓         1350             60.7
32            ✗         3100             58.2
32            ✓         4510             65.2
Table 3.2: Baseline & ABFT-protected accelerator performance. Averaged execution
times and maximum operating frequencies achieved are shown for each design
to allow side-by-side comparison of unprotected and protected implementations.
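The latency overheads discussed in this section can be recomputed from the execution
times in Table 3.2. The following Python sketch does so for every value of s; the names
used are illustrative assumptions rather than part of the test framework.

```python
# Mean execution times in microseconds from Table 3.2: s -> (baseline, ABFT-protected).
exec_time_us = {
    2: (254, 272),
    4: (314, 339),
    8: (348, 546),
    16: (497, 1350),
    32: (3100, 4510),
}


def latency_overhead_percent(baseline_us, protected_us):
    """Relative increase in execution time, expressed as a percentage."""
    return (protected_us - baseline_us) / baseline_us * 100.0


latency_overhead = {
    s: latency_overhead_percent(base, prot) for s, (base, prot) in exec_time_us.items()
}

for s, value in latency_overhead.items():
    print(f"s = {s:2d}: {value:6.2f}%")
```

Under these figures the overhead peaks at s = 16 and falls again at s = 32, which is the
trend reversal attributed above to data transfer throttling.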
3.4 Fault Observability
In order to assess the fault observability of the chosen detection method, fault injection
simulations were performed to ascertain the hardware’s ability to correctly detect and
locate faults. Detectable faults are those that result in one or more checksum mismatches—
in one or more rows, columns or both—while those that are locatable cause mismatches
within the columns corresponding to the MACs they have affected.
The fault model chosen was that of individually targeted SA1 accumulator output
bits. Such faults were chosen since they are representative of a range of phenomena,
e.g. worn transistors or bridged interconnects. In terms of location, accumulator outputs
were chosen since these components lie at the ends of the datapaths of interest, affording
them maximal opportunity to impact the results. Note that the choice of fault model
differed from that used during hardware testing, as explained in Section 3.2.5: while
hardware testing used MAC output bit inversion in order to avoid masking, software
simulation used stuck-at faults since such avoidance was not desirable.
Results gleaned from this testing were, as is the case in the remainder of the fault
observability testing performed in this thesis, independent of fault rate, area and latency:
they demonstrate the proportions of total accelerator executions that should be expected
to result in particular classes of outputs under fixed fault conditions. They cannot be used
to directly ascertain the expected rates of certain output classes’ occurrence, although
this can be achieved by scaling the proportions to take fault rate, area and/or latency, as
required, into account.
[Figure 3.8 plot not reproduced: ∆ latency (%) on an axis from 0 to 180 against
matrix size s.]
Figure 3.8: ABFT-protected accelerator latency overhead versus baseline, showing the
change in execution time for each design versus its equivalently sized
unprotected, baseline implementation.

Simulations were performed in software rather than in hardware since hundreds of
millions of tests could be run within a reasonable time period across a wide range of
implementations with relatively little up-front effort (versus hardware design) and minimal
intervention required at runtime. Since the faults emulated affect the operation being
performed at an algorithmic level, and the fault tolerance being verified also acts at
that level, fault injection experimentation in hardware was considered to be unnecessary
and software simulation sufficient.
Algorithm 5 details the steps executed for each simulation. Arrays for s × s matrices A
and B are provisioned first (Lines 1 and 2), with the elements of each filled with random
n-bit signed values selected from a uniform distribution (Lines 3 and 4). Matrices Ac and
Br are created (Lines 5 and 6), with A and B encoded to form them as explained in
Section 2.7.1 (Lines 7 and 8) and the output array, representing matrix Cf, is provisioned
(Line 9). An (s + 1)-element bit mask array is created (Line 10) to represent the faults
present during the simulation. Across the array’s s + 1 n-bit elements, f bits at random
locations are non-zero, where f is the number of simultaneous faults. Matrix multiplication
to generate Cf’s values, performed modulo 2^n to represent overflow, proceeds as normal
but, both before (Line 13) and during (Line 15) each multiply-accumulate step, the
corresponding column’s bit mask is ‘or’ed with the (intermediate) result to emulate either
one or two SA1 MAC register output bits. A result was recorded as detected (Line 19) if
one or more of the checksums present within Cf contained mismatches, and as located
(Line 20) if, additionally, those mismatches corresponded to the columns chosen for fault
injection.
Algorithm 5 Fault injection simulation
 1: create s × s matrix A
 2: create s × s matrix B
 3: A.rand_fill()
 4: B.rand_fill()
 5: create (s + 1) × s matrix Ac
 6: create s × (s + 1) matrix Br
 7: Ac ← A.add_cs(‘col’)
 8: Br ← B.add_cs(‘row’)
 9: create (s + 1) × (s + 1) matrix Cf
10: bit_mask, faulty_cols ← generate_bit_mask(s, n, f)
11: for i = 0 to s do
12:   for j = 0 to s do
13:     Cf[i][j] ← bit_mask[j]
14:     for k = 0 to s − 1 do
15:       Cf[i][j] ← ((Cf[i][j] + Ac[i][k] × Br[k][j]) mod 2^n) bitwise or bit_mask[j]
16:     end for
17:   end for
18: end for
19: detected ← not Cf.check_cs()
20: located ← detected and Cf.diagnose_cs() = faulty_cols
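As an illustration of Algorithm 5, the following is a minimal, self-contained Python
sketch of a single fault injection trial. It is not the framework used for the reported
results. Several details are assumptions made for the sake of a runnable example: the
encoding of Section 2.7.1 is taken to be the conventional one (a column-sum row appended
to A, a row-sum column appended to B, both modulo 2^n), values are held as unsigned
residues in place of n-bit signed numbers, diagnosis is taken to mean the set of columns
whose column checksums mismatch, and all function names are hypothetical.

```python
import random


def add_checksums(a, b, n):
    """Return (Ac, Br): A with a column-sum row and B with a row-sum column (assumed encoding)."""
    mod, s = 1 << n, len(a)
    ac = [row[:] for row in a] + [[sum(a[i][j] for i in range(s)) % mod for j in range(s)]]
    br = [row[:] + [sum(row) % mod] for row in b]
    return ac, br


def generate_bit_mask(s, n, f, rng):
    """Set f distinct (column, bit) positions high, retrying on collisions as in Algorithm 6."""
    bit_mask, faulty_cols = [0] * (s + 1), [False] * (s + 1)
    for _ in range(f):
        while True:
            col, bit = rng.randrange(s + 1), rng.randrange(n)
            if not bit_mask[col] & (1 << bit):
                break
        bit_mask[col] |= 1 << bit
        faulty_cols[col] = True
    return bit_mask, faulty_cols


def multiply_with_faults(ac, br, bit_mask, n):
    """Compute Cf = Ac x Br modulo 2^n, or-ing column j's mask into every accumulation."""
    mod, s = 1 << n, len(br)
    cf = [[0] * (s + 1) for _ in range(s + 1)]
    for i in range(s + 1):
        for j in range(s + 1):
            acc = bit_mask[j]
            for k in range(s):
                acc = ((acc + ac[i][k] * br[k][j]) % mod) | bit_mask[j]
            cf[i][j] = acc
    return cf


def mismatched_rows(cf, n):
    """Flag each row whose first s elements do not sum to its checksum element."""
    mod, s = 1 << n, len(cf) - 1
    return [sum(cf[i][:s]) % mod != cf[i][s] for i in range(s + 1)]


def mismatched_cols(cf, n):
    """Flag each column whose first s elements do not sum to its checksum element."""
    mod, s = 1 << n, len(cf) - 1
    return [sum(cf[i][j] for i in range(s)) % mod != cf[s][j] for j in range(s + 1)]


def simulate(s, n, f, rng):
    """Run one trial and return (detected, located)."""
    mod = 1 << n
    a = [[rng.randrange(mod) for _ in range(s)] for _ in range(s)]
    b = [[rng.randrange(mod) for _ in range(s)] for _ in range(s)]
    ac, br = add_checksums(a, b, n)
    bit_mask, faulty_cols = generate_bit_mask(s, n, f, rng)
    cf = multiply_with_faults(ac, br, bit_mask, n)
    cols = mismatched_cols(cf, n)
    detected = any(mismatched_rows(cf, n)) or any(cols)
    located = detected and cols == faulty_cols
    return detected, located


# Example usage: estimate undetected and unlocated proportions for one (s, n, f) point.
rng = random.Random(2015)
trials = 2000
outcomes = [simulate(s=4, n=8, f=1, rng=rng) for _ in range(trials)]
undetected = sum(not d for d, _ in outcomes) / trials
unlocated = sum(not loc for _, loc in outcomes) / trials
print(f"undetected: {undetected:.2%}, unlocated: {unlocated:.2%}")
```

Because a stuck-at-1 bit is masked whenever the affected bit would have been high anyway,
some trials go undetected; proportions from this sketch are not expected to match
Figure 3.9 exactly given the assumptions listed.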
Algorithm 6 details the procedure used to generate the bit mask, called as
generate_bit_mask() in Algorithm 5. An array for the bit mask (Line 1) and another
to represent faulty columns (Line 2) are provisioned, with target column and bit pairs
randomly selected (Lines 7 and 8) and checked for uniqueness (Line 9) before being set
high (Line 10). The column affected is also flagged (Line 11).
The simulation framework was written in Python and threaded [69] to allow efficient
parallel execution on a 64-core Advanced Micro Devices (AMD) Opteron [70]-based server.
The test steps described were completed 1,024,000 times for each combination of the
following variables:
• s: {2, 4, 8, 16, 32}.
• n (bits): {2, 4, 8, 16, 32} (signed, fixed-point).
• Fault type: permanent.
• f : {1, 2}.
Algorithm 6 generate_bit_mask() procedure used in fault injection simulation
Require: s, n, f
 1: create (s + 1)-element vector bit_mask
 2: create (s + 1)-element vector faulty_cols
 3: bit_mask.zero_fill()
 4: faulty_cols.zero_fill()
 5: for i = 0 to f − 1 do
 6:   repeat
 7:     col = rand(0 to s)
 8:     bit = rand(0 to n − 1)
 9:   until not (bit_mask[col] bitwise and 2^bit)
10:   bit_mask[col] ← bit_mask[col] bitwise or 2^bit
11:   faulty_cols[col] ← true
12: end for
13: return bit_mask, faulty_cols
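The repeat-until loop of Algorithm 6 selects f distinct (column, bit) positions uniformly
at random. As a hedged alternative, and not the method described above, the same
distribution could be drawn without retries by sampling f distinct indices from the
(s + 1) × n candidate positions. The function name below is hypothetical.

```python
import random


def generate_bit_mask_sampled(s, n, f, rng=random):
    """Pick f distinct (column, bit) positions uniformly among (s + 1) * n candidates."""
    bit_mask = [0] * (s + 1)
    faulty_cols = [False] * (s + 1)
    # Each flat index encodes one (column, bit) pair; sampling without replacement
    # guarantees uniqueness, replacing the uniqueness check on Line 9.
    for position in rng.sample(range((s + 1) * n), f):
        col, bit = divmod(position, n)
        bit_mask[col] |= 1 << bit
        faulty_cols[col] = True
    return bit_mask, faulty_cols


# Example usage with a seeded generator for repeatability.
mask, cols = generate_bit_mask_sampled(s=4, n=8, f=2, rng=random.Random(1))
print(mask, cols)
```

Sampling without replacement avoids the (rare) repeated draws of the rejection loop, at
the cost of departing from the line-by-line structure referenced in the text.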
This testing represented a total of several days of computational effort. Proportions of
detected and located faults for each value of s, n and f were averaged across all simulations
performed. The results are presented in Figure 3.9.
For single fault injection, in all cases except for s = 2, n = 2, the proportion of
undetected faults dropped off with both s and n. For s ≥ 16, undetectable fault proportions
fell below 0.1% for all data widths and, for s = 32, undetectable faults ceased to be
encountered. As expected, proportions of unlocatable faults were higher than those that
were undetectable due to the lack of redundancy in checksums used for location. In all
cases except for s = 2, n = 2, however, the proportion of unlocated faults observed dropped
with increasing data width for each value of s. For larger s, the locatability of faults is
largely independent of s itself.
Similar trends were seen for double fault injection testing. The rates of both undetected
and unlocated faults were all lower, however, for each combination of s and n. This is
expected of undetected faults since the likelihood of errors being masked in multiple columns
simultaneously decreases as the number of affected columns increases. The proportions
of unlocatable double faults encountered were again significantly higher than those which
were undetectable but, for all cases except for s = 2, n = 2, dropped off with increasing
data width for all s.
[Figure 3.9 plots not reproduced: four panels, (a) undetected single faults, (b) undetected
double faults, (c) unlocated single faults and (d) unlocated double faults, each showing
proportion of results (%) on a logarithmic axis against data width n (bits) for
s = 2, 4, 8, 16 and 32.]
Figure 3.9: Fault proportions for ABFT-protected matrix multiplication. Each figure
shows the proportion of matrix multiplication results, for a particular s and
n, for which a particular outcome was observed.

It should be noted that the results shown in Figure 3.9 do not take area into account;
that is, designs are subjected to single (or double) faults regardless of their physical size.
While time was not explicitly considered in the fault observability testing, it is nevertheless
accurate to say that designs of different area experience different fault rates under
otherwise identical conditions. For this reason, additional plots, shown in Figure 3.10, were
produced to attempt to capture the effect of area upon likelihood of fault manifestation,
thereby scaling the previously seen fault proportions. Both s and n have an impact upon
area. For s, total ABFT-enabled resource usage figures from Table 3.1 were used to scale
the results, with s = 2 taken as the baseline. For example, an otherwise-equivalent
accelerator with s = 4 consumes 3.22/1.69 = 1.91× the area of that with s = 2, and is therefore
considered likely to experience 1.91× the fault rate. For n, linear scaling was used, with
n = 2 taken as the baseline, for further scaling. The latter tends to penalise designs with
larger n since a proportion of logic, particularly that for the FSM, is independent of n;
however, this was considered to be a minor concern.
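The area scaling described above can be sketched in a few lines of Python. The totals are
the ABFT-enabled entries from the Total column of Table 3.1; combining the s-dependent
and n-dependent factors by multiplication is an assumption drawn from the phrase ‘for
further scaling’, and the function names are illustrative.

```python
# Mean ABFT-enabled resource utilisation (%) from the Total column of Table 3.1.
abft_total_percent = {2: 1.69, 4: 3.22, 8: 5.49, 16: 9.85, 32: 19.2}


def area_scale_factor(s, n, base_s=2, base_n=2):
    """Table-derived scaling for s multiplied by linear scaling for n (assumed combination)."""
    return (abft_total_percent[s] / abft_total_percent[base_s]) * (n / base_n)


def area_scaled(proportion, s, n):
    """Scale a raw fault proportion by the relative area of the (s, n) design."""
    return proportion * area_scale_factor(s, n)


# Example usage: the s = 4, n = 2 design relative to the s = 2, n = 2 baseline.
print(f"{area_scale_factor(4, 2):.2f}x")
```

Because the factor can exceed one, area-scaled values are relative measures rather than
percentages, which is consistent with the larger axis ranges of Figure 3.10.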
The results presented in Figure 3.10 tend to show similar but flatter curves than those
in Figure 3.9. Area-scaled detectability results are promising, with proportions of
undetected faults falling as s rose in value under both single and double fault injection.
Unfortunately, area-scaled locatability was found to decrease as s increased; however, the
effect upon locatability of increases in n was found to be either minimal (for single fault
testing), with all curves flattening off around n = 8 apart from that for s = 2, or positive
(for double fault testing).

[Figure 3.10 plots not reproduced: four panels, (a) undetected single faults, (b) undetected
double faults, (c) unlocated single faults and (d) unlocated double faults, each showing
area-scaled proportion of results on a logarithmic axis against data width n (bits) for
s = 2, 4, 8, 16 and 32.]
Figure 3.10: Fault proportions for ABFT-protected matrix multiplication scaled by area,
with total resource usage figures used to scale for s and linear scaling used
for n.
3.5 Conclusion
This chapter detailed the design, implementation and evaluation of an ABFT-protected
matrix multiplication accelerator running in hardware on a hybrid CPU-FPGA SoC
platform. Specifics of the hardware-software platform developed were presented first, followed
by the baseline accelerator and modifications made to harden the design against faults.
Architectural details governing the inference of fault locations and the method of error
injection were also given. Results were presented with an emphasis on implementational
overhead in terms of area, frequency and latency, and a fault injection simulation method
was described and used to evaluate the fault observability of the developed system.
The results showed significant promise for ABFT’s application in hardware: for the
physically largest-implemented (s = 32, n = 32) accelerator tested, the area overhead
incurred was 7.87% averaged across all resources, comparing favourably with competing
fault tolerance techniques such as TMR. For the same s and n, the latency penalty incurred
was 45.5%, and fault injection simulation suggested single datapath fault locatability of
96.7% with detectability well in excess of 99%.
The work presented in this chapter represents a solid foundation upon which further
enhancements could, and can still, be made. In particular, the completed ABFT
implementation for error detection and fault location, as well as the simulation framework
for fault injection, represent significant output upon which the work presented in
Chapters 4, 5 and 6 is based.
4 Error Correction via Runtime Resource Reallocation
4.1 Introduction
In this chapter, the foundational work presented in Chapter 3, which focussed upon
error detection and fault location inference, is built upon to allow those errors to be
corrected using two distinct strategies. A combination of ‘bolt-on’ algorithm-based fault
tolerance (ABFT) error detection logic and resource reallocation serves to provide
low-overhead datapath fault tolerance at runtime. Initially, the latter is achieved through
the use of additional logic which, guided by information gleaned from ABFT, reduces
algorithmic parallelisation at runtime in order to maintain accurate operation.
Field-programmable gate arrays (FPGAs) are uniquely placed to allow further area savings to
be made when incorporating error correction mechanisms thanks to their dynamic
reconfigurability; dynamic partial reconfiguration (DPR) is therefore called upon for the
purpose of error correction as well. For ease of comparison, the benchmark, platform and
design tools used for the work described in this chapter are identical to those in Chapter 3.
The results in this chapter demonstrate that rapid yet accurate fault diagnoses along
with low hardware (area), performance (latency) and, where necessary, software (memory
storage) penalties are achievable through algorithm-tailored error correction strategies.
In particular, for the largest circuit implemented that is able to detect, diagnose and
correct errors within its datapath, area overheads of 12.4% (with additional logic for
error correction) or 10.1% (with DPR) are achievable in return for latency penalties as
low as 24.5% under fault-free operation.
4.1.1 Contributions
The original contributions of the work presented in this chapter are:
• The first implementation of custom logic for error correction in the presence of faulty
resources guided by an ABFT error detection mechanism.
• The first implementation of ABFT-protected hardware using DPR for recovery.
• A quantitative analysis of the overheads—of resources, performance and memory—
incurred through the incorporation of those error correction strategies into a bench-
mark hardware accelerator.
4.1.2 Publications
The work presented in this chapter has been peer-reviewed and appeared in the 2014
proceedings of the IEEE International Symposium on Field-programmable Custom
Computing Machines (FCCM) [71] and International Conference on Field-programmable Logic
and Applications (FPL) [72].
4.1.3 Outline
The remainder of this chapter is organised as follows. Section 4.2 details the development
and functionality of two alternative hardware error correction strategies, with Section 4.2.1
considering the use of additional logic to achieve this aim and Section 4.2.2 making use of
DPR to achieve the same goal. Overheads of the two methods are analysed in Section 4.3,
with area considered in Section 4.3.1 and performance in Section 4.3.2. For DPR-facilitated
error correction, an additional overhead—memory utilisation—is studied. Concluding
comments are given in Section 4.4.
4.2 Implementation
4.2.1 Additional Logic
Once one or more errors have been detected in the accelerator described in Chapter 3,
and the resources causing those errors identified, the matrix multiplication can be rerun
in a modified fashion such that the faulty resources are bypassed. Initially, a data-shifting
strategy was employed: by effectively reducing the level of parallelism, i.e. not making use
of all previously available multiply-accumulators (MACs), and dynamically reallocating
data to the remaining resources, correct computation can be achieved at the expense
of elongated computation time. To achieve this, the datapath shown in Figure 3.4 was
modified to that shown in Figure 4.1: the two are identical save for the two circular
shifters, shown in Figure 4.3, present in the latter.
Figure 4.2 demonstrates the operation of this data-shifting strategy to route around a
single faulty MAC in an s = 2 matrix multiplier. In each case, Br input data—signified
by boxed numbers which correspond to their original places—is captured in the circular
shifter inserted onto the path for Br as shown in Figure 4.1. This additional logic is
capable of rotating input data ‘downwards’ by x places in x clock cycles before it is
fed to the MACs—signified by circled numbers—for processing. No modifications to the
flow of Ac input data are required since the same values of Ac are presented to all MACs
simultaneously. Post-computation, the second circular shifter, this one configured to rotate
data ‘upwards,’ replaces the buffering register shown in Figure 3.4 that captures rows of
Cf. In the presence of a single faulty MAC, a single-place data shift is all that is required
to bypass it.
[Figure 4.1 diagram not reproduced. Labelled blocks: input RAM (2s × ns), checksum
generation logic, Ac (n), Br (n(s + 1)), s + 1 MACs (× +), Cf (n(s + 1)), output RAM
(s × ns) and checksum verification logic.]
Figure 4.1: ABFT-enabled datapath with circular shifters, allowing the dynamic realloca-
tion of data to resources.
Note that in all three cases the operation is the same: during the first execution, input
data is rotated downwards by one place, computed, and output data rotated upwards
by one place to correct for the input shift. During the second execution, no shifting is
required, however only the result from the MAC directly above that found to be faulty
is stored, overwriting the value outputted during the first execution. These steps remain
the same for any value of s.
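The two-execution procedure described above can be modelled in software as follows. This
Python sketch is a behavioural illustration only: each MAC is reduced to a function
applied to one operand, a faulty MAC is modelled as producing no usable result (None)
rather than a corrupted value, and all names are hypothetical.

```python
def rotate_down(values, places=1):
    """Rotate so that the element at position d moves to position d + places."""
    places %= len(values)
    split = len(values) - places
    return values[split:] + values[:split]


def rotate_up(values, places=1):
    """Inverse of rotate_down: the element at position d moves to position d - places."""
    places %= len(values)
    return values[places:] + values[:places]


def run_macs(inputs, mac_op, faulty_mac):
    """One execution: MAC m processes inputs[m]; the faulty MAC yields None."""
    return [None if m == faulty_mac else mac_op(x) for m, x in enumerate(inputs)]


def compute_row_bypassing(inputs, mac_op, faulty_mac):
    """Recover a full row of results despite one faulty MAC, using two executions."""
    # First execution: shift inputs down by one place, compute, shift outputs back up.
    first = rotate_up(run_macs(rotate_down(inputs), mac_op, faulty_mac))
    # Second execution: no shifting; keep only the result from the MAC directly
    # above the faulty one, overwriting the value lost in the first execution.
    second = run_macs(inputs, mac_op, faulty_mac)
    above = (faulty_mac - 1) % len(inputs)
    result = first[:]
    result[above] = second[above]
    return result


# Example usage for s = 2 (three MACs), with a stand-in operation per MAC.
for faulty in range(3):
    print(faulty, compute_row_bypassing([1, 2, 3], lambda x: x * 10, faulty))
```

In this model the element lost during the first execution is always the one originally
aligned with the MAC directly above the faulty one, which is why a single unshifted
second execution suffices for any s.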
While the current control hardware is only able to work around single faulty MACs,
the same shifting logic is capable of performing data rotations to prevent the use of any
number of faulty MACs up to s, i.e. all-but-one unavailable. Again, only latency would
be affected by such occurrences. Latency would scale proportionately to the number of
faults that need to be tolerated, since multiple faults will require multiple data shifts to
be routed around.

[Figure 4.2 diagram not reproduced: for each of the cases MAC 1 faulty, MAC 2 faulty and
MAC 3 faulty, the positions of Br inputs and Cf outputs relative to MACs 1 to 3 are shown
for Step 1 and Step 2.]
Figure 4.2: Resource reallocation for s = 2 with single fault using circular shifters. Br
input data (signified by boxed numbers which correspond to their original
places) is captured in the circular shifter inserted onto the path for Br,
rotated and fed to the MACs (signified by circled numbers) for processing. The
mirrored equivalent occurs on the output side for Cf.
It should be emphasised that the accelerator does not retain state between computations.
For this reason, the hardware defaults to its normal, fault-unaware mode at the beginning
of every computation: only if one or more errors are detected does it run in a fault-
bypassing state. A more practical implementation might retain error counts for each MAC
and, once one or more of those counts cross a predetermined threshold value, prevent
any use of the associated MACs from that point onward, thus decreasing the average
computation time.
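The per-MAC error counting suggested above was not implemented in this work; the Python
sketch below merely illustrates one way such bookkeeping might look. The class name, the
choice of a ‘greater than or equal to’ comparison and the threshold value are all
assumptions.

```python
class MacErrorTracker:
    """Hypothetical per-MAC error counter with a retirement threshold."""

    def __init__(self, num_macs, threshold):
        self.counts = [0] * num_macs
        self.threshold = threshold

    def record(self, faulty_macs):
        """Increment the count of every MAC implicated by the latest execution."""
        for mac in faulty_macs:
            self.counts[mac] += 1

    def retired(self):
        """MACs whose counts have reached the threshold and should no longer be used."""
        return [mac for mac, count in enumerate(self.counts) if count >= self.threshold]


# Example usage: MAC 1 is implicated in two executions and is then retired.
tracker = MacErrorTracker(num_macs=3, threshold=2)
tracker.record([1])
tracker.record([1])
print(tracker.retired())
```

Retiring a MAC in this way would allow the accelerator to start subsequent computations
directly in its fault-bypassing state rather than rediscovering the fault each time.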
4.2.2 Partial Routing Reconfiguration
While the modified accelerator described in Section 4.2.1 made use of additional logic to
dynamically reallocate data to the MACs in order to bypass faults, in this section the
means to achieve the same end result with partial routing reconfiguration are presented.
During accelerator executions in which at least one error is detected, fault location data is
sent back to the controlling software driver in order to facilitate corrective action. Based
upon the locations of faults observed and, conversely, the locations of remaining functional
MACs, one or more rounds of routing reconfiguration followed by accelerator executions,
together called ‘corrective executions,’ can be performed in order to re-establish accurate
operation.

[Figure 4.3 diagram not reproduced. Labelled elements: Br or Cf input (n(s + 1)), Shift
and Capture controls, s + 1 enabled registers, and Br or Cf output (n(s + 1)).]
Figure 4.3: Circular shifter, capable of rotating data by x places in x clock cycles and used
to allow dynamic resource reallocation.
Routing reconfiguration, rather than dynamic relocation of MACs themselves, was chosen
for three reasons:
• The datapath (MACs) represents the vast majority—well over 90% in designs with
higher s—of the area consumed by the accelerator.
• Routing bitstreams are small, as is quantified in Section 4.3.3, and so can be applied
more quickly (and, consequently, frequently) than those for MAC reconfiguration
could be.
• The reservation of regions of fabric (to accommodate replacement MACs) is rendered
unnecessary, allowing full use of the available resources.
In order to facilitate routing reconfiguration via DPR, several modifications needed
to be made to the system described in Section 3.2 and accompanying block diagram,
Figure 3.1. These are shown in Figure 4.4. A region of the accelerator—represented by a
dashed rectangle—is made reconfigurable. Reconfiguration is handled by one of the hard
Acorn reduced instruction set computer machine (ARM) central processing unit (CPU)
cores on the Zynq system-on-chip (SoC) through the processor configuration access port
(PCAP) [73].
[Figure 4.4 diagram not reproduced. Labelled blocks, divided between the PS and PL:
DRAM, ARM core, DRAM controller, interrupt controller, configuration port, AXI4-Lite
interface, AXI4 interface, DMA controller, two memory controllers and the accelerator.]
Figure 4.4: System block diagram with DPR. A region of the accelerator—represented by
a dashed rectangle—is made reconfigurable. Reconfiguration is handled by one
of the hard ARM CPU cores on the Zynq SoC through the PCAP [73].
Reconfiguration is considerably slower than modifying multiplexer (MUX) addressing,
even for small, partial bitstreams, due to the need to set up and execute a memory transfer
each time a new configuration is to be loaded. For this reason, the datapath shown in
Figure 4.1 was modified further, as shown in Figure 4.5, to relocate the checksum
verification logic such that it took its source from the output random-access memory (RAM)
rather than the (buffered) MAC outputs. With this arrangement, partially faulty outputs
need only be partially overwritten, thus saving reconfiguration cycles. Whereas previously
an s = 2 accelerator required six changes in MUX addressing to route around a single
fault (two changes per row computed) during an execution, only two reconfigurations
are needed with this arrangement as the routing can stay the same while correcting all
errors caused by that fault before being reset prior to the next computation.
The relocation of checksum verification logic also necessitated the expansion of the
output RAM. Whereas until now the output RAM only needed to store the s × s output
matrix C, not the output checksum elements, the output RAM shown in Figure 4.5 needs
to store the full (s + 1) × (s + 1) output matrix Cf. To achieve this with zero impact upon
the software, which expects to receive only C as output, a hybrid RAM was designed.
Within it, the first s rows and columns are stored in a dual-ported RAM, intended to
be implemented in block random-access memory (BRAM) as before, whose read port is
accessible from the processor subsystem (PS) via a memory controller. Two small RAMs,
intended to be implemented in distributed RAM, were added to store the elements within
the (s + 1)th row and column of the output matrix. Control logic was added such that,
from the accelerator side, all three RAMs appeared to be a single, contiguous storage block
for the entirety of Cf.
[Figure 4.5 diagram not reproduced. Labelled blocks: input RAM (2s × ns), checksum
generation logic, Ac (n), Br (n(s + 1)), s + 1 MACs (× +), Cf (n(s + 1)), output RAM
((s + 1) × n(s + 1)) and checksum verification logic.]
Figure 4.5: ABFT-enabled datapath with DPR. The input and output halves of the recon-
figurable partition are shown as dashed rectangles. The movement of check-
sum verification logic from the input to output side of the output RAM is also
shown, where the output RAM now stores the full (s + 1) × (s + 1) output
matrix Cf.
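The hybrid output RAM described above presents three physical memories as one block to
the accelerator. The Python sketch below shows one possible address decode. The memory
names, the row-major layout of the main RAM and the placement of the corner element
Cf[s][s] in the row RAM are assumptions made for illustration, since those details are
not specified here.

```python
def decode_cf_address(i, j, s):
    """Map element (i, j) of the (s + 1) x (s + 1) matrix Cf to (ram_name, address)."""
    if not (0 <= i <= s and 0 <= j <= s):
        raise IndexError("element lies outside Cf")
    if i < s and j < s:
        # Elements of C itself: main dual-ported RAM, also readable by the PS.
        return ("bram", i * s + j)
    if i == s:
        # (s + 1)th row, including the corner element (assumed placement).
        return ("row_ram", j)
    # (s + 1)th column, excluding the corner element.
    return ("col_ram", i)


# Example usage for s = 2: list where each of the nine elements of Cf is stored.
for row in range(3):
    print([decode_cf_address(row, col, 2) for col in range(3)])
```

With a decode of this kind the software continues to read only the main RAM, and so
sees C exactly as it did before the checksum verification logic was relocated.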
At compile-time, routing configurations representing different amounts of data-shifting
are compiled along with the rest of the design, which remains static. Nets Br and Cf—
shown in Figure 4.5—are broken and routed via a single reconfigurable partition, which
then dictates the data connections on both the input and output sides of the accelerator’s
datapath. Figure 4.6 shows the configurations available for the multiplier when s = 2.
Circled numbers represent MACs, while the input and output halves of the reconfigurable
partition are shown as dashed rectangles. In each case, the output shifting arrangement
mirrors that on the input side.
[Figure 4.6 diagram not reproduced: three panels, labelled zero-place shift, single-place
shift and double-place shift, each showing MACs 1 to 3.]
Figure 4.6: Routing configurations available for s = 2. Circled numbers represent MACs,
while the input and output halves of the reconfigurable partition are shown as
dashed rectangles. In each case, the output shifting arrangement mirrors that
on the input side.
When routing reconfiguration is required, the accelerator’s driver initiates a partial
bitstream transfer, via the PCAP, from dynamic random-access memory (DRAM) to
the FPGA fabric. In order to lower the total number of configurations required, only
configurations with equal-place shifting per MAC are generated at compile-time. The
number of configurations stored for each accelerator is therefore s+ 1.
The driver supports two levels of safety for error correction. When operating in the
safer mode, all incorrectly computed columns of Cf are recalculated, after which checksum
verification is repeated to confirm successful correction. In the less safe mode, the ABFT
mechanism is essentially turned off: the (s + 1)th MAC becomes a usable spare and the
output is assumed to be accurate once all corrective executions complete. As a consequence
of this, faults that affect only the (s + 1)th MAC are ignored in the less safe mode. The
choice made between these modes as part of a larger application would be based upon the
likelihood of additional faults developing in different MACs during the time it takes to
complete a corrective execution. Note that re-transfer of input data and input checksum
regeneration are not required in either mode.
Figure 4.7 demonstrates the application of routing reconfiguration in order to bypass
a single faulty MAC—labelled 2—when s = 2. Intuitively, one corrective execution is
required to overwrite the second column’s elements. A single-place shift allows MAC 3 to
perform the recalculation required.
[Figure 4.7 diagram not reproduced: two panels, labelled Step 1 and Step 2, each showing
MACs 1 to 3.]
Figure 4.7: Resource reallocation for s = 2 with single fault using DPR, demonstrating
the application of routing reconfiguration in order to bypass a single faulty
MAC—labelled 2—when s = 2. A single-place shift allows MAC 3 to perform
the recalculation required.
In cases of multiple faults, differing amounts of data-shifting are required. This is
exemplified in Figure 4.8, in which six different combinations of double-fault locations are
shown for s = 4. In the three leftmost cases, one single-place shift is required, while in
the three rightmost cases, one double-place shift is required. Curved arrows represent the
reallocation of resources necessary during a corrective execution.
Intuition may suggest that the number of corrective executions required is only depen-
dant upon the ratio of faulty to functional MACs. When s ∈ {2, 4}, this is indeed true, but
for s ≥ 8 the situation is more complicated since there are cases in which a configuration
75
Single-place shift required
1
2
3
4
5
1
2
3
4
5
1
2
3
4
5
Double-place shift required
1
2
3
4
5
1
2
3
4
5
1
2
3
4
5
Figure 4.8: Example double-fault routing reconfigurations when s = 4. In the three left-
most cases, one single-place shift is required, while in the three rightmost cases,
one double-place shift is required. Curved arrows represent the reallocation of
resources necessary during a corrective execution.
with an equal-place shift per MAC can no longer match all faulty MACs to remaining
functional ones. In Figure 4.9, where s = 8, six combinations of quadruple-fault locations
are shown. In the three leftmost cases, only a single corrective execution is required; in
the three rightmost cases, however, two are needed: resource reallocations which cannot
be performed in the first execution are represented by dashed lines.
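
The matching problem described above can be sketched in Python. This is only an illustration under assumptions not stated in this section: MACs are numbered from 0, a shift of k places hands the work of MAC m to MAC (m + k) mod (s + 1) (a circular shift), every MAC uses the same k within one corrective execution, and a faulty MAC's column counts as corrected once any execution maps it onto a functional MAC. The function name `corrective_executions` is hypothetical.

```python
from itertools import combinations


def corrective_executions(num_macs, faulty):
    """Return (count, shifts): the fewest uniform circular shifts that map
    every faulty MAC onto a functional one (assumed model, see lead-in)."""
    faulty = set(faulty)
    if not faulty:
        return 0, ()
    if len(faulty) == num_macs:
        raise ValueError("no functional MACs remain")

    def covered(shift):
        # Faulty MACs whose work lands on a functional MAC under this shift.
        return {m for m in faulty if (m + shift) % num_macs not in faulty}

    shifts = range(1, num_macs)
    for count in range(1, num_macs):
        for combo in combinations(shifts, count):
            if set().union(*(covered(k) for k in combo)) == faulty:
                return count, combo
    raise AssertionError("unreachable while a functional MAC exists")


# s = 4 (five MACs): adjacent faults need one double-place shift.
print(corrective_executions(5, {0, 1}))
# s = 8 (nine MACs): this quadruple fault cannot be bypassed with one shift.
print(corrective_executions(9, {0, 1, 3, 4}))
```

Under this model the fault set {0, 1, 3, 4} needs two executions because every non-zero shift maps at least one faulty MAC onto another faulty MAC, which is one way the equal-ratio cases can differ.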
[Figure 4.9 diagram: two groups of three panels, labelled "1 corrective run needed" and "2 corrective runs needed", each panel showing MACs 1 to 9.]
Figure 4.9: Example quadruple-fault routing reconfigurations when s = 8, demonstrating
cases where the same number of faults require differing numbers of corrective
executions to be bypassed.
4.3 Overheads
Experiments were performed to assess the impacts of adding ABFT-informed error cor-
rection, implemented both with additional logic and DPR, to the fault-intolerant, baseline
accelerator in terms of area and performance. All designs were compiled using version
14.7 of Xilinx’s Integrated Synthesis Environment (ISE) toolchain. The following range of
implementation variables was used:
• Target device: Xilinx Zynq-7000 XC7Z020.
• s: {2, 4, 8, 16, 32}.
• Data width (n) (bits): 32 (signed, fixed-point).
• Data memory resource type: BRAM.
• Checksum memory resource type: distributed RAM.
• Multiplier resource type: digital signal processing block (DSP).
• Multiplier latency (m) (cycles): 15.
• Accumulator latency (a) (cycles): 1.
4.3.1 Area
Table 4.1 contains the raw resource usage figures obtained for all implementations—
including the fault-intolerant, fault-tolerant via additional logic and fault-tolerant via DPR
versions of the accelerator—per resource type. Percentages of the total number of each of
these resources for the target device are also included, along with a mean of the individual
proportions to give an indication of the overall resource utilisation. Figures 4.10 and 4.11
present a visual summary of the resource usage data, with the former showing overheads
of individual resources and the latter overheads across all resource types.
With the exception of s = 16, it is clear from Figure 4.11 that the DPR-shifting accel-
erator performs better than its additional logic-shifting counterpart for overall resource
usage across the range of s tested. Since BRAM and DSP usage are identical between
the two versions, look-up table (LUT) and flip-flop (FF) counts are responsible for all
differences in utilisation. As expected, FF overhead for the DPR design decreases pro-
portionally as s increases thanks to the elimination of the circular shifters present in the
additional logic-shifting version. Conversely, LUT overhead tends to increase slightly; this
is due to LUTs being used to implement distributed RAMs for output checksum storage
as described in Section 4.2.2. For the largest-tested design, s = 32, the DPR-shifting
accelerator achieved an overall area overhead of 10.1%—17.7% lower than its additional
logic-shifting equivalent. Between these two fault-tolerant designs, FF overhead decreased
by 77.5% while LUT overhead increased by 7.7%.
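
The arithmetic behind the FF and LUT comparison can be reproduced with a short Python sketch. The counts are transcribed from the s = 32 rows of Table 4.1; the function names are hypothetical, and the formula (overhead relative to the baseline, then the relative change between the two overheads) is an assumption about how the quoted percentages were derived.

```python
# Raw LUT and FF counts for s = 32, transcribed from Table 4.1.
COUNTS = {
    "baseline": {"LUT": 2115, "FF": 3105},
    "additional logic": {"LUT": 4203, "FF": 5643},
    "DPR": {"LUT": 4363, "FF": 3675},
}


def overhead(design, resource):
    """Percentage increase in a resource over the unprotected baseline."""
    base = COUNTS["baseline"][resource]
    return 100.0 * (COUNTS[design][resource] - base) / base


def relative_change(resource):
    """Percentage change in overhead when moving from additional logic to DPR."""
    logic = overhead("additional logic", resource)
    dpr = overhead("DPR", resource)
    return 100.0 * (dpr - logic) / logic


for resource in ("LUT", "FF"):
    print(resource,
          round(overhead("additional logic", resource), 1),
          round(overhead("DPR", resource), 1),
          round(relative_change(resource), 1))
```

With these inputs the relative changes round to the 7.7% LUT increase and 77.5% FF decrease quoted above.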
[Figure 4.10 plots: ∆ area (%), on a logarithmic scale, against matrix size s ∈ {2, 4, 8, 16, 32} for the additional logic and DPR designs. Panels: (a) LUT, (b) FF, (c) BRAM, (d) DSP.]
Figure 4.10: ABFT-protected accelerator with additional logic & DPR error correction
resource usage overhead versus baseline, showing the change in individual
resource utilisations for each design versus its equivalently sized unprotected,
baseline implementation.
4.3.2 Performance
Testing was performed on the hardware in order to measure its impact upon performance
under a number of conditions. Table 4.2 summarises the results of all of these performance
tests. Each test was completed 10,000 times; the mean of these executions is given in all
cases. Prior to each test, new uniformly distributed random input data was generated
to form A and B. Execution times were measured using a cycle-accurate ARM timer
peripheral. In all cases, the FPGA fabric was clocked at 50MHz. Included in Table 4.2 are
execution times for the fault-intolerant multiplier, the fault-tolerant via additional logic-
shifting accelerator and DPR-enabled version running in both of its operating modes.
Where appropriate, latency increases relative to the equivalently sized fault-intolerant
design are given for comparison. Execution times are given for the occurrences of single
and double MAC failures: the former for both the additional logic-shifting and DPR
hardware, and the latter for the DPR version only. Permanent faults were emulated
through the targeted inversion of a single accumulator output bit within either one or
[Figure 4.11 plot: ∆ area (%), on a logarithmic scale, against matrix size s ∈ {2, 4, 8, 16, 32} for the additional logic and DPR designs.]
Figure 4.11: ABFT-protected accelerator with additional logic & DPR error correction
combined resource usage overhead versus baseline, showing the change in
combined resource utilisation for each design versus its equivalently sized
unprotected, baseline implementation.
two MACs per execution, with fault locations also randomly chosen. Plots of the latency
increases over the fault-intolerant hardware under fault-free, singly and doubly faulty
conditions are given in Figure 4.12.
The results show that, for all s > 4, the DPR-shifting accelerator outperforms the
additional logic-shifting version both under normal, fault-free operation and in the
presence of a single failure. When comparing the performance of the two error correction
implementations side-by-side, recall that the routing’s construction is not the only
implementational difference between them. The lower penalties seen during fault-free operation are due
to the relocation of the checksum verification logic from the input to the output side of the
output RAM described in Section 4.2.2, allowing return programmable logic (PL)-to-PS
data transfers to begin (and end) sooner than they had previously, while gains under single
failure mode are realised for larger s as reconfiguration times proportionately fall. The
relationship between the performance plots for the DPR-shifting version working in its two
modes demonstrates the near-fixed performance cost paid by operating more safely. The
trend-reversal seen on all plots after s = 16 can be attributed to data transfer throttling:
[Figure 4.12 plots: ∆ latency (%) against matrix size s ∈ {2, 4, 8, 16, 32}. Panels: (a) Fault-free and (b) Single failure, each with series for additional logic, DPR (less safe) and DPR (more safe); (c) Double failure, with series for DPR (less safe) and DPR (more safe).]
Figure 4.12: ABFT-protected accelerator with additional logic & DPR error correction
latency overhead versus baseline, showing the change in execution time for
each design versus its equivalently sized unprotected, baseline implementation
under a range of fault conditions.
once s passes 16, memory copies begin to dominate accelerator execution as a proportion of
runtime. Performance impacts arising from the use of partial reconfiguration are negligible
due to the bitstreams’ small size and infrequent application per accelerator execution.
For the largest-tested design, s = 32, the DPR-shifting accelerator incurred a 24.5% la-
tency penalty under fault-free operation—46.1% lower than its additional logic-shifting
equivalent.
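
The latency penalties quoted for s = 32 follow directly from the mean execution times in Table 4.2. The sketch below, with hypothetical function names, assumes that each penalty is the percentage increase over the fault-free baseline time and that the comparison between designs is the relative difference between two such penalties.

```python
# Mean execution times (µs) for s = 32, transcribed from Table 4.2.
BASELINE_US = 3100
TIMES_US = {
    ("additional logic", "fault-free"): 4510,
    ("DPR (less safe)", "fault-free"): 3860,
    ("DPR (less safe)", "single failure"): 4680,
    ("DPR (less safe)", "double failure"): 5490,
    ("DPR (more safe)", "single failure"): 5440,
    ("DPR (more safe)", "double failure"): 6260,
}


def latency_penalty(design, condition):
    """Percentage increase in execution time over the unprotected baseline."""
    return 100.0 * (TIMES_US[(design, condition)] - BASELINE_US) / BASELINE_US


def penalty_reduction(condition="fault-free"):
    """Relative reduction in penalty of DPR (less safe) versus additional logic."""
    logic = latency_penalty("additional logic", condition)
    dpr = latency_penalty("DPR (less safe)", condition)
    return 100.0 * (logic - dpr) / logic


for key in TIMES_US:
    print(key, round(latency_penalty(*key), 1))
print("reduction:", round(penalty_reduction(), 1))
```

The same function applied to the single and double failure columns gives the faulty-condition penalties for either operating mode.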
Timing model-inferred maximum operating frequency (fmax) changes are not well cor-
related, likely due to the stochastic nature of the placement and routing tools used, al-
though decreases between the additional logic- and DPR-shifting designs, likely due to
path-lengthening incurred through the reconfigurable partition, are seen for larger s.
4.3.3 Memory
From a software perspective, the primary overhead of the DPR-based fault tolerance strat-
egy is partial bitstream storage. Since accelerator data and bitstream transfers, as well as
accelerator executions, are interrupt-driven, their impacts upon CPU performance are neg-
ligible. Table 4.3 summarises the DRAM storage requirements for each value of s tested.
The size of each partial bitstream is given along with the total storage requirement for
that value of s. The memory occupation is also expressed, for each s, as a proportion of
the DRAM available (512MB) on the development board used, an Avnet ZedBoard [74].
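
A rough check of the storage figures can be sketched as follows. Two points are assumptions rather than statements from the text: the totals in Table 4.3 are consistent with s + 1 partial bitstreams being stored for each s, and the percentages are consistent with binary units (512 MB = 512 × 1024 kB). The function name `storage` is hypothetical.

```python
DRAM_KB = 512 * 1024  # 512 MB expressed in kB, assuming binary units

# Size of each partial bitstream (kB) per matrix size s, from Table 4.3.
EACH_KB = {2: 15.2, 4: 29.4, 8: 43.6, 16: 87.1, 32: 158}


def storage(s):
    """Return (total kB, percentage of DRAM), assuming s + 1 bitstreams."""
    total_kb = (s + 1) * EACH_KB[s]
    return total_kb, 100.0 * total_kb / DRAM_KB


for s in sorted(EACH_KB):
    total_kb, percent = storage(s)
    print(s, round(total_kb, 1), f"{percent:.3g}%")
```

Small differences from the tabulated totals are expected because the per-bitstream sizes are themselves rounded.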
4.4 Conclusion
In this chapter, two error correction strategies using additional logic and DPR to achieve
runtime resource reallocation were presented, both of which are capable of routing data
around resources found to be faulty at runtime by ABFT error detection circuitry. Imple-
mentational details were described first, followed by results of experiments performed to
assess the hardware’s overheads in terms of area, performance and, for the fault-tolerant
accelerator using DPR, memory utilisation.
For the largest-implemented design, capable of multiplying pairs of 32×32 matrices with
an inner loop-unrolled accelerator, area overheads of 12.4% and 10.1% were encountered
through the use of additional logic and DPR for achieving resource reallocation, respectively;
a 17.7% reduction for the latter. The additional logic-shifting design with the same
s experienced a 45.5% latency penalty during fault-free operation, while its DPR-shifting
version achieved 24.5%; a 46.2% reduction, suggesting that DPR betters the use of addi-
tional logic in terms of both area and performance for larger-sized accelerators. Again for
s = 32, mean performance penalties found under faulty conditions were 51.0% and 77.1%
over the baseline, fault-free execution time for single and double faults, respectively, for
operating in the less safe mode. When operating in the more safe mode, those figures rose
to 75.5% and 102%, respectively.
The work presented in this chapter has demonstrated that error detection and correction
based upon ABFT, and in particular the combination of ABFT and DPR, is a powerful
contender for low-overhead hardware fault tolerance. The application of error detection at
lower levels of precision, thereby introducing an area-to-allowed error tradeoff, is explored
in Chapter 6.
Matrix   ABFT      Fault              Resource type                                               Total
size s   enabled   avoidance          LUT             FF              BRAM          DSP
2        ✗         -                  239 (0.449%)    210 (0.197%)    2 (0.714%)    6 (2.73%)     1.02%
2        ✓         Additional logic   665 (1.25%)     597 (0.561%)    5 (1.79%)     9 (4.09%)     1.92%
2        ✓         DPR                711 (1.34%)     406 (0.382%)    5 (1.79%)     9 (4.09%)     1.90%
4        ✗         -                  441 (0.829%)    406 (0.382%)    8 (2.86%)     12 (5.45%)    2.38%
4        ✓         Additional logic   855 (1.61%)     945 (0.888%)    11 (3.93%)    15 (6.82%)    3.31%
4        ✓         DPR                898 (1.69%)     620 (0.583%)    11 (3.93%)    15 (6.82%)    3.25%
8        ✗         -                  604 (1.14%)     794 (0.746%)    16 (5.71%)    24 (10.9%)    4.63%
8        ✓         Additional logic   2037 (3.83%)    1625 (1.53%)    19 (6.79%)    27 (12.3%)    6.10%
8        ✓         DPR                1375 (2.58%)    1036 (0.974%)   19 (6.79%)    27 (12.3%)    5.65%
16       ✗         -                  613 (1.15%)     1566 (1.47%)    30 (10.7%)    48 (21.8%)    8.79%
16       ✓         Additional logic   1674 (3.15%)    2970 (2.79%)    33 (11.8%)    51 (23.2%)    10.2%
16       ✓         DPR                2273 (4.27%)    1874 (1.76%)    33 (11.8%)    51 (23.2%)    10.3%
32       ✗         -                  2115 (3.98%)    3105 (2.92%)    58 (20.7%)    96 (43.6%)    17.8%
32       ✓         Additional logic   4203 (7.90%)    5643 (5.30%)    61 (21.8%)    99 (45.0%)    20.0%
32       ✓         DPR                4363 (8.20%)    3675 (3.45%)    61 (21.8%)    99 (45.0%)    19.6%
Table 4.1: Baseline & ABFT-protected accelerator with additional logic & DPR error cor-
rection resource usage, containing the raw resource usage figures obtained for
all implementations. Percentages of the total number of each of these resources
for the target device are also included, along with means of those proportions
to give an indication of the overall resource utilisation.
Matrix   ABFT      Fault              Execution time (µs)                           fmax
size s   enabled   avoidance          Fault-free   Single failure   Double failure  (MHz)
2        ✗         -                  254          -                -               90.204
2        ✓         Additional logic   272          300              -               88.992
2        ✓         DPR (less safe)    280          366              451             95.712
2        ✓         DPR (more safe)    280          392              477             95.712
4        ✗         -                  314          -                -               80.103
4        ✓         Additional logic   339          398              -               88.168
4        ✓         DPR (less safe)    351          448              544             82.102
4        ✓         DPR (more safe)    351          486              581             82.102
8        ✗         -                  348          -                -               76.941
8        ✓         Additional logic   546          712              -               77.042
8        ✓         DPR (less safe)    422          557              690             85.918
8        ✓         DPR (more safe)    422          631              764             85.918
16       ✗         -                  497          -                -               50.495
16       ✓         Additional logic   1350         1910             -               56.850
16       ✓         DPR (less safe)    710          982              1254            52.062
16       ✓         DPR (more safe)    710          1195             1467            52.062
32       ✗         -                  3100         -                -               58.156
32       ✓         Additional logic   4510         6600             -               55.857
32       ✓         DPR (less safe)    3860         4680             5490            53.101
32       ✓         DPR (more safe)    3860         5440             6260            53.101
Table 4.2: ABFT-protected accelerator with additional logic & DPR error correction
performance. Averaged execution times and maximum operating frequencies
achieved are shown for each design to allow side-by-side comparison of unpro-
tected and the range of protected implementations.
Matrix   Bitstream size (kB)
size s   Each     Total
2        15.2     45.7 (0.00871%)
4        29.4     147 (0.0281%)
8        43.6     393 (0.0749%)
16       87.1     1480 (0.282%)
32       158      5220 (0.995%)
Table 4.3: ABFT-protected accelerator with DPR error correction bitstream storage re-
quirements, summarising the DRAM partial bitstream storage requirements for
each s tested. The size of each partial bitstream is given along with the total
storage requirement, absolute and proportional, for that value of s.
5 Fault Observability for Matrix & DSP
Operations
5.1 Introduction
In this chapter, work completed to generalise the fault observability testing introduced in
Chapter 3 is presented. Three common matrix manipulation algorithms, each highly suited
to hardware acceleration, one of which is representative of linear filtering operations,
are studied in detail from an implementational perspective to gauge their susceptibility to,
observability of and recoverability from faults occurring within their datapaths. Results
presented herein capture the impacts of differing operating conditions along with the range
of parameters available to be specified by a hardware designer. Rather than being applied
by hand, these results could equally be used by an automated design tool capable of creating
low-overhead fault-tolerant hardware from high-level functional descriptions, allowing it to
make informed decisions. In all cases in this chapter, an inner loop-unrolled (i.e. (s + d)
parallel multiply-accumulators (MACs) or adders, depending on the operator) hardware
architecture is assumed.
5.1.1 Contributions
The original contributions of the work presented in this chapter are:
• A software framework for fault simulation in hardware-accelerated linear algebra
operators protected with algorithm-based fault tolerance (ABFT).
• A thorough analysis of the fault tolerance of three benchmark ABFT-protected op-
erators.
• The first consideration of distance-x, for x > 2, ABFT application in custom logic.
5.1.2 Outline
The remainder of this chapter is organised as follows. Section 5.2 details the fault in-
jection simulation method devised for ascertaining fault observability. Sections 5.3, 5.4
and 5.5 describe the results of the fault observability testing performed for three target
operations: matrix-matrix multiplication (Section 5.3), matrix addition (Section 5.4) and
matrix-vector multiplication (Section 5.5). In each of the latter three sections, results are
presented and analysed across a wide range of variables. The chapter is summarised in
Section 5.6.
5.2 Method
Software simulations were performed to assess the fault observabilities of several operators
across the following range of variables:
• Operator: {matrix-matrix multiplication, matrix addition, matrix-vector multipli-
cation}.
• Fault type: {permanent, transient}.
• Number of simultaneous faults (f): {1, 2, 3}.
• d: {1, 2, 3}.
• s (or vector length): {2, 4, 8, 16, 32}.
• Data width (n) (bits): {2, 4, 8, 16, 32}.
For each combination of these, the steps detailed in either Algorithm 7 (for matrix-matrix
multiplication), 10 (for matrix-matrix addition) or 11 (for matrix-vector multiplication)
were repeated 960,000 times.
In all cases, the fault model applied was that of individually targeted stuck-at-one (SA1)
accumulator (for multiplicative operations) or adder (for additive operations) output bits. For
permanent faults, such SA1s are representative of, for example, worn transistors or bridged
interconnects, while they mimic effects including register and memory upsets in the case
of transients.
Results gleaned from this testing were, as is the case in all of the fault observability
testing performed in this thesis, independent of fault rate, area and latency: they demon-
strate the proportions of total accelerator executions that should be expected to result in
particular classes of outputs under fixed fault conditions. They cannot be used to directly
ascertain the expected rates of certain output classes’ occurrence, although this can be
achieved by scaling the proportions to take fault rate, area and/or latency, as required,
into account.
Algorithm 7 details the steps performed for each matrix-matrix multiplication simula-
tion. The procedure is largely similar to that shown in Algorithm 5 in Section 3.4, but
expanded to support transient as well as permanent faults—selectable via the fault_type
variable—and generalised for d. Changes were also made to allow for more detailed re-
sult classification. Following the creation of randomly filled matrices A and B, reference
matrix Cref—containing the correct result—is created (Line 5) and calculated (Line 6).
Checksumming is performed on A and B to create Ac and Br (Lines 9 and 10) with d
additional rows (for Ac) or columns (for Br) added as shown in Section 2.7.1. Once Cf
is provisioned (Line 11), control splits depending on whether fault_type indicates that
permanent (from Line 12) or transient (from Line 22) fault emulation is required. For
permanent faults, a one-dimensional (vector) bit mask is created (Line 13) to represent
the faults. This is an (s + d)-element array of n-bit zero-initialised values with f ‘1’s
randomly scattered throughout. Computation proceeds with column j’s bit mask applied
both before (Line 16) and during (Line 18) each multiply-accumulation. Transient faults
each affect a single bit during a single multiply-accumulation step only, so a larger bit
mask is created (Line 23) to represent them. This is an ((s + d) × (s + d) × (s + d))-element
three-dimensional array; hence the 3 in the procedure call. Calculation proceeds with
the bit mask value for the particular row, column and multiply-accumulation step applied
before (Line 26) and during (Line 28) each iteration. Note that the zeroth elements in the
outermost dimension correspond to faults that occur during the resetting of variables be-
fore accumulation commences. Following fault-emulated computation, the data elements
in Cf are compared with Cref (Line 33) and checksums verified (Line 34) in order to
classify the result by calling procedure classify_result() (Line 35). Finally, checksums are
compared to the faulty_columns array created in order to determine whether or not faults
were successfully located (Line 36).
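
The permanent-fault path described above can be approximated by the following Python sketch. It is not Algorithm 7 itself: it handles d = 1 only, the names `add_checksums`, `faulty_matmul` and `verify` are hypothetical, and the checksum construction (one row of column sums appended to A, one column of row sums appended to B, all modulo 2^n) is assumed, since Section 2.7.1 is not reproduced here.

```python
def add_checksums(A, B, n):
    """Append a column-sum row to A and a row-sum column to B (modulo 2**n)."""
    mod = 1 << n
    s = len(A)
    Ac = A + [[sum(A[i][k] for i in range(s)) % mod for k in range(s)]]
    Br = [row + [sum(row) % mod] for row in B]
    return Ac, Br


def faulty_matmul(X, Y, n, mask):
    """Multiply X by Y modulo 2**n. MAC j computes output column j and has
    the bits set in mask[j] stuck at one throughout the multiplication."""
    mod = 1 << n
    inner, width = len(Y), len(Y[0])
    C = []
    for row in X:
        out = []
        for j in range(width):
            acc = mask[j]  # stuck bits stay high after the accumulator reset
            for k in range(inner):
                acc = ((acc + row[k] * Y[k][j]) % mod) | mask[j]
            out.append(acc)
        C.append(out)
    return C


def verify(Cf, C_ref, n):
    """Return (data_ok, checksums_ok, columns with a failing column checksum)."""
    mod = 1 << n
    s = len(C_ref)
    data_ok = all(Cf[i][:s] == C_ref[i] for i in range(s))
    bad_rows = {i for i, r in enumerate(Cf) if sum(r[:s]) % mod != r[s]}
    bad_cols = {j for j, c in enumerate(zip(*Cf)) if sum(c[:s]) % mod != c[s]}
    return data_ok, not (bad_rows or bad_cols), bad_cols


A = [[1, 0], [0, 1]]
B = [[1, 2], [3, 4]]
n = 4
C_ref = faulty_matmul(A, B, n, [0, 0])          # fault-free reference
Ac, Br = add_checksums(A, B, n)
Cf = faulty_matmul(Ac, Br, n, [0b0001, 0, 0])   # bit 0 of MAC 0 stuck at one
print(verify(Cf, C_ref, n))
```

In this small example the only failing column checksum belongs to the faulty MAC, which is the one-to-one mapping the locatability test looks for.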
Procedure gen_bit_mask() is detailed in Algorithm 8. Depending on whether
dimensions is 1, 2 or 3, a one-, two- or three-dimensional array of n-bit values, with
s + d elements per dimension, is created. Initially these values are zero, but f ‘1’s
are scattered randomly throughout them to represent faults. In multiply-accumulation
with permanent fault emulation, for example, each ‘1’ is representative of a single bit
of a single MAC’s registers remaining high throughout a complete multiplication. For
transient fault emulation with the same operator, however, each ‘1’ represents a single
bit of a MAC register being forced high for a single clock cycle.
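
A mask generator of this kind might look like the sketch below. It is an illustration rather than a transcription of Algorithm 8: the argument order, the `rng` parameter and the use of nested Python lists are choices made here, and the f set bits are assumed to occupy distinct positions.

```python
import random


def gen_bit_mask(dimensions, s, d, n, f, rng=random):
    """Return a nested list with (s + d) entries per dimension, holding n-bit
    words that are zero apart from f randomly placed '1' bits."""
    size = s + d
    total_words = size ** dimensions
    flat = [0] * total_words
    # Pick f distinct bit positions across all words (assumed distinct).
    for position in rng.sample(range(total_words * n), f):
        flat[position // n] |= 1 << (position % n)

    def fold(words, dims):
        # Fold the flat word list into `dims` nested dimensions.
        if dims == 1:
            return words
        step = len(words) // size
        return [fold(words[i * step:(i + 1) * step], dims - 1)
                for i in range(size)]

    return fold(flat, dimensions)


rng = random.Random(0)
permanent_mask = gen_bit_mask(1, s=4, d=1, n=8, f=2, rng=rng)
transient_mask = gen_bit_mask(3, s=4, d=1, n=8, f=2, rng=rng)
print(permanent_mask)
```

A one-dimensional mask then has one word per MAC, while a three-dimensional mask has one word per row, column and multiply-accumulation step.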
Algorithm 9 details the steps involved in procedure classify_result(), which are
independent of the operation performed. This procedure serves to classify the result,
as explained in Section 2.7.3, as either detected, erroneous (false positive), missed (false
negative) or undetectable (masked).
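
The four classes can be read as a truth table over two observations: whether the output data matches the reference and whether the checksums verify. The sketch below encodes that reading. Section 2.7.3 is not reproduced here, so this mapping is an assumption drawn from how the classes are discussed in this chapter, and the boolean interface is hypothetical.

```python
def classify_result(data_ok, checksums_ok):
    """Map two observations onto the four result classes (assumed mapping)."""
    if data_ok:
        # Correct data: either nothing observable happened, or a checksum
        # element was corrupted and raises an unnecessary alarm.
        return "masked" if checksums_ok else "false positive"
    # Incorrect data: either the checksums expose it or it goes unnoticed.
    return "false negative" if checksums_ok else "detected"


for data_ok in (True, False):
    for checksums_ok in (True, False):
        print(data_ok, checksums_ok, classify_result(data_ok, checksums_ok))
```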
Algorithm 10 shows the steps taken to emulate faulty matrix addition. The layout is
identical to that in Algorithm 7, with relatively minor changes made to suit the different
operator. Element-wise addition is performed modulo 2^n (Line 6) to produce Cref, with
d rows and columns of checksums added to both A and B to form Af and Bf (Lines 9
and 10), as exemplified in Section 2.7.2. Under transient fault injection, a two-, rather
than three-, dimensional bit mask array is created (Line 20) and applied (Line 23) since
matrix addition is a two-, rather than three-, loop procedure.
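
For matrix addition, a transient fault can be sketched with a two-dimensional mask applied once per output element. As before, this is an illustration for d = 1 only, with hypothetical names and an assumed checksum layout (one row-sum column and one column-sum row per operand, modulo 2^n).

```python
def with_checksums(M, n):
    """Append a row-sum column, then a column-sum row, modulo 2**n."""
    mod = 1 << n
    rows = [row + [sum(row) % mod] for row in M]
    rows.append([sum(col) % mod for col in zip(*rows)])
    return rows


def faulty_add(A, B, n, mask):
    """Element-wise addition modulo 2**n; mask[i][j] holds the adder output
    bits forced high while element (i, j) is being computed."""
    mod = 1 << n
    Af, Bf = with_checksums(A, n), with_checksums(B, n)
    return [[((x + y) % mod) | m for x, y, m in zip(ra, rb, rm)]
            for ra, rb, rm in zip(Af, Bf, mask)]


def checksums_ok(C, n):
    """Check every row and column of C against its checksum element."""
    mod = 1 << n
    s = len(C) - 1
    rows_ok = all(sum(r[:s]) % mod == r[s] for r in C)
    cols_ok = all(sum(c[:s]) % mod == c[s] for c in zip(*C))
    return rows_ok and cols_ok


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
mask = [[0] * 3 for _ in range(3)]
mask[0][0] = 0b0001            # transient SA1 on bit 0 of element (0, 0)
print(checksums_ok(faulty_add(A, B, 4, mask), 4))
```

Forcing a bit that is already high leaves the output unchanged, which corresponds to the masked class.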
The simulation of matrix-vector multiplication is detailed in Algorithm 11. The steps
here are also similar to those in Algorithms 7 and 10, with changes made to suit the
checksumming steps exemplified in Section 2.7.2. s-element vector a is created (Line 1)
rather than a matrix, with checksumming—row-wise—performed only upon B to create
Br (Line 8). Fault-free and -prone outputs cref and cr, created on Lines 5 and 9, respec-
tively, are also vectors rather than matrices. Following multiplication, results are classified
(Line 29) as normal but fault location is not determined since cr, a row vector, does not
contain any column-wise checksums from which to determine location; faults can therefore
only be detected, not located, within a matrix-vector multiplication.
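
The detection-only property can be seen in a small sketch of checksummed matrix-vector multiplication. It assumes d = 1, a row-sum column appended to B, arithmetic modulo 2^n and one MAC per output element; the names `faulty_matvec` and `checksum_ok` are hypothetical.

```python
def faulty_matvec(a, B, n, mask):
    """Compute a x Br modulo 2**n, where Br is B with a row-sum column.
    MAC j produces output element j with the bits in mask[j] stuck at one."""
    mod = 1 << n
    Br = [row + [sum(row) % mod] for row in B]
    c = []
    for j in range(len(Br[0])):
        acc = mask[j]
        for k, a_k in enumerate(a):
            acc = ((acc + a_k * Br[k][j]) % mod) | mask[j]
        c.append(acc)
    return c


def checksum_ok(c, n):
    """The single check available: data elements must sum to the checksum."""
    return sum(c[:-1]) % (1 << n) == c[-1]


a = [1, 2]
B = [[1, 2], [3, 4]]
print(checksum_ok(faulty_matvec(a, B, 4, [0, 0, 0]), 4))
print(checksum_ok(faulty_matvec(a, B, 4, [0b1000, 0, 0]), 4))
```

The check yields one pass or fail outcome for the whole vector, so a failure does not identify which MAC was responsible.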
5.3 Matrix-matrix Multiplication
Figure 5.1 summarises the results of matrix-matrix multiplication simulations performed
under the influence of permanent faults. Proportions of detected faults are displayed
relative to s, f and d; n was found to have little impact upon the results in this case, so
those obtained were averaged across the range of n tested. The plots of detected faults
show that their proportions increase with s and f but decrease with d.
False positives account for the majority of results not classified as detected. n was
also found to be of little significance to these results; it is therefore not shown here. The
prevalence of false positives is largely controlled by the ratio of checksum to total elements
within each matrix row (or column). In the s = 2, f = 1, d = 1 case, for example, this is
1/3—approximately the value of the corresponding point in Figure 5.1. Since d increases
this ratio, false positives become more common as it scales; as f increases, however, they
become scarcer since multiple faults occurring simultaneously are less likely to be masked
by data within the information matrix. That the likelihood of false positives decreases as
s (and consequently area) increases is a desirable outcome. Assuming a steady fault rate,
this implies that proportionally less time will be spent unnecessarily recomputing results
as s scales.
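
The ratio referred to above is d / (s + d) for a row or column holding s data elements and d checksum elements. A short sketch, with a hypothetical function name, shows how it moves with s and d; it describes only this ratio, not the measured false positive rates.

```python
from fractions import Fraction


def checksum_ratio(s, d):
    """Fraction of elements in a checksummed row (or column) that are checksums."""
    return Fraction(d, s + d)


for s in (2, 4, 8, 16, 32):
    print(s, [str(checksum_ratio(s, d)) for d in (1, 2, 3)])
```

The ratio grows with d and shrinks with s, matching the direction of the trends described for false positives.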
The occurrence of false negatives was found to decrease as each of the variables tested increased. f
and d had lower impacts upon the results than s and n, however, so the plots shown are
averaged across f and d. Not shown within the plots due to this averaging is the fact
that no false negatives were encountered across the range of variables tested when f = 1.
This is a powerful result, implying that if single permanent faults can be either bypassed
or repaired before subsequent faults develop, it should be possible to operate indefinitely
without allowing incorrect results to go unnoticed. Zero false negatives were observed
for s = 32 with any n. This also held true for s ≥ 16 with n ≥ 4 and for s ≥ 8 with
n = 32. While the rate is perhaps somewhat high for the smallest s and n, it is encouraging that
for more reasonable values of those variables, i.e. ones for which hardware acceleration is
worthwhile, the likelihood of encountering false negatives due to multiple faults is low. In
the s = 8, n = 8, f = 2, d = 1 case, for example, this was approximately 0.001%.
Finally from Figure 5.1 it can be seen that fewer masked faults were encountered as
s and n increased. f and d are also not reflected here; they were again found to be of
little significance to the results. While their frequencies of occurrence dropped off less
sharply than those of false negatives, no masked faults were observed for s = 32 with any
n or for s = 16 with n = 32. As noted in Section 2.7.3, masked faults are purely data-
dependent: while it may be desirable to observe them from a fault detection perspective,
in none of the experiments performed did their occurrence lead to incorrectly computed data,
nor to any results having to be recalculated.
Figure 5.2 shows the proportions of successfully located results for the same operator
under the same fault injection conditions. Note that the data these plots represent, along
with that for the remainder of located results’ plots in this chapter, was scaled such that
it excluded masked faults. This was done since masked faults represent results that can
[Figure 5.1 plots: proportion of results (%) against matrix size s ∈ {2, 4, 8, 16, 32}. Panels: (a) Detected and (b) False positive, each with one series per combination of f ∈ {1, 2, 3} and d ∈ {1, 2, 3}; (c) False negative and (d) Masked, each with one series per n ∈ {2, 4, 8, 16, 32}.]
Figure 5.1: Matrix-matrix multiplication permanent fault observability. Each plot shows
the proportion of total accelerator executions’ results that led to a particular
class of output for a particular combination of s, n, f and d.
(only) be ignored; they are by definition unlocatable. The results were found to be largely
independent of s and were thus averaged across the range of s tested. This is where the
advantage of d > 1 becomes clear: successful fault location is more likely with multiple
independent checksums. This, coupled with the opposite dependency found for false
positives, whose likelihood scaled with d, implies a tradeoff: larger d allows more
faults to be targeted for repair at the expense of increased unnecessary recomputation.
Finally reflected here is the difficulty of locating simultaneous faults; as f increases, the
proportion of faults able to be located decreases.
The results obtained for matrix-matrix multiplication under transient fault injection,
presented in Figure 5.3, are rather different to those seen with permanent faults. Plots of
detected transient faults, shown here averaged across the range of n tested due to being
largely independent of its value, follow similar patterns to those for detected permanent
faults—proportions increasing with s and f but decreasing with d—but they reach clearly
defined upper limits. The reason for this is hinted at from the masked faults plots. Triv-
ially, individual bits are likely to be masked 50% of the time with uniformly distributed
[Figure 5.2 plot: proportion of results (%) against data width n (bits) ∈ {2, 4, 8, 16, 32}, with one series per combination of f ∈ {1, 2, 3} and d ∈ {1, 2, 3}.]
Figure 5.2: Matrix-matrix multiplication permanent fault locatability. Each plot shows
the proportion of total accelerator executions’ results that gave a one-to-one
mapping between faulty MACs and incorrectly computed column-wise checksums
for a particular combination of s, n, f and d.
data under the influence of single SA1 faults since a bit is 50% likely to already be high.
Hence, for f = 1, masked faults approach an upper limit of 50%. For f = 2 they ap-
proach 25%—expected here since data and fault locations between computations were
both independent—and, similarly, 12.5% for f = 3. The proportion of false positives
was found to decrease with s and f but increase with d, as for permanent faults. Their
dropoffs with f , in particular, are less pronounced. Zero false negatives were encountered
for matrix-matrix multiplication testing under transient fault injection across the range
of variables tested. These results are particularly encouraging, suggesting that transient
fault detection for matrix-matrix multiplication is hampered only by the occurrence of
false positives. In the s = 32, n = 32, f = 1, d = 1 case, however—representative of a
realistically sized implementation—the likelihood of their occurrence is only around 3%.
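
The 50%, 25% and 12.5% limits follow from treating each targeted bit as independently and uniformly distributed, which is an idealisation of the argument above. The sketch below enumerates all values of the f targeted bits and counts those left unchanged when each is forced to one; the function name is hypothetical.

```python
from itertools import product


def masked_fraction(f):
    """Fraction of equally likely bit patterns for which f independent
    stuck-at-one faults change nothing (every targeted bit is already 1)."""
    outcomes = list(product((0, 1), repeat=f))
    masked = sum(all((bit | 1) == bit for bit in bits) for bits in outcomes)
    return masked / len(outcomes)


for f in (1, 2, 3):
    print(f, masked_fraction(f))
```

This reproduces the upper limits only; how closely the simulated proportions approach them depends on s, as the plots show.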
The results of the same operator’s locatability testing during transient fault injection
are shown in Figure 5.4. These are very different to those obtained under permanent fault
injection. In particular, localisation was successful 100% of the time when f = 1; however,
it dropped sharply for f > 1. The data obtained was found to be largely independent of
[Figure 5.3 plots: proportion of results (%) against matrix size s ∈ {2, 4, 8, 16, 32}. Panels: (a) Detected, (b) False positive and (d) Masked, each with one series per combination of f ∈ {1, 2, 3} and d ∈ {1, 2, 3}; (c) False negative, with one series per n ∈ {2, 4, 8, 16, 32}.]
Figure 5.3: Matrix-matrix multiplication transient fault observability. Each plot shows
the proportion of total accelerator executions’ results that led to a particular
class of output for a particular combination of s, n, f and d.
n, so was combined across its range of values. It can be seen that, in the transient fault
case, fault location becomes a more difficult problem with increasing s as well as f , and
that d has little impact upon its success, particularly for larger s.
5.4 Matrix Addition
Figure 5.5 represents the classification of results for matrix addition testing in the
presence of permanent faults. Side-by-side comparison of Figure 5.5 and Figure 5.1, for
matrix-matrix multiplication under the same fault conditions, shows that the results were
similar. Indeed, in the detected, false positive and false negative fault cases this is partic-
ularly true; the only difference of any significance is that the lower and upper bounds seen
for detected and false positive faults were fractionally lower and higher, respectively, for
matrix addition. Note that all plots shown here are presented in the same forms as the cor-
responding plots for matrix-matrix multiplication. The primary dissimilarity between the
two operators’ results can be seen in the plots of masked faults: where for matrix-matrix
[Figure 5.4 plot: proportion of results (%) against matrix size s ∈ {2, 4, 8, 16, 32}, with one series per combination of f ∈ {1, 2, 3} and d ∈ {1, 2, 3}.]
Figure 5.4: Matrix-matrix multiplication transient fault locatability. Each plot shows the
proportion of total accelerator executions’ results that gave a one-to-one mapping
between faulty MACs and incorrectly computed column-wise checksums
for a particular combination of s, n, f and d.
multiplication their likelihood dropped as n grew, here the opposite is true. Despite this,
no masked faults were encountered for s = 32 with any tested value of n.
The results of fault locatability testing for the same operator under the same fault
conditions are presented in Figure 5.6. These are also largely similar to those obtained
for matrix-matrix multiplication with permanent fault injection shown in Figure 5.2; this
was expected since the output matrices for the two operators are of the same form. The
primary difference between the two sets of plots is that upper limits appear to be lower
in the matrix addition case, indicating that faults are less likely to be successfully located
for that operator.
The classification of results obtained from matrix addition simulations performed under
the influence of transient faults is shown in Figure 5.7. These, as expected, are similar to
those seen in Figure 5.3 for matrix-matrix multiplication under the same fault conditions.
The explanation for the masked fault proportionalities observed under transient fault
injection given in Section 5.3 is exemplified even more clearly in Figure 5.7, where it can
be seen that the frequency of occurrence is exclusively dependent upon f . Zero false
[Figure 5.5 plots, panels (a) Detected, (b) False positive, (c) False negative and (d) Masked: proportion of results (%) versus matrix size s. Panels (a) and (b) have one series per combination of f and d; panels (c) and (d) have one series per n ∈ {2, 4, 8, 16, 32}.]
Figure 5.5: Matrix addition permanent fault observability. Each plot shows the proportion
of total accelerator execution’s results that led to a particular class of output
for a particular combination of s, n, f and d.
negatives were encountered during matrix addition with transient fault injection testing.
The primary difference in both detected and false positive results obtained here and for
matrix-matrix multiplication under the same fault conditions was that the lower bounds
observed were lower in the matrix addition case.
The results of locatability testing for the same operator under transient fault injection
are shown in Figure 5.8. Note that these are comparable to those presented in Figure 5.4
for matrix-matrix multiplication under the same fault conditions. As was the case for that
operator, all faults were locatable when f = 1 and their proportions thereafter decreased
as f rose. The plots shown for f > 1 in the matrix addition case are shallower than
those in the matrix-matrix multiplication case, with lower bounds for low s but similar
locatability proportions for higher s.
[Figure 5.6 plot: proportion of results (%) versus data width n (bits), one series per combination of f ∈ {1, 2, 3} and d ∈ {1, 2, 3}.]
Figure 5.6: Matrix addition permanent fault locatability. Each plot shows the proportion
of total accelerator execution’s results that gave a one-to-one mapping between
faulty adders and incorrectly computed column-wise checksums for a particular
combination of s, n, f and d.
[Figure 5.7 plots, panels (a) Detected, (b) False positive, (c) False negative and (d) Masked: proportion of results (%) versus matrix size s. Panels (a), (b) and (d) have one series per combination of f and d; panel (c) has one series per n ∈ {2, 4, 8, 16, 32}.]
Figure 5.7: Matrix addition transient fault observability. Each plot shows the proportion
of total accelerator execution’s results that led to a particular class of output
for a particular combination of s, n, f and d.
5.5 Matrix-vector Multiplication
Figure 5.9 summarises the results of matrix-vector multiplication simulations performed
in the presence of permanent faults. Although the proportions of detected faults
decreased with f and d, they were seen to be most dependent upon s and n; they are
therefore displayed after being averaged across f and d. Note that detectability also
increased with both s and n. The proportions of observed false positives did not vary
significantly with n, hence these are averaged across the range of results for that variable.
In line with previous operators, the occurrence of false positives was seen to increase
with d but decrease with s and f . False negatives were not observed for f = 1, but
are presented here averaged over both f and d since variations in s and n were more
significant. Unfortunately, the proportions of false negatives encountered increased with
s. They did, however, reduce with n, and increased with s less significantly as n fell in
magnitude. Masked faults, displayed here averaged over the range of s tested, were found
to be independent of d, decreasing with both n and f .
[Figure 5.8 plot: proportion of results (%) versus matrix size s, one series per combination of f ∈ {1, 2, 3} and d ∈ {1, 2, 3}.]
Figure 5.8: Matrix addition transient fault locatability. Each plot shows the proportion of
total accelerator execution’s results that gave a one-to-one mapping between
faulty adders and incorrectly computed column-wise checksums for a particular
combination of s, n, f and d.
Figure 5.10 presents the final fault injection simulation results: those for the computation
of matrix-vector multiplications in the presence of transient faults. Results shown here
for detected, false positive and masked faults are combined across the range of n tested
since they were found to be largely independent of that variable’s value. False negative
proportions are given by s and n with results combined across the range of f and d for the
same reason. The asymptotic behaviour of the detected plots is largely dictated by the
numbers of masked faults encountered; these are similar to those seen for transient fault
injection in both the matrix-matrix multiplication and matrix addition cases. Masked
fault occurrences can be seen to be independent of d, with their upper bounds dependent
upon f . s can be seen to have little impact upon the likelihood of encountering false
negative results, particularly for larger n.
5.6 Conclusion
In this chapter, the results of fault injection simulations performed upon a trio of ABFT-
protected linear algebra operators—matrix-matrix multiplication, matrix addition and
[Figure 5.9 plots, panels (a) Detected, (b) False positive, (c) False negative and (d) Masked: proportion of results (%). Panels (a) to (c) are plotted against matrix size s and panel (d) against data width n (bits). Panels (a) and (c) have one series per n ∈ {2, 4, 8, 16, 32}; panels (b) and (d) have one series per combination of f and d.]
Figure 5.9: Matrix-vector multiplication permanent fault observability. Each plot shows
the proportion of total accelerator execution’s results that led to a particular
class of output for a particular combination of s, n, f and d.
matrix-vector multiplication—across a range of implementational parameters and operat-
ing conditions were presented. Details of the software framework developed to simulate
those operations under the influence of both permanent and transient faults were also
given. Analysis of the results obtained suggested high fault tolerance across the three operators,
the majority of which improve as hardware complexity grows. The results cement
ABFT’s status as a credible alternative to more established fault tolerance techniques
despite its comparatively low overheads, particularly in terms of area.
Of the variables considered, d has proven to be somewhat disappointing: small, if any,
gains were realised for distance-x, with x > 2, checksumming. Consequently d will remain
fixed at 1 for the work presented in Chapter 6.
[Figure 5.10 plots, panels (a) Detected, (b) False positive, (c) False negative and (d) Masked: proportion of results (%) versus matrix size s. Panels (a), (b) and (d) have one series per combination of f and d; panel (c) has one series per n ∈ {2, 4, 8, 16, 32}.]
Figure 5.10: Matrix-vector multiplication transient fault observability. Each plot shows
the proportion of total accelerator execution’s results that led to a particular
class of output for a particular combination of s, n, f and d.
Algorithm 7 Matrix-matrix multiplication fault injection simulation
1: create s × s matrix A
2: create s × s matrix B
3: A.rand_fill()
4: B.rand_fill()
5: create s × s matrix Cref
6: Cref ← AB
7: create (s + d) × s matrix Ac
8: create s × (s + d) matrix Br
9: Ac ← A.add_cs(‘col’, d)
10: Br ← B.add_cs(‘row’, d)
11: create (s + d) × (s + d) matrix Cf
12: if fault_type = ‘perm’ then
13:   bit_mask, faulty_cols ← generate_bit_mask(s, n, d, 1, f)
14:   for i = 0 to s + d − 1 do
15:     for j = 0 to s + d − 1 do
16:       Cf[i][j] ← bit_mask[j]
17:       for k = 0 to s + d − 2 do
18:         Cf[i][j] ← ((Cf[i][j] + Ac[i][k] × Br[k][j]) mod 2^n) bitwise or bit_mask[j]
19:       end for
20:     end for
21:   end for
22: else
23:   bit_mask, faulty_cols ← generate_bit_mask(s, n, d, 3, f)
24:   for i = 0 to s + d − 1 do
25:     for j = 0 to s + d − 1 do
26:       Cf[i][j] ← bit_mask[i][j][0]
27:       for k = 0 to s + d − 2 do
28:         Cf[i][j] ← ((Cf[i][j] + Ac[i][k] × Br[k][j]) mod 2^n) bitwise or bit_mask[i][j][k + 1]
29:       end for
30:     end for
31:   end for
32: end if
33: data_ok ← Cf.get_data() = Cref
34: cs_ok ← Cf.check_cs()
35: result_type ← classify_result(data_ok, cs_ok)
36: located ← Cf.diagnose_cs() = faulty_cols
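The permanent-fault branch of Algorithm 7 can be illustrated with a minimal Python sketch, restricted to d = 1. The helper names (add_col_checksum, add_row_checksum, faulty_matmul, checksums_ok) are hypothetical, and the checksum encoding and checking rules are assumptions standing in for add_cs() and check_cs(), whose definitions are not given in this chapter: each checksum is taken to be a plain sum modulo 2^n.

```python
def add_col_checksum(A, n):
    """Append one row holding each column's sum modulo 2**n (assumed d = 1 encoding)."""
    mod = 1 << n
    return A + [[sum(col) % mod for col in zip(*A)]]


def add_row_checksum(B, n):
    """Append one column holding each row's sum modulo 2**n (assumed d = 1 encoding)."""
    mod = 1 << n
    return [row + [sum(row) % mod] for row in B]


def faulty_matmul(Ac, Br, n, bit_mask):
    """Compute Ac x Br modulo 2**n, OR-ing bit_mask[j] into every partial sum of
    output column j to model stuck-at-1 bits in that column's MAC unit."""
    mod = 1 << n
    rows, inner, cols = len(Ac), len(Br), len(Br[0])
    Cf = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = bit_mask[j]
            for k in range(inner):
                acc = ((acc + Ac[i][k] * Br[k][j]) % mod) | bit_mask[j]
            Cf[i][j] = acc
    return Cf


def checksums_ok(Cf, n):
    """Return True if every row and column checksum of Cf is consistent (assumed check)."""
    mod = 1 << n
    s = len(Cf) - 1
    rows_ok = all(sum(row[:s]) % mod == row[s] for row in Cf)
    cols_ok = all(sum(Cf[i][j] for i in range(s)) % mod == Cf[s][j]
                  for j in range(s + 1))
    return rows_ok and cols_ok
```

With an all-zero mask the product is fault-free and its checksums agree; setting a single mask bit corrupts one output column, which the checks in this sketch then flag.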
Algorithm 8 generate_bit_mask() procedure used in fault injection simulations
Require: s, n, d, dimensions, f
1: create s + d vector faulty_cols
2: if dimensions = 1 then
3:   create s + d vector bit_mask
4:   for i = 0 to f − 1 do
5:     repeat
6:       col = rand(0 to s + d)
7:       bit = rand(0 to n − 1)
8:     until not bit_mask[col] bitwise and 2^bit
9:     bit_mask[col] ← bit_mask[col] bitwise or 2^bit
10:     faulty_cols[col] ← true
11:   end for
12: else if dimensions = 2 then
13:   create (s + d) × (s + d) matrix bit_mask
14:   for i = 0 to f − 1 do
15:     repeat
16:       col = rand(0 to s + d)
17:       step = rand(0 to s + d)
18:       bit = rand(0 to n − 1)
19:     until not bit_mask[col][step] bitwise and 2^bit
20:     bit_mask[col][step] ← bit_mask[col][step] bitwise or 2^bit
21:     faulty_cols[col] ← true
22:   end for
23: else
24:   create (s + d) × (s + d) × (s + d) 3D matrix bit_mask
25:   for i = 0 to f − 1 do
26:     repeat
27:       row = rand(0 to s + d)
28:       col = rand(0 to s + d)
29:       step = rand(0 to s + d)
30:       bit = rand(0 to n − 1)
31:     until not bit_mask[row][col][step] bitwise and 2^bit
32:     bit_mask[row][col][step] ← bit_mask[row][col][step] bitwise or 2^bit
33:     faulty_cols[col] ← true
34:   end for
35: end if
36: return bit_mask, faulty_cols
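The one-dimensional (permanent fault) case of Algorithm 8 might be written in Python roughly as follows. The function name generate_bit_mask_1d and the rng parameter are hypothetical. The sketch also assumes that column indices are drawn from 0 to s + d − 1 so that they always index the (s + d)-element vector, which is an interpretation of the range given in the pseudocode rather than something the source states.

```python
import random


def generate_bit_mask_1d(s, n, d, f, rng=random):
    """Choose f distinct (column, bit) positions and return the per-column
    stuck-at masks together with flags marking the faulty columns."""
    width = s + d
    if f > width * n:
        raise ValueError("more faults requested than available bit positions")
    bit_mask = [0] * width
    faulty_cols = [False] * width
    for _ in range(f):
        # Redraw until an unused bit position is found (the repeat/until loop).
        while True:
            col = rng.randrange(width)  # assumed range: 0 .. s + d - 1
            bit = rng.randrange(n)      # 0 .. n - 1
            if not bit_mask[col] & (1 << bit):
                break
        bit_mask[col] |= 1 << bit
        faulty_cols[col] = True
    return bit_mask, faulty_cols
```

The two- and three-dimensional cases differ only in the number of indices drawn per fault.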
Algorithm 9 classify_result() procedure used in fault injection simulation
Require: data_ok, cs_ok
1: if data_ok and cs_ok then
2:   result_type ← ‘masked’
3: else if data_ok and not cs_ok then
4:   result_type ← ‘false_pos’
5: else if not data_ok and cs_ok then
6:   result_type ← ‘false_neg’
7: else
8:   result_type ← ‘detected’
9: end if
10: return result_type
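As a direct transliteration, Algorithm 9 reduces to a two-level decision in Python. The string labels follow the pseudocode, with underscores assumed where the typeset names contain spaces.

```python
def classify_result(data_ok: bool, cs_ok: bool) -> str:
    """Map the data and checksum verdicts onto the four result classes."""
    if data_ok:
        # Data correct: either the fault had no effect or only checksums were hit.
        return "masked" if cs_ok else "false_pos"
    # Data corrupted: the checksums either missed or caught the error.
    return "false_neg" if cs_ok else "detected"
```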
Algorithm 10 Matrix addition fault injection simulation
1: create s × s matrix A
2: create s × s matrix B
3: A.rand_fill()
4: B.rand_fill()
5: create s × s matrix Cref
6: Cref ← A + B
7: create (s + d) × (s + d) matrix Af
8: create (s + d) × (s + d) matrix Bf
9: Af ← A.add_cs(‘full’, d)
10: Bf ← B.add_cs(‘full’, d)
11: create (s + d) × (s + d) matrix Cf
12: if fault_type = ‘perm’ then
13:   bit_mask, faulty_cols ← generate_bit_mask(s, n, d, 1, f)
14:   for i = 0 to s + d − 1 do
15:     for j = 0 to s + d − 1 do
16:       Cf[i][j] ← ((Af[i][j] + Bf[i][j]) mod 2^n) bitwise or bit_mask[j]
17:     end for
18:   end for
19: else
20:   bit_mask, faulty_cols ← generate_bit_mask(s, n, d, 2, f)
21:   for i = 0 to s + d − 1 do
22:     for j = 0 to s + d − 1 do
23:       Cf[i][j] ← ((Af[i][j] + Bf[i][j]) mod 2^n) bitwise or bit_mask[i][j]
24:     end for
25:   end for
26: end if
27: data_ok ← Cf.get_data() = Cref
28: cs_ok ← Cf.check_cs()
29: result_type ← classify_result(data_ok, cs_ok)
30: located ← Cf.diagnose_cs() = faulty_cols
Algorithm 11 Matrix-vector multiplication fault injection simulation
1: create s vector a
2: create s × s matrix B
3: a.rand_fill()
4: B.rand_fill()
5: create s vector cref
6: cref ← aB
7: create s × (s + d) matrix Br
8: Br ← B.add_cs(‘row’, d)
9: create s + d vector cr
10: if fault_type = ‘perm’ then
11:   bit_mask, faulty_cols ← generate_bit_mask(s, n, d, 1, f)
12:   for j = 0 to s + d − 1 do
13:     cr[j] ← bit_mask[j]
14:     for k = 0 to s + d − 2 do
15:       cr[j] ← ((cr[j] + a[k] × Br[k][j]) mod 2^n) bitwise or bit_mask[j]
16:     end for
17:   end for
18: else
19:   bit_mask, faulty_cols ← generate_bit_mask(s, n, d, 2, f)
20:   for j = 0 to s + d − 1 do
21:     cr[j] ← bit_mask[j][0]
22:     for k = 0 to s + d − 2 do
23:       cr[j] ← ((cr[j] + a[k] × Br[k][j]) mod 2^n) bitwise or bit_mask[j][k + 1]
24:     end for
25:   end for
26: end if
27: data_ok ← cr.get_data() = cref
28: cs_ok ← cr.check_cs()
29: result_type ← classify_result(data_ok, cs_ok)
6 Reduced-precision Algorithm-based
Fault Tolerance
6.1 Introduction
In this chapter, research into the application of algorithm-based fault tolerance (ABFT)
in a field-programmable gate array (FPGA)-implemented accelerator at reduced levels of
precision is presented. This allows for the introduction of a previously unexplored tradeoff:
sacrificing some observability, preferably of faults associated with low-magnitude errors,
for gains in area, performance and efficiency by reducing the bit-widths of logic used for
error detection. The implementation of two distinct truncation techniques is described,
with their effects upon overheads and allowed data error compared. The methods intro-
duced in this chapter lend themselves to FPGAs thanks to their efficient simultaneous
implementation of multiple arbitrary-precision datapaths. Here, as a case study for the
investigation into reduced-precision ABFT, hardware-accelerated matrix multiplication is
called upon once more.
While previous fixed-point ABFT-related work has assumed all data and checksums to
be data width (n)-bit integer (i.e. modulo-2^n), it is possible to break this relationship
and consider data and checksum precision independently. By making informed decisions
regarding exactly which information to discard when forming and manipulating check-
sums, the incurred overheads can be reduced at the cost of accepting some degree of data
error. This is achieved by effectively bounding allowed data error: in this chapter, levels
of checksum truncation are used to infer maximum error propagation; however, such a
derivation could equally be performed in reverse.
6.1.1 Contributions
The original contributions of the work presented in this chapter are:
• The first consideration of distinct data and checksum bit-widths within ABFT-
protected operations: reduced-precision algorithm-based fault tolerance (RP-ABFT).
• The first implementation of circuitry incorporating RP-ABFT for resilience against
hardware faults.
• Analysis of the costs and benefits of applying two forms of RP-ABFT to various
precisions.
• Insight into the hardware fault tolerance of RP-ABFT.
6.1.2 Publications
The work presented in this chapter has been peer-reviewed and will appear in the 2016 pro-
ceedings of the International Workshop on Applied Reconfigurable Computing (ARC) [75].
6.1.3 Outline
The remainder of this chapter is organised as follows. Section 6.2 describes the math-
ematical principles behind RP-ABFT for the two proposed truncation techniques, with
Sections 6.2.1 and 6.2.2 covering truncation from the most-significant bit (MSB) and least-
significant bit (LSB) first, respectively. Implementational details are given in Section 6.3,
with the small modifications made to the baseline architectures used in Chapters 3 and 4
described first in Section 6.3.1, followed by those needed to implement the two flavours
of RP-ABFT in Sections 6.3.2 and 6.3.3. In Section 6.4, the overheads associated with
RP-ABFT are analysed, focussing upon area and performance in Sections 6.4.1 and 6.4.2,
respectively. Section 6.5 presents analysis of the impacts the selection of the various im-
plementational options introduced by RP-ABFT has upon fault observability within the
targeted datapath, while Section 6.6 gives concluding remarks.
6.2 Principles of RP-ABFT
6.2.1 MSB-first Truncation
During checksum generation and verification, data element bits from the most-significant
downwards can be sacrificed in order to reduce the size of the logic required to manipulate
them into checksums. All input data elements are n-bit signed integers and the number
of bits of precision removed from each during checksum generation is represented by the
truncation width (r). Output data elements are always 2n-bit; this departure from the
norm of all-n bit is elaborated upon in Section 6.3.1. In MSB-first truncation, therefore,
input checksums are reduced to the least significant n−r bits of precision. This also limits
the precision of output checksums to the least significant n− r bits.
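A minimal Python sketch of the idea follows, treating checksums as plain modulo-2^(n−r) sums. The function names are hypothetical, and the sketch does not model how signs are handled in the hardware. It relies on the fact that the low bits of a sum depend only on the low bits of its operands, and it shows the cost: an error confined to the discarded upper bits leaves the truncated checksum unchanged.

```python
def low_bits(x, k):
    """Keep the k least-significant bits of x (two's-complement view)."""
    return x & ((1 << k) - 1)


def msb_truncated_checksum(elements, n, r):
    """Checksum of n-bit elements retained to n - r bits (MSB-first truncation)."""
    k = n - r
    return low_bits(sum(low_bits(e, k) for e in elements), k)
```

For n = 8 and r = 4, for example, adding 16 to any one element changes the full checksum but not the 4-bit truncated one, so such an error would go unobserved.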
6.2.2 LSB-first Truncation
To maintain sensitivity to faults that cause high-magnitude errors, it is possible to perform
truncation from the least, rather than most, significant bits of data elements when forming
and manipulating checksums. The constructions of the inputs and outputs of ABFT-
protected matrix multiplication are shown in Equation 6.1. Each element is marked as
either an input data element (din), output data element (dout), input checksum element
(csin), output checksum element (csout) or corner output checksum element (csout, c), as
appropriate. csout, c is a special form of output checksum: that which is itself formed
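exclusively from csin elements. The ‘c’ in csout, c indicates corner.
\[
\begin{bmatrix}
d_\mathrm{in} & \cdots & d_\mathrm{in} \\
\vdots & \ddots & \vdots \\
d_\mathrm{in} & \cdots & d_\mathrm{in} \\
cs_\mathrm{in} & \cdots & cs_\mathrm{in}
\end{bmatrix}
\begin{bmatrix}
d_\mathrm{in} & \cdots & d_\mathrm{in} & cs_\mathrm{in} \\
\vdots & \ddots & \vdots & \vdots \\
d_\mathrm{in} & \cdots & d_\mathrm{in} & cs_\mathrm{in}
\end{bmatrix}
=
\begin{bmatrix}
d_\mathrm{out} & \cdots & d_\mathrm{out} & cs_\mathrm{out} \\
\vdots & \ddots & \vdots & \vdots \\
d_\mathrm{out} & \cdots & d_\mathrm{out} & cs_\mathrm{out} \\
cs_\mathrm{out} & \cdots & cs_\mathrm{out} & cs_{\mathrm{out},c}
\end{bmatrix}
\tag{6.1}
\]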
Symbols for maximum absolute value (∨) and maximum absolute error (ε) must be
introduced next. ∨(din) is as defined in Equation 6.2. The r-bit LSB-first truncation of a
din element, performed with bitwise shifts as (din ≫ r) ≪ r, is represented as ⌊din⌋r since
rounding, for both positive and negative values, is towards negative infinity. Note that, as
also shown in Equation 6.2, ∨(⌊din⌋r) = ∨(din); the maximum negative value, for which
truncation by any 0 ≤ r < n will have no effect, also represents the maximum absolute
value. ε(⌊din⌋r) is as defined in Equation 6.3.
\[ \vee(\lfloor d_\mathrm{in} \rfloor_r) = \vee(d_\mathrm{in}) = 2^{n-1} \tag{6.2} \]
\[ \epsilon(\lfloor d_\mathrm{in} \rfloor_r) = 2^r - 1 \tag{6.3} \]
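The shift-based truncation and the bound of Equation 6.3 can be checked exhaustively for small n with a short Python sketch. The function names are hypothetical. Python's right shift on negative integers is arithmetic, so it rounds towards negative infinity as the text describes.

```python
def lsb_truncate(d, r):
    """r-bit LSB-first truncation, (d >> r) << r; rounds towards negative infinity."""
    return (d >> r) << r


def max_truncation_error(n, r):
    """Largest d - lsb_truncate(d, r) over every n-bit signed integer d."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return max(d - lsb_truncate(d, r) for d in range(lo, hi + 1))
```

For instance, lsb_truncate(13, 2) gives 12 and lsb_truncate(-13, 2) gives -16, while the most negative n-bit value is left untouched for any 0 ≤ r < n.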
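Each csin element is formed from square matrix size (s) din elements, each independently
truncated, as shown in Equation 6.4. ∨ and ε of each csin element are therefore trivial to
calculate, as shown in Equations 6.5 and 6.6, respectively.
\[ cs_\mathrm{in} = \lfloor d_\mathrm{in} \rfloor_r + \cdots + \lfloor d_\mathrm{in} \rfloor_r \tag{6.4} \]
\[ \vee(cs_\mathrm{in}) = s \vee(\lfloor d_\mathrm{in} \rfloor_r) = s 2^{n-1} \tag{6.5} \]
\[ \epsilon(cs_\mathrm{in}) = s \epsilon(\lfloor d_\mathrm{in} \rfloor_r) = s (2^r - 1) \tag{6.6} \]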
csout elements are comprised of s multiplied pairs of din and csin, summed as shown in
Equation 6.7. Since the din element used within each multiplication is not truncated, it
does not introduce error: this comes purely from each csin, so ε(csout) is found from ∨ of
each din and ε of each csin, as shown in Equation 6.8.
\[ cs_\mathrm{out} = d_\mathrm{in} cs_\mathrm{in} + \cdots + d_\mathrm{in} cs_\mathrm{in} \tag{6.7} \]
\[ \epsilon(cs_\mathrm{out}) = s \vee(d_\mathrm{in}) \epsilon(cs_\mathrm{in}) = s^2 (2^r - 1) 2^{n-1} \tag{6.8} \]
The csout, c element is formed of s multiplied pairs of csin elements, summed as shown
in Equation 6.9. Unlike for each csout element, therefore, error can be introduced by both
of the multiplicands within each product. As a result, the combination of worst-case error
from both csin elements, including their cross-product, within each multiplication must be
taken into account when quantifying ε(csout, c): the result is given in Equation 6.10.
\[ cs_{\mathrm{out},c} = cs_\mathrm{in} cs_\mathrm{in} + \cdots + cs_\mathrm{in} cs_\mathrm{in} \tag{6.9} \]
\[
\epsilon(cs_{\mathrm{out},c})
= s \left( \vee(cs_\mathrm{in}) \epsilon(cs_\mathrm{in}) + \epsilon(cs_\mathrm{in}) \vee(cs_\mathrm{in}) + \epsilon(cs_\mathrm{in})^2 \right)
= s^3 (2^r - 1)(2^n + 2^r - 1)
\tag{6.10}
\]
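The chain of bounds in Equations 6.2 to 6.10 can be evaluated with exact integer arithmetic, which also confirms that the step-by-step products agree with the closed forms. The following Python sketch uses hypothetical function names and simply transcribes the equations; it says nothing about whether the worst cases are attainable in practice.

```python
def max_din(n):
    """Equation 6.2: maximum absolute value of an n-bit signed element."""
    return 2 ** (n - 1)


def eps_din(r):
    """Equation 6.3: maximum error from r-bit LSB-first truncation."""
    return 2 ** r - 1


def max_csin(s, n):
    """Equation 6.5: maximum absolute value of an input checksum."""
    return s * max_din(n)


def eps_csin(s, r):
    """Equation 6.6: maximum error of an input checksum."""
    return s * eps_din(r)


def eps_csout(s, n, r):
    """Equation 6.8: s products of an exact d_in with an inexact cs_in."""
    return s * max_din(n) * eps_csin(s, r)


def eps_csout_c(s, n, r):
    """Equation 6.10: s products of two inexact cs_in, with cross-term."""
    v, e = max_csin(s, n), eps_csin(s, r)
    return s * (v * e + e * v + e * e)
```

With s = 32, n = 32 and r = 8, for example, eps_csout(32, 32, 8) equals 32^2 × 255 × 2^31, matching the closed form in Equation 6.8.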
6.3 Implementation
6.3.1 Baseline Architecture
The baseline architecture used for comparison in this chapter is structurally identical to
that shown in Figure 4.5, less the reconfigurable routing regions. It is shown in Figure 6.1.
The principal difference between this architecture and those described in Chapters 3, 4
and 5 is the output data width: while in previous chapters the output data was always
assumed to be n-bit, the same as input data, here the output data is expanded to 2n-bit
in order to allow the two truncation types developed in this chapter to be compared.
[Figure 6.1 block diagram: input RAM (2s × ns), checksum generation logic, Ac and Br paths of n + ⌈log2(s)⌉ bits per element feeding an array of multiply-accumulate units, Cf (2n(s + 1) bits), output RAM ((s + 1) × 2n(s + 1)) and checksum verification logic.]
Figure 6.1: Datapath with zero truncation. While in previous chapters the output data
was always assumed to be n-bit, the same as input data, here the output data
is expanded to 2n-bit in order to allow the two truncation types developed in
this chapter to be compared.
Checksum generation and verification logic, shown in Figures 6.2 and 6.3, respectively,
serves to perform the normal ABFT procedures described in Section 3.2.3.
[Figure 6.2 block diagram: input RAM, s : 1 multiplexer, accumulator, csc RAM (s × w) and csr RAM (s × w), with outputs Ac (w bits) and Br (w(s + 1) bits), where w = n + ⌈log2(s)⌉.]
Figure 6.2: Checksum generation logic with zero truncation, used to perform the ‘normal’
ABFT checksum generation procedures.
[Figure 6.3 block diagram: Cf (2n(s + 1) bits), (s + 1) : 1 multiplexer, 2n-bit adders, csc RAM ((s + 1) × 2n) and equality comparators producing the ‘s + 1 csrs OK’ and ‘s + 1 cscs OK’ outputs.]
Figure 6.3: Checksum verification logic with zero truncation, used to perform the ‘normal’
ABFT checksum verification procedures.
6.3.2 MSB-first Truncation
Modifications to the logic shown in Figures 6.2 and 6.3 needed to achieve the MSB-first
truncation explained in Section 6.2.1 are straightforward since only bit-widths change.
These changes have some knock-on effects on the top-level overview shown in Figure 6.1:
now, rather than n+ log2(s), the per-element paths for Ac and Br only have to be n bits
wide and the output taken from the output block random-access memory (BRAM) only
has to be n − r, rather than 2n, bits per element. These changes result in the modified
datapath shown in Figure 6.4 and the checksum generation and verification logic shown in
Figures 6.5 and 6.6, respectively. Note that the signs of all elements are preserved, despite
the moniker ‘MSB-first truncation.’
[Figure 6.4 block diagram: input RAM (2s × ns), checksum generation logic, Ac and Br paths of n bits per element feeding an array of multiply-accumulate units, Cf (2n(s + 1) bits), output RAM ((s + 1) × 2n(s + 1)) and checksum verification logic fed with (n − r)(s + 1) bits.]
Figure 6.4: Datapath with MSB-first truncation. Rather than n+log2(s), the per-element
paths for Ac and Br only have to be n bits wide and the output taken from
the output BRAM only has to be n− r, rather than 2n, bits per element.
[Figure 6.5 block diagram: input RAM, s : 1 multiplexer, (n − r)-bit accumulator input, csc RAM (s × (w − r)) and csr RAM (s × (w − r)), with outputs Ac and Br, where w = n + ⌈log2(s)⌉.]
Figure 6.5: Checksum generation logic with MSB-first truncation. Compared to the zero-
truncation implementation, only bit-widths are different.
[Figure 6.6 block diagram: Cf ((n − r)(s + 1) bits), (s + 1) : 1 multiplexer, (n − r)-bit adders, csc RAM ((s + 1) × (n − r)) and equality comparators producing the ‘s + 1 csrs OK’ and ‘s + 1 cscs OK’ outputs.]
Figure 6.6: Checksum verification logic with MSB-first truncation. Compared to the zero-
truncation implementation, only bit-widths are different.
6.3.3 LSB-first Truncation
Achieving the LSB-first truncation described in Section 6.2.2 requires different logic to
that shown in Figures 6.2 and 6.3; the blocks shown in Figures 6.8 and 6.9 stand in their
places. In terms of changes to the top-level overview, Figure 6.1, the per-element paths for
Ac and Br are (n + max(log2(s) − r, 0))-bit, rather than (n + log2(s))-bit, to optimally fit
the single largest data or checksum element. These changes are represented in Figure 6.7.
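The bit-widths quoted for the three datapath variants can be collected into one small Python helper for comparison. This is only a sketch: the function name, the dictionary keys and the truncation labels are hypothetical, and the verification widths are read from the figure annotations (2n with zero truncation, n − r for MSB-first, 2n − r after the right-shifter for LSB-first) rather than stated as formulas in the text.

```python
def path_widths(s, n, r, truncation):
    """Per-element widths (bits) of the Ac/Br paths and of the verification logic."""
    g = (s - 1).bit_length()  # ceil(log2(s)) for s >= 1, using integers only
    if truncation == "none":
        return {"input_path": n + g, "verification": 2 * n}
    if truncation == "msb":
        return {"input_path": n, "verification": n - r}
    if truncation == "lsb":
        return {"input_path": n + max(g - r, 0), "verification": 2 * n - r}
    raise ValueError("truncation must be 'none', 'msb' or 'lsb'")
```

For s = 32, n = 32 and r = 4, this gives a 28-bit verification path with MSB-first truncation against 60 bits with LSB-first truncation and 64 bits with none.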
[Figure 6.7 block diagram: input RAM (2s × ns), checksum generation logic, Ac and Br paths of w bits per element feeding an array of multiply-accumulate units, Cf (2n(s + 1) bits), output RAM ((s + 1) × 2n(s + 1)) and checksum verification logic, where w = n + max(⌈log2(s)⌉ − r, 0).]
Figure 6.7: Datapath with LSB-first truncation. The per-element paths for Ac and Br
are (n + max(log2(s) − r, 0))-bit, rather than (n + log2(s))-bit, to optimally
fit the single largest data or checksum element.
Output checksum error must be tolerated up to the levels theorised in Equations 6.8
and 6.10 as a result of the truncation performed. Clearly, there is no reason to actually
perform the left-shifting shown in the explanation of the truncation procedure; for this
reason, the output checksum element error threshold (θ) and corner output checksum element
error threshold (θc), shown in Figure 6.9 for csout and csout, c elements, respectively,
need to be based upon, not equal to, ε(csout) and ε(csout, c). csout elements each have
their widths reduced by r bits due to the right-shifter shown in Figure 6.9; as a result,
θ is calculated as shown in Equation 6.11. csout, c elements, however, are constructed
exclusively from csin elements: they are therefore subject to magnitude reduction by the
right-shifters shown in both Figures 6.8 and 6.9. θc is therefore calculated as shown in
Equation 6.12.
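\[ \theta = \frac{\epsilon(cs_\mathrm{out})}{2^r} = \frac{s^2 (2^r - 1) 2^{n-1}}{2^r} \approx s^2 2^{n-1} \tag{6.11} \]
\[ \theta_c = \frac{\epsilon(cs_{\mathrm{out},c})}{2^{2r}} = \frac{s^3 (2^r - 1)(2^n + 2^r - 1)}{2^{2r}} \approx s^3 2^{n-r} \tag{6.12} \]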
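To see how close the approximations in Equations 6.11 and 6.12 are, the exact thresholds can be evaluated as rationals in Python. The function names are hypothetical, and the checksum_ok comparison is an assumption based on the subtract, absolute-value and less-than blocks drawn in Figure 6.9; how the thresholds are rounded in hardware is not stated here.

```python
from fractions import Fraction


def theta(s, n, r):
    """Exact form of Equation 6.11."""
    return Fraction(s**2 * (2**r - 1) * 2**(n - 1), 2**r)


def theta_c(s, n, r):
    """Exact form of Equation 6.12."""
    return Fraction(s**3 * (2**r - 1) * (2**n + 2**r - 1), 2**(2 * r))


def checksum_ok(expected, observed, threshold):
    """Assumed tolerance test: accept if the checksums differ by less than the threshold."""
    return abs(expected - observed) < threshold
```

For s = 32, n = 32 and r = 8, the exact θ is 255/256 of its approximation s^2 2^(n−1), and θc is within 1% of s^3 2^(n−r).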
6.4 Overheads
Designs were implemented using Xilinx Vivado 2014.4 across the following range of im-
plementation variables:
• Target device: Xilinx Zynq-7000 XC7Z020.
• s: {2, 4, 8, 16, 32}.
• n (bits): 32 (signed, fixed-point).
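[Figure 6.8 block diagram: input RAM, s : 1 multiplexer, r-bit right-shifter (≫ r), (n − r)-bit accumulator input, csc RAM (s × w1) and csr RAM (s × w1), with outputs Ac (w2 bits) and Br (w2(s + 1) bits), where w1 = n − r + ⌈log2(s)⌉ and w2 = n + max(⌈log2(s)⌉ − r, 0).]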
Figure 6.8: Checksum generation logic with LSB-first truncation. Truncation is performed
by the r-bit right-shifter shown.
• Truncation type: {none, MSB-first, LSB-first}.
• r (bits): {0, 4, 8, 12, 16, 20, 24, 28}.
• Data memory resource type: BRAM.
• Checksum memory resource type: distributed random-access memory (RAM).
• Multiplier resource type: digital signal processing block (DSP).
• Multiplier latency (m) (cycles): 15.
• Accumulator latency (a) (cycles): 1.
Note that the majority of parameters—s, n, resource types, m and a—were kept the same
as those used in the experiments described in Chapters 3 and 4 for the sake of comparison.
6.4.1 Area
Tables 6.1, 6.2, 6.3 and 6.4 contain the raw area utilisation figures obtained for all imple-
mentations of the accelerator expressed in terms of look-up tables (LUTs), flip-flops (FFs),
BRAMs, DSPs and total (combined) resources used. The latter is the mean of the four
preceding proportions, intended to give an indication as to the overall resource utilisation
for a particular design. Absolute and relative (percentage) values are given, the latter in-
dicating proportions of resources used on the target device. Figures 6.10 and 6.11 show the
combined area overhead versus the equivalently sized unprotected design, i.e. that without
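[Figure 6.9 block diagram: Cf (2n(s + 1) bits), (s + 1) : 1 multiplexer, r-bit right-shifter (≫ r) giving 2n − r bits, adders, csc RAM ((s + 1) × (2n − r)), and subtractors with absolute-value and less-than comparison against θ and θc producing the ‘s + 1 csrs OK’ and ‘s + 1 cscs OK’ outputs.]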
Figure 6.9: Checksum verification logic with LSB-first truncation. Truncation is performed
by the r-bit right-shifter shown.
incorporated ABFT, for RP-ABFT with MSB-first truncation, while Figures 6.12 and 6.13
show the same for LSB-first truncation. Note that r = 0 in both MSB- and LSB-first cases
refers to the same design with ABFT protection but zero truncation applied.
The MSB-first truncation results shown in Figures 6.10 and 6.11 demonstrate clear
area gains for this truncation method. Thanks mostly to the severe effect of the output
truncation described in Section 6.2.1 for any r > 0, a dramatic drop in resource usage is
seen after r = 0 for any s. In the most extreme case tested, for s = 32 and r = 28, overhead
drops from 15.2% to just 2.53%: an 83.3% reduction. The changes in overhead seen for
LSB-first truncation, on the other hand, are more complex: as Figures 6.12 and 6.13 show,
overheads initially increase in all cases other than s = 32. This is primarily due to the
introduction of the subtractors shown in Figure 6.9. Gains are realised in the s = 16 case
for r ≥ 8, r ≥ 16 in the s = 8 case and r ≥ 20 in the remaining two. The maximum
area overhead reduction, again for s = 32 and r = 28, was 23.8%: less than the equivalent
MSB-first truncation’s, but still not insignificant.
6.4.2 Performance
For the same set of designs, the reported timing model-inferred maximum operating fre-
quency (fmax) was also recorded. The raw results obtained are detailed in Table 6.5,
with changes versus the equivalently sized unprotected designs for MSB- and LSB-first
truncation, in the same style as those produced for area in Figures 6.10 and 6.12, shown
in Figures 6.14 and 6.15, respectively. To overcome the effects of computer-aided de-
sign (CAD) noise, trendlines are included for each plot, shown as dashed lines. For the
[Figure 6.10 plots, panels (a) LUT, (b) FF, (c) BRAM and (d) DSP: ∆ area (%) versus truncation width r (bits), one series per s ∈ {2, 4, 8, 16, 32}.]
Figure 6.10: RP-ABFT with MSB-first truncation-protected accelerator resource usage
overhead versus baseline, showing the change in individual resource util-
isations for each design versus its equivalently sized unprotected, baseline
implementation.
MSB-first truncation plots shown in Figure 6.14, these begin at r = 4 due to the disconti-
nuities expected prior to that value. Note that neither the type of RP-ABFT nor r affects
the (clock cycle) latency of a design versus its standard ABFT equivalent, allowing fmax
to be used for performance comparison directly.
Due primarily to the wide (64-bit) adders needed to enable checksum verification, fmax
reductions are significant in the zero truncation case: for s = 32, fmax drops by 40.8%.
Such penalties are practically eliminated, and in a few cases become net gains, through
the use of MSB-first truncation; this is shown in Figure 6.14. The large jump seen for all
s to near-zero is due to the elimination of at least 50% of the output bits during checksum
verification explained in Section 6.2.1. To make this point clearer, Table 6.6 lists the input
and output checksum widths for each of the designs compiled. Recall that n = 32 in all
cases. As can be seen, the verification logic’s data width decreases from 64 to just 28
bits for r = 4, and reduces further thereafter: the regression lines demonstrate small but
fairly consistent gains. LSB-first truncation designs exhibit relatively small performance
[Figure 6.11 plot; axes: ∆ area (%) versus truncation width r (bits); series: s = 2, 4, 8, 16, 32]
Figure 6.11: RP-ABFT with MSB-first truncation-protected accelerator combined resource
usage overhead versus baseline, showing the change in combined resource
utilisation for each design versus its equivalently sized unprotected,
baseline implementation.
improvements: for s = 32, a drop in frequency impact of 7.23% was found. Although
trends for smaller s are actually negative, those for larger s are positive. This is a result
of the lack of severe output truncation, as shown in Table 6.6, and the introduction of
additional logic as described in Section 6.4.1. Nevertheless, frequency gains were realised
for larger designs, with the s = 32 case showing increasing gains for each value of r tested
but the last.
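The percentage changes quoted above follow directly from the raw frequencies. The short Python sketch below recomputes two of them for s = 32 using values copied from Table 6.5; the function and constant names are illustrative assumptions and do not come from the thesis.

```python
# fmax values (MHz) for s = 32, copied from Table 6.5.
FMAX_BASELINE = 102.37  # unprotected design
FMAX_ABFT = 60.57       # standard ABFT, no truncation
FMAX_MSB_R4 = 98.39     # RP-ABFT, MSB-first truncation, r = 4


def fmax_change_percent(fmax, baseline):
    """Relative change in fmax versus the unprotected baseline, in percent."""
    return 100.0 * (fmax - baseline) / baseline


# Zero-truncation ABFT: the 40.8% reduction quoted in the text.
print(round(fmax_change_percent(FMAX_ABFT, FMAX_BASELINE), 1))
# MSB-first truncation at r = 4: most of that penalty is recovered.
print(round(fmax_change_percent(FMAX_MSB_R4, FMAX_BASELINE), 1))
```

Because latency in clock cycles is unaffected by truncation, these ratios can be read as performance changes without further adjustment.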
6.5 Fault Observability
Functional simulations were performed in software to assess the fault observability of the
proposed designs across the following range of variables:
• Fault type: {permanent, transient}.
• s: {2, 4, 8, 16, 32}.
• n (bits): {2, 4, 8, 16, 32}.
• Truncation type: {none, MSB-first, LSB-first}.
• r (bits): {0, 4, 8, 12, 16, 20, 24}.
For each combination of these, the steps detailed in Algorithm 12 were repeated 1,024,000
times, with results averaged thereafter.
[Figure 6.12 panels: (a) LUT, (b) FF, (c) BRAM, (d) DSP; axes: ∆ area (%) versus truncation width r (bits); series: s = 2, 4, 8, 16, 32]
Figure 6.12: RP-ABFT with LSB-first truncation-protected accelerator resource usage
overhead versus baseline, showing the change in individual resource utilisations
for each design versus its equivalently sized unprotected, baseline
implementation.
In all cases, the fault model applied was that of individually targeted stuck-at-one
(SA1) accumulator output bits. For permanent faults, such SA1s are representative of,
for example, worn transistors or bridged interconnects, while they mimic effects including
register and memory upsets in the case of transients.
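As an illustration of this fault model, the following Python sketch forces a single accumulator output bit to one after every multiply-accumulate step, as would be the case for a permanent SA1. It is a minimal sketch under stated assumptions: the function names, the choice of bit and the use of plain Python lists are hypothetical, and only the modulo-2^{2n} accumulation and the bitwise OR are taken from the text.

```python
def inject_sa1(value, bit, width):
    """Force one bit of a width-bit accumulator output to one."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside accumulator width")
    return (value | (1 << bit)) & ((1 << width) - 1)


def mac_with_permanent_sa1(a_row, b_col, n, bit):
    """Multiply-accumulate modulo 2**(2n), applying the SA1 at every step."""
    width = 2 * n
    acc = inject_sa1(0, bit, width)  # the stuck bit is visible from reset
    for a, b in zip(a_row, b_col):
        acc = inject_sa1((acc + a * b) % (1 << width), bit, width)
    return acc


# Fault-free result is 1*3 + 2*4 = 11; with bit 0 stuck at one it differs.
print(mac_with_permanent_sa1([1, 2], [3, 4], n=4, bit=0))
```

A transient fault could be emulated in the same way by applying `inject_sa1` at one chosen step only. When the targeted bit is already one the fault has no effect on the stored value.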
Results gleaned from this testing were, as is the case in all of the fault observability
testing performed in this thesis, independent of fault rate, area and latency: they demonstrate
the proportions of total accelerator executions that should be expected to result in
particular classes of outputs under fixed fault conditions. They cannot be used to directly
ascertain the expected rates of certain output classes’ occurrence, although this can be
achieved by scaling the proportions to take fault rate, area and/or latency, as required,
into account.
[Figure 6.13 plot; axes: ∆ area (%) versus truncation width r (bits); series: s = 2, 4, 8, 16, 32]
Figure 6.13: RP-ABFT with LSB-first truncation-protected accelerator combined resource
usage overhead versus baseline, showing the change in combined resource
utilisation for each design versus its equivalently sized unprotected, baseline
implementation.
Algorithm 12 is largely similar to that used for matrix-matrix multiplication with emulated
fault injection described in Section 5.2 and detailed in Algorithm 7. Changes were
made to suit the introduction of r, and distance (d) was fixed at 1 since only a single
row or column of checksumming was used within the hardware described in Section 6.3.
Checksumming is added to A (Line 9) and B (Line 10) to form Ac and Br as described
in Section 6.2.1 (for MSB-first truncation) or Section 6.2.2 (for LSB-first) via procedure
call add_cs(). When multiply-accumulation takes place (Line 18 for permanent fault
injection; Line 28 for transient), each step is performed modulo-2^{2n}, rather than 2^n, as
explained in Section 6.3.1. Following computation, procedures check_cs() (Line 34) and
diagnose_cs() (Line 36) are called. While equivalent in purpose, these are somewhat
more complex than their equivalents in Algorithm 7 since they have to be capable of
operating with both truncation types.
Procedure add_cs() is described in Algorithm 13. For column-wise checksum generation,
starting from Line 1, a new matrix is provisioned (Line 2) with an additional row
for its checksums, its uppermost s rows being copied from the source matrix (Line 3). For
[Figure 6.14 plot; axes: ∆fmax (%) versus truncation width r (bits); series: s = 2, 4, 8, 16, 32]
Figure 6.14: RP-ABFT with MSB-first truncation-protected accelerator fmax versus baseline,
showing the change in execution time for each design versus its equivalently
sized unprotected, baseline implementation. Trendlines are included
to counter CAD noise.
each column, the appropriate checksum element is zeroed (Line 5) and then accumulated
into, once per row of the source matrix, depending on the type of truncation required
as specified by variable truncation_type. If no truncation is needed, accumulations are
carried out modulo-2^{2n} (Line 8). For MSB-first truncation, they are instead performed
modulo-2^{n−r}, as explained in Section 6.2.1, while for LSB-first truncation each data element
of the source matrix is right-shifted by r bits prior to each accumulation (Line 12), as
explained in Section 6.2.2. To prevent overflow, no modulus is used with LSB-first truncation.
One of the reasons Python was chosen as the language with which to implement the
framework was its default use of arbitrary-precision integers [76], thus preventing transparent
overflow. For row-wise checksum generation (from Line 17), the steps are identical
to those used for column-wise generation but with row and column indices reversed. Full
checksum generation (from Line 33) is achieved by performing the steps for both column-
and row-wise checksumming. The order in which this is completed is irrelevant.
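The column-wise branch of the procedure described above can be sketched in Python, the language the framework is stated to use. This is a minimal illustration rather than the thesis code: the function name add_col_checksum, the list-of-lists matrix representation and the string labels for the truncation types are assumptions.

```python
def add_col_checksum(a, trunc_type, n, r):
    """Return a copy of square matrix a with a column-checksum row appended."""
    s = len(a)
    checksums = []
    for j in range(s):
        cs = 0  # checksum element zeroed before accumulation
        for i in range(s):
            if trunc_type == "none":
                cs = (cs + a[i][j]) % (1 << (2 * n))
            elif trunc_type == "msb_first":
                cs = (cs + a[i][j]) % (1 << (n - r))
            else:
                # LSB-first: shift each element, no modulus; Python integers
                # are arbitrary-precision, so nothing overflows silently.
                cs += a[i][j] >> r
        checksums.append(cs)
    return [row[:] for row in a] + [checksums]


a = [[5, 12], [9, 7]]
for kind in ("none", "msb_first", "lsb_first"):
    print(kind, add_col_checksum(a, kind, n=4, r=2)[-1])
```

Row-wise generation would follow by swapping the roles of the two indices, as the text notes.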
Algorithm 14 details the steps performed in procedure check_cs(). To verify the checksums,
those present within Cf are removed, leaving only data elements, when forming
[Figure 6.15 plot; axes: ∆fmax (%) versus truncation width r (bits); series: s = 2, 4, 8, 16, 32]
Figure 6.15: RP-ABFT with LSB-first truncation-protected accelerator fmax versus baseline,
showing the change in execution time for each design versus its equivalently
sized unprotected, baseline implementation. Trendlines are included
to counter CAD noise.
Ccopy (Line 2). Ccopy is expanded through the addition of both column- and row-wise
checksums into matrix Ccopy,f (Line 4). Column- and row-wise checksums within Cf and
Ccopy,f are then compared in turn. Vectors cs and cscopy, containing the appropriate
checksum elements from Cf and Ccopy,f, respectively, are created (Lines 8 and 9). If
truncation_type indicates that no truncation was performed, cs and cscopy are simply
compared (Line 11). For MSB-first truncation, elements of cs are compared with those
in cscopy after their top n + r bits have been discarded (Line 16); checksumming is blind
to these bits. For LSB-first truncation, absolute values of differences between checksum
elements are computed and compared to error values θ (Line 21, for non-corner elements)
or θc (Line 22, for the corner element), as derived in Section 6.3.3 and shown in Equations
6.11 and 6.12, respectively, as appropriate. Element-wise comparison is performed
either modulo-2^{2n−r} (Line 24, for non-corner elements) or modulo-2^{2(n−r)} (Line 28, for
the corner element) to account for the number of bits remaining post-right shifting. For
non-corner elements this is 2n − r since right-shifting by r bits is performed once during
their computation, while for the corner element it is 2(n − r) since r-bit right-shifting
happens twice.
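The bit-width reasoning above can be made concrete with a short Python sketch. It covers only the MSB-first comparison and the LSB-first width calculation; the function names are assumptions introduced for illustration, and the threshold comparison itself is left to Algorithm 14.

```python
def msb_first_checksums_match(cs, cs_copy, n, r):
    """Compare after discarding the top n + r bits of each 2n-bit checksum."""
    mask = (1 << (n - r)) - 1  # keep only the low n - r bits
    return all((c & mask) == e for c, e in zip(cs, cs_copy))


def lsb_first_widths(n, r):
    """Bits remaining after right-shifting: (non-corner, corner)."""
    return 2 * n - r, 2 * (n - r)


# Errors confined to the discarded upper bits go unnoticed (first call),
# whereas an error in the retained low bits is flagged (second call).
print(msb_first_checksums_match([0x1F3], [0x3], n=8, r=4))
print(msb_first_checksums_match([0x1F4], [0x3], n=8, r=4))
print(lsb_first_widths(32, 4))
```

For n = 32 and r = 4 the non-corner width evaluates to 60 bits, which agrees with the LSB-first output checksum width listed in Table 6.6.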
Figures 6.16, 6.17, 6.18 and 6.19 respectively show the results of this testing for the
proportions of detected, false positive, false negative and masked faults for each combination
of truncation and fault type.
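The four result classes can be derived from the two Boolean outcomes computed at the end of Algorithm 12 (data_ok and cs_ok). The mapping used in the Python sketch below follows the conventional meaning of the class names and is an assumption, since the definition of classify_result() lies outside this section; the helper proportions() is likewise hypothetical.

```python
from collections import Counter


def classify_result(data_ok, cs_ok):
    """Map data and checksum correctness onto one of four result classes."""
    if data_ok:
        # Correct data: either nothing was flagged or a fault was wrongly reported.
        return "masked" if cs_ok else "false positive"
    # Corrupted data: either it slipped through or it was caught.
    return "false negative" if cs_ok else "detected"


def proportions(outcomes):
    """Percentage of runs in each class, given (data_ok, cs_ok) pairs."""
    counts = Counter(classify_result(d, c) for d, c in outcomes)
    total = sum(counts.values())
    return {label: 100.0 * count / total for label, count in counts.items()}


runs = [(True, True), (False, False), (False, False), (False, True)]
print(proportions(runs))
```

Averaging over a large number of runs per parameter combination yields proportions of the kind plotted in the figures.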
[Figure 6.16 panels: (a) MSB-first, permanent faults, (b) MSB-first, transient faults, (c) LSB-first, permanent faults, (d) LSB-first, transient faults; axes: proportion of results (%) versus truncation width r (bits); series: s = 2, 4, 8, 16, 32]
Figure 6.16: Detected fault proportions for RP-ABFT-protected accelerator. Each plot
shows the proportion of total accelerator execution’s results that led to a
successfully detected output for a particular combination of truncation type,
s and r.
In the broadest sense, the results show that, in all cases, it becomes increasingly more
common for faults to be masked or missed entirely (false negative) than to be detected or
flagged erroneously (false positive) as r grows. Shapes and proportions of results falling into each of the
four result categories are largely the same for both MSB- and LSB-first truncation since
the error thresholds, explained in Sections 6.2.2 and 6.3.3, used for LSB-first truncation
are not considered when comparing actual and expected output data. Transients exhibit
around half the likelihood of detection since they are expected to be masked
on 50% of occasions as a result of the uniformity of the input data. Results tend to be
more positive for larger s since the likelihood of faults being obscured by all surrounding
data decreases as the quantity of that data increases.
[Figure 6.17 panels: (a) MSB-first, permanent faults, (b) MSB-first, transient faults, (c) LSB-first, permanent faults, (d) LSB-first, transient faults; axes: proportion of results (%) versus truncation width r (bits); series: s = 2, 4, 8, 16, 32]
Figure 6.17: False positive fault proportions for RP-ABFT-protected accelerator. Each
plot shows the proportion of total accelerator execution’s results that led to
a false positive output for a particular combination of truncation type, s and
r.
Perhaps the most interesting feature of these results is the shape of the plots seen for
LSB-first truncation under permanent fault injection. Similarly to other plots, jumps
from desirable (high for detected, low for false negative, etc.) to undesirable proportions
of each result type are seen between r = 0 and r > 0, reflecting the inability to detect
low-magnitude errors that begins at r = 1. After this, however, plots remain largely flat until
r = 16, trending in the same directions as they did to begin with thereafter. It is around
this second inflection that the detection logic starts to become ineffective. Consider s = 32
in Equation 6.10. Setting ε(csout, c) = 2^63, i.e. ∨(dout) for n = 32, reveals that at r ≈ 16
corner checksums cease to be effective. Similarly, setting ε(csout) = 2^63 in Equation 6.8
for the same s and n shows that all checksumming is rendered useless at r ≈ 22.
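The two crossover values of r can be reproduced numerically. Equations 6.8 and 6.10 lie outside this section, so the Python sketch below assumes bound forms of s^2 × 2^(n + r − 1) for non-corner checksums and s^3 × 2^(n + r) for the corner checksum. These forms are an assumption, chosen because they reproduce the values quoted above, and should be checked against the equations themselves.

```python
from math import log2


def r_limit(s, n, s_power, exponent_offset):
    """Solve s**s_power * 2**(n + r + exponent_offset) = 2**(2n - 1) for r."""
    return (2 * n - 1) - n - exponent_offset - s_power * log2(s)


# Assumed bound forms (not quoted from the thesis):
#   non-corner: s^2 * 2^(n + r - 1)
#   corner:     s^3 * 2^(n + r)
print(r_limit(32, 32, s_power=2, exponent_offset=-1))  # non-corner checksums
print(r_limit(32, 32, s_power=3, exponent_offset=0))   # corner checksum
```

Under these assumptions, the allowed error reaches 2^63 at r = 22 for non-corner checksums and at r = 16 for the corner checksum when s = n = 32.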
The ability to locate faults, represented by the plots shown in Figure 6.20, is important
for applications requiring correction as well as detection. The accelerator relies upon
column checksum information to infer fault location since, as explained in Section 3.2.4,
output matrix columns have a one-to-one mapping to the multiply-accumulators (MACs)
[Figure 6.18 panels: (a) MSB-first, permanent faults, (b) MSB-first, transient faults, (c) LSB-first, permanent faults, (d) LSB-first, transient faults; axes: proportion of results (%) versus truncation width r (bits); series: s = 2, 4, 8, 16, 32]
Figure 6.18: False negative fault proportions for RP-ABFT-protected accelerator. Each
plot shows the proportion of total accelerator execution’s results that led to
a false negative output for a particular combination of truncation type, s and
r.
used to compute their elements. As for the previously discussed result classes, the results
show discontinuities—due to the harsh initial output truncation or introduction of error
bounding in MSB- and LSB-first truncation, respectively—followed by steady decreases as
r rises. Instances of masking were ignored when determining locatable faults since masked
results are, by definition, unlocatable.
Results flagged as false negatives were analysed to determine the magnitude of allowed
error in the case of faults being missed. In each of those cases, the maximum absolute
error of each of the output matrix’s data elements was calculated, and Figure 6.21 shows
the means of those results. Assuming that unmissed results are able to be corrected,
Figure 6.21’s results therefore represent the average expected worst-element errors introduced
by RP-ABFT.
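The metric described here, the mean over false negative runs of the largest absolute element error, can be sketched as follows. The function names and the representation of each run as a pair of output and reference matrices are assumptions made for illustration.

```python
def max_abs_error(c_data, c_ref):
    """Largest absolute difference between corresponding matrix elements."""
    return max(
        abs(x - y)
        for row, ref_row in zip(c_data, c_ref)
        for x, y in zip(row, ref_row)
    )


def mean_of_max_abs_errors(false_negative_runs):
    """Mean of the per-run maximum absolute errors; 0.0 if there are no runs."""
    errors = [max_abs_error(c, ref) for c, ref in false_negative_runs]
    return sum(errors) / len(errors) if errors else 0.0


runs = [
    ([[1, 2], [3, 4]], [[1, 2], [3, 8]]),  # worst element error: 4
    ([[0, 0], [0, 0]], [[2, 0], [0, 0]]),  # worst element error: 2
]
print(mean_of_max_abs_errors(runs))
```

Only runs classified as false negatives contribute, since the remaining erroneous results are assumed to be correctable.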
The results indicate that MSB-first truncation offers no tradeoff between r and allowed
error: the latter immediately jumps close to ∨(dout) and remains flat thereafter. This is expected
since worst-case errors are introduced as soon as r ≠ 0. While MSB-
[Figure 6.19 panels: (a) MSB-first, permanent faults, (b) MSB-first, transient faults, (c) LSB-first, permanent faults, (d) LSB-first, transient faults; axes: proportion of results (%) versus truncation width r (bits); series: s = 2, 4, 8, 16, 32]
Figure 6.19: Masked fault proportions for RP-ABFT-protected accelerator. Each plot
shows the proportion of total accelerator execution’s results that led to a
masked output for a particular combination of truncation type, s and r.
first truncation incurs lower overhead than either its LSB-first equivalent
or traditional ABFT, the results confirm that its effectiveness is severely limited when
considering allowed output error; although, for the largest r, the same is also true of
LSB-first truncation. RP-ABFT with LSB-first truncation does, however, allow for area (and
consequently power) and performance improvements while allowing relatively small errors
to pass. Although the mean results may seem high, recall that faults were emulated within
every simulation iteration, and that input data was drawn from a uniform distribution.
The results shown in Figures 6.16, 6.17, 6.18, 6.19 and 6.20 do not take area into account.
Additional plots, shown in Figures 6.22, 6.23, 6.24, 6.25 and 6.26, were therefore produced
to attempt to capture the effect of area upon likelihood of fault manifestation, thereby
scaling the previously seen fault proportions. Both s and r have an impact upon area.
For both variables, total RP-ABFT-enabled resource usage figures from Table 6.4 were
used to scale the results, with s = 2, r = 0 taken as the baseline. Detected and located
fault proportions were scaled down as area increased, while false positive, false negative
and masked fault proportions were scaled up.
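A minimal Python sketch of this scaling is given below. The text states only the direction of scaling and the baseline, so scaling by the simple ratio of total resource usage is an assumption, as are the function and constant names; the area figures are copied from Table 6.4.

```python
# Total resource usage (%) from Table 6.4, ABFT enabled.
AREA_BASELINE = 3.85        # s = 2, no truncation (r = 0)
AREA_S32_LSB_R16 = 41.42    # s = 32, LSB-first truncation, r = 16


def area_scaled(proportion, area, desirable):
    """Scale a fault proportion by area relative to the s = 2, r = 0 baseline.

    Desirable classes (detected, located) shrink as area grows; undesirable
    classes (false positive, false negative, masked) grow with it.
    """
    factor = area / AREA_BASELINE
    return proportion / factor if desirable else proportion * factor


print(area_scaled(0.5, AREA_S32_LSB_R16, desirable=True))
print(area_scaled(0.1, AREA_S32_LSB_R16, desirable=False))
```

With this form, a design occupying roughly eleven times the baseline area has its detected proportion divided by that factor, which is consistent with the erosion for larger s seen in the area-scaled plots.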
[Figure 6.20 panels: (a) MSB-first, permanent faults, (b) MSB-first, transient faults, (c) LSB-first, permanent faults, (d) LSB-first, transient faults; axes: proportion of results (%) versus truncation width r (bits); series: s = 2, 4, 8, 16, 32]
Figure 6.20: Located fault proportions for RP-ABFT-protected accelerator. Each plot
shows the proportion of total accelerator execution’s results that led to a
successfully locatable output for a particular combination of truncation type,
s and r.
The area-scaled plots in Figures 6.22, 6.23, 6.24, 6.25 and 6.26 unfortunately show
undesirable behaviour across the truncation types, fault models and r tested for changes
in s. Due to the increasing likelihood of false positive, false negative and masked faults
occurring as r grows, and the increasing expected fault rate as s rises due to additional area
utilisation, fault detectability and locatability are both expected to erode substantially
as s increases for either MSB- or LSB-first truncation experiencing either permanent or
transient faults.
6.6 Conclusion
This chapter introduced a new control within ABFT allowing for the reduction of precision
within checksumming components. The resulting protection is referred to as RP-ABFT.
The design, implementation and evaluation of an RP-ABFT-protected hardware accelerator
were described: the mathematical principles of RP-ABFT were presented first, based
[Figure 6.21 panels: (a) MSB-first, permanent faults, (b) MSB-first, transient faults, (c) LSB-first, permanent faults, (d) LSB-first, transient faults; axes: µ(MAE) versus truncation width r (bits); series: s = 2, 4, 8, 16, 32]
Figure 6.21: Means of maximum absolute errors for RP-ABFT-protected accelerator. In
each case of a false negative result, the maximum absolute error of each of the
output matrix’s data elements was calculated: this figure shows the means of
those results. Assuming that unmissed results are able to be corrected, these
plots’ results therefore represent the average expected worst-element errors
introduced by RP-ABFT.
upon which informed decisions were made regarding the truncation of parts of the standard
ABFT protection circuitry needed for the chosen operator. Two distinct versions of
RP-ABFT, involving MSB- and LSB-first truncation, were theorised and implemented in
hardware. Experiments were conducted to ascertain the implications RP-ABFT has
upon area and performance overheads, and fault injection simulations were carried out to
evaluate the fault observability of the developed protection mechanism.
The findings herein include that bit-width reduction of ABFT circuitry within a
fault-tolerant accelerator used for multiplying pairs of 32×32 matrices resulted in the reduction
of incurred area overhead by 16.7% and recovery of 8.27% of fmax versus the equivalent
‘vanilla’ ABFT protection at the cost of introducing average and maximum absolute output
errors of 0.430% and 0.927%, respectively, of the maximum absolute output value under
transient fault injection.
[Figure 6.22 panels: (a) MSB-first, permanent faults, (b) MSB-first, transient faults, (c) LSB-first, permanent faults, (d) LSB-first, transient faults; axes: area-scaled proportion of results versus truncation width r (bits); series: s = 2, 4, 8, 16, 32]
Figure 6.22: Area-scaled detected fault proportions for RP-ABFT-protected accelerator,
with total resource usage figures used to scale for s and linear scaling used
for n.
6.6.1 Future Work
Several possible areas of future research regarding RP-ABFT have been identified.
Through the manipulation of output error threshold values, tradeoffs between false
positives and false negatives are achievable. Enhancements to the checksumming logic,
particularly for performance improvement, are also possible, along with the exploration of
output-only truncation to introduce additional tradeoff opportunities between overhead
and maximum allowed output error. Finally, the feasibility of hybrid RP-ABFT-online
arithmetic implementations could be explored: by combining LSB-first truncation
RP-ABFT with existing work on online arithmetic [77] [78]—operations whose results
settle from MSB first—it is believed to be possible to create robust, overclocking-friendly
circuitry with self-verifying properties.
[Figure 6.23 panels: (a) MSB-first, permanent faults, (b) MSB-first, transient faults, (c) LSB-first, permanent faults, (d) LSB-first, transient faults; axes: area-scaled proportion of results versus truncation width r (bits); series: s = 2, 4, 8, 16, 32]
Figure 6.23: Area-scaled false positive fault proportions for RP-ABFT-protected accelerator,
with total resource usage figures used to scale for s and linear scaling
used for n.
ABFT     Truncation  Truncation      Matrix size s
enabled  type        width r (bits)  2             4             8             16             32
✗        -           -               692 (1.30%)   1064 (2.00%)  1788 (3.36%)  3336 (6.27%)   6655 (12.51%)
✓        None        -               2059 (3.87%)  2836 (5.33%)  4389 (8.25%)  8336 (15.67%)  15976 (30.03%)
✓        MSB-first   4               1436 (2.70%)  1899 (3.57%)  2926 (5.50%)  4926 (9.26%)   9182 (17.26%)
✓        MSB-first   8               1351 (2.54%)  1809 (3.40%)  2798 (5.26%)  4714 (8.86%)   8852 (16.64%)
✓        MSB-first   12              1303 (2.45%)  1750 (3.29%)  2703 (5.08%)  4565 (8.58%)   8539 (16.05%)
✓        MSB-first   16              1234 (2.32%)  1644 (3.09%)  2564 (4.82%)  4362 (8.20%)   8182 (15.38%)
✓        MSB-first   20              1154 (2.17%)  1580 (2.97%)  2437 (4.58%)  4155 (7.81%)   7927 (14.90%)
✓        MSB-first   24              1080 (2.03%)  1516 (2.85%)  2351 (4.42%)  4080 (7.67%)   7538 (14.17%)
✓        MSB-first   28              1005 (1.89%)  1415 (2.66%)  2218 (4.17%)  3873 (7.28%)   7272 (13.67%)
✓        LSB-first   4               2655 (4.99%)  3394 (6.38%)  5011 (9.42%)  8496 (15.97%)  15694 (29.50%)
✓        LSB-first   8               2516 (4.73%)  3245 (6.10%)  4825 (9.07%)  8289 (15.58%)  15401 (28.95%)
✓        LSB-first   12              2378 (4.47%)  3102 (5.83%)  4634 (8.71%)  8092 (15.21%)  15056 (28.30%)
✓        LSB-first   16              2245 (4.22%)  2963 (5.57%)  4442 (8.35%)  7820 (14.70%)  14699 (27.63%)
✓        LSB-first   20              2107 (3.96%)  2830 (5.32%)  4277 (8.04%)  7597 (14.28%)  14385 (27.04%)
✓        LSB-first   24              1958 (3.68%)  2633 (4.95%)  4102 (7.71%)  7368 (13.85%)  14103 (26.51%)
✓        LSB-first   28              1766 (3.32%)  2527 (4.75%)  3942 (7.41%)  7097 (13.34%)  13704 (25.76%)
Table 6.1: Baseline, MSB-first & LSB-first truncated accelerator LUT usage, containing
the raw resource usage figures obtained for all implementations. Percentages
of the total number of each of these resources for the target device are also
included.
ABFT     Truncation  Truncation      Matrix size s
enabled  type        width r (bits)  2             4             8             16            32
✗        -           -               1213 (1.14%)  1649 (1.55%)  2511 (2.36%)  4213 (3.96%)  7618 (7.16%)
✓        None        -               1915 (1.80%)  2511 (2.36%)  3703 (3.48%)  6033 (5.67%)  10693 (10.05%)
✓        MSB-first   4               1713 (1.61%)  2224 (2.09%)  3266 (3.07%)  5256 (4.94%)  9331 (8.77%)
✓        MSB-first   8               1670 (1.57%)  2181 (2.05%)  3203 (3.01%)  5160 (4.85%)  9172 (8.62%)
✓        MSB-first   12              1639 (1.54%)  2139 (2.01%)  3139 (2.95%)  5054 (4.75%)  9023 (8.48%)
✓        MSB-first   16              1607 (1.51%)  2096 (1.97%)  3086 (2.90%)  4958 (4.66%)  8863 (8.33%)
✓        MSB-first   20              1564 (1.47%)  2054 (1.93%)  3022 (2.84%)  4862 (4.57%)  8704 (8.18%)
✓        MSB-first   24              1532 (1.44%)  2011 (1.89%)  2958 (2.78%)  4831 (4.54%)  8544 (8.03%)
✓        MSB-first   28              1490 (1.40%)  1958 (1.84%)  2905 (2.73%)  4735 (4.45%)  8395 (7.89%)
✓        LSB-first   4               1883 (1.77%)  2479 (2.33%)  3639 (3.42%)  5958 (5.60%)  10544 (9.91%)
✓        LSB-first   8               1841 (1.73%)  2426 (2.28%)  3575 (3.36%)  5863 (5.51%)  10385 (9.76%)
✓        LSB-first   12              1809 (1.70%)  2383 (2.24%)  3511 (3.30%)  5778 (5.43%)  10225 (9.61%)
✓        LSB-first   16              1777 (1.67%)  2341 (2.20%)  3469 (3.26%)  5682 (5.34%)  10076 (9.47%)
✓        LSB-first   20              1734 (1.63%)  2298 (2.16%)  3405 (3.20%)  5586 (5.25%)  9916 (9.32%)
✓        LSB-first   24              1702 (1.60%)  2245 (2.11%)  3352 (3.15%)  5490 (5.16%)  9757 (9.17%)
✓        LSB-first   28              1660 (1.56%)  2202 (2.07%)  3288 (3.09%)  5394 (5.07%)  9597 (9.02%)
Table 6.2: Baseline, MSB-first & LSB-first truncated accelerator FF usage, containing the
raw resource usage figures obtained for all implementations. Percentages of the
total number of each of these resources for the target device are also included.
ABFT     Resource  Matrix size s
enabled            2           4           8            16           32
✗        BRAM      12 (4.29%)  24 (8.57%)  48 (17.14%)  96 (34.29%)  192 (68.57%)
✗        DSP       8 (3.64%)   16 (7.27%)  32 (14.55%)  64 (29.09%)  128 (58.18%)
✓        BRAM      12 (4.29%)  24 (8.57%)  48 (17.14%)  96 (34.29%)  192 (68.57%)
✓        DSP       12 (5.45%)  20 (9.09%)  36 (16.36%)  68 (30.91%)  132 (60.00%)
Table 6.3: Baseline, MSB-first & LSB-first truncated accelerator BRAM & DSP usage,
containing the raw resource usage figures obtained for all implementations.
Percentages of the total number of each of these resources for the target device are
also included.
ABFT     Truncation  Truncation      Matrix size s
enabled  type        width r (bits)  2      4      8       16      32
✗        -           -               2.59%  4.85%  9.35%   18.40%  36.61%
✓        None        -               3.85%  6.34%  11.31%  21.64%  42.16%
✓        MSB-first   4               3.51%  5.83%  10.52%  19.85%  38.65%
✓        MSB-first   8               3.46%  5.78%  10.44%  19.73%  38.46%
✓        MSB-first   12              3.43%  5.74%  10.38%  19.63%  38.28%
✓        MSB-first   16              3.39%  5.68%  10.31%  19.52%  38.07%
✓        MSB-first   20              3.35%  5.64%  10.23%  19.40%  37.91%
✓        MSB-first   24              3.30%  5.60%  10.18%  19.35%  37.69%
✓        MSB-first   28              3.26%  5.54%  10.10%  19.23%  37.53%
✓        LSB-first   4               4.13%  6.59%  11.59%  21.69%  42.00%
✓        LSB-first   8               4.05%  6.51%  11.48%  21.57%  41.82%
✓        LSB-first   12              3.98%  6.43%  11.38%  21.46%  41.62%
✓        LSB-first   16              3.91%  6.36%  11.28%  21.31%  41.42%
✓        LSB-first   20              3.83%  6.29%  11.19%  21.18%  41.23%
✓        LSB-first   24              3.76%  6.18%  11.09%  21.05%  41.06%
✓        LSB-first   28              3.66%  6.12%  11.00%  20.90%  40.84%
Table 6.4: Baseline, MSB-first & LSB-first truncated accelerator total resource usage,
containing the means of proportional resource usage figures obtained for all
implementations to give an indication of overall resource utilisations.
ABFT     Truncation  Truncation      Matrix size s, fmax (MHz)
enabled  type        width r (bits)  2       4       8       16      32
✗        -           -               117.10  118.72  113.78  114.29  102.37
✓        None        -               79.42   78.52   74.48   71.00   60.57
✓        MSB-first   4               118.00  106.17  110.67  103.60  98.39
✓        MSB-first   8               118.20  111.45  114.58  101.63  98.81
✓        MSB-first   12              115.18  112.90  108.17  104.62  99.86
✓        MSB-first   16              120.27  111.80  115.46  103.54  101.98
✓        MSB-first   20              114.48  110.72  109.53  104.02  100.89
✓        MSB-first   24              110.33  114.64  113.87  107.72  101.84
✓        MSB-first   28              123.31  116.39  109.48  110.04  102.75
✓        LSB-first   4               78.62   77.25   73.45   70.00   61.34
✓        LSB-first   8               81.24   77.32   71.76   69.70   62.31
✓        LSB-first   12              80.07   78.78   74.61   68.13   62.49
✓        LSB-first   16              77.59   78.60   74.95   69.62   63.14
✓        LSB-first   20              80.03   78.20   72.83   70.97   64.03
✓        LSB-first   24              78.41   76.17   74.04   70.08   64.86
✓        LSB-first   28              78.94   76.57   75.78   70.95   63.59
Table 6.5: Baseline, MSB-first & LSB-first truncated accelerator fmax. Averaged execution
times and maximum operating frequencies achieved are shown for each design
to allow side-by-side comparison of unprotected and the range of protected
implementations.
Truncation type  Truncation width r (bits)  csin width (bits)  csout width (bits)
None             -                          32 + log2(s)       64
MSB-first        4                          28                 28
MSB-first        8                          24                 24
MSB-first        12                         20                 20
MSB-first        16                         16                 16
MSB-first        20                         12                 12
MSB-first        24                         8                  8
MSB-first        28                         4                  4
LSB-first        4                          28 + log2(s)       60
LSB-first        8                          24 + log2(s)       56
LSB-first        12                         20 + log2(s)       52
LSB-first        16                         16 + log2(s)       48
LSB-first        20                         12 + log2(s)       44
LSB-first        24                         8 + log2(s)        40
LSB-first        28                         4 + log2(s)        36
Table 6.6: RP-ABFT input & output checksum widths, listing the input and output checksum
widths for each of the designs compiled. n = 32 in all cases.
[Figure 6.24 panels: (a) MSB-first, permanent faults, (b) MSB-first, transient faults, (c) LSB-first, permanent faults, (d) LSB-first, transient faults; axes: area-scaled proportion of results versus truncation width r (bits); series: s = 2, 4, 8, 16, 32]
Figure 6.24: Area-scaled false negative fault proportions for RP-ABFT-protected accelerator,
with total resource usage figures used to scale for s and linear scaling
used for n.
Algorithm 12 RP-ABFT fault injection simulation
 1: create s × s matrix A
 2: create s × s matrix B
 3: A.rand_fill()
 4: B.rand_fill()
 5: create s × s matrix Cref
 6: Cref ← AB
 7: create (s + 1) × s matrix Ac
 8: create s × (s + 1) matrix Br
 9: Ac ← A.add_cs(‘col’, truncation_type, n, r)
10: Br ← B.add_cs(‘row’, truncation_type, n, r)
11: create (s + 1) × (s + 1) matrix Cf
12: if fault_type = ‘perm’ then
13:     bit_mask, faulty_cols ← generate_bit_mask(s, n, 1, 1, 1)
14:     for i = 0 to s do
15:         for j = 0 to s do
16:             Cf[i][j] ← bit_mask[j]
17:             for k = 0 to s − 1 do
18:                 Cf[i][j] ← ((Cf[i][j] + Ac[i][k] × Br[k][j]) mod 2^{2n}) bitwise_or bit_mask[j]
19:             end for
20:         end for
21:     end for
22: else
23:     bit_mask, faulty_cols ← generate_bit_mask(s, n, 1, 3, 1)
24:     for i = 0 to s do
25:         for j = 0 to s do
26:             Cf[i][j] ← bit_mask[i][j][0]
27:             for k = 0 to s − 1 do
28:                 Cf[i][j] ← ((Cf[i][j] + Ac[i][k] × Br[k][j]) mod 2^{2n}) bitwise_or bit_mask[i][j][k + 1]
29:             end for
30:         end for
31:     end for
32: end if
33: data_ok ← Cf.get_data() = Cref
34: cs_ok ← Cf.check_cs(truncation_type, n, r)
35: result_type ← classify_result(data_ok, cs_ok)
36: located ← Cf.diagnose_cs(truncation_type, n, r) = faulty_cols
131
Algorithm 13 add_cs() procedure used in RP-ABFT fault injection simulation
Require: cs_type, trunc_type, n, r
1: if cs_type = ‘col’ then
2:     create (s + 1) × s matrix Ac
3:     Ac[0..s − 1][..] ← A
4:     for j = 0 to s − 1 do
5:         Ac[s][j] ← 0
6:         for i = 0 to s − 1 do
7:             if trunc_type = ‘none’ then
8:                 Ac[s][j] ← (Ac[s][j] + A[i][j]) mod 2^{2n}
9:             else if trunc_type = ‘msb_first’ then
10:                 Ac[s][j] ← (Ac[s][j] + A[i][j]) mod 2^{n−r}
11:             else
12:                 Ac[s][j] ← Ac[s][j] + (A[i][j] ≫ r)
13:             end if
14:         end for
15:     end for
16:     return Ac
17: else if cs_type = ‘row’ then
18:     create s × (s + 1) matrix Br
19:     Br[..][0..s − 1] ← B
20:     for i = 0 to s − 1 do
21:         Br[i][s] ← 0
22:         for j = 0 to s − 1 do
23:             if trunc_type = ‘none’ then
24:                 Br[i][s] ← (Br[i][s] + B[i][j]) mod 2^{2n}
25:             else if trunc_type = ‘msb_first’ then
26:                 Br[i][s] ← (Br[i][s] + B[i][j]) mod 2^{n−r}
27:             else
28:                 Br[i][s] ← Br[i][s] + (B[i][j] ≫ r)
29:             end if
30:         end for
31:     end for
32:     return Br
33: else
34:     {Perform steps for cs_type = ‘col’ and ‘row’, returning (s + 1) × (s + 1) matrix Cf}
35: end if
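The three accumulation rules of Algorithm 13 can be illustrated with a short Python sketch. It assumes non-negative integer matrix entries stored as nested lists. The function names and the label `"lsb_first"` for the final (else) branch are hypothetical choices made for this example.

```python
def _accumulate(total, value, trunc_type, n, r):
    """One checksum accumulation step for the given truncation type."""
    if trunc_type == "none":
        return (total + value) % (1 << (2 * n))
    if trunc_type == "msb_first":
        return (total + value) % (1 << (n - r))
    return total + (value >> r)  # LSB-first: drop the r low-order bits


def add_col_checksum(a, trunc_type, n, r):
    """Return a copy of square matrix a with a column-checksum row appended."""
    s = len(a)
    checksum = [0] * s
    for j in range(s):
        for i in range(s):
            checksum[j] = _accumulate(checksum[j], a[i][j], trunc_type, n, r)
    return [row[:] for row in a] + [checksum]


def add_row_checksum(b, trunc_type, n, r):
    """Return a copy of matrix b with a row-checksum column appended."""
    out = []
    for row in b:
        total = 0
        for value in row:
            total = _accumulate(total, value, trunc_type, n, r)
        out.append(row[:] + [total])
    return out


m = [[3, 5], [7, 9]]
col_none = add_col_checksum(m, "none", 4, 2)[-1]
col_msb = add_col_checksum(m, "msb_first", 4, 2)[-1]
col_lsb = add_col_checksum(m, "lsb_first", 4, 2)[-1]
row_none = [row[-1] for row in add_row_checksum(m, "none", 4, 2)]
```

With n = 4 and r = 2, the MSB-first variant keeps only the two low-order bits of each running sum, whereas the LSB-first variant sums the operands after discarding their two low-order bits.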
Algorithm 14 check_cs() procedure used in RP-ABFT fault injection simulation
Require: truncation_type, n, r
1: create s × s matrix Ccopy
2: Ccopy ← Cf.get_data()
3: create (s + 1) × (s + 1) matrix Ccopy,f
4: Ccopy,f ← Ccopy.add_cs(cs_type, ‘full’, n, r)
5: for all cs_type in {‘col’, ‘row’} do
6:     create s + 1 vector cs
7:     create s + 1 vector cscopy
8:     cs ← Cf.get_cs(cs_type)
9:     cscopy ← Ccopy,f.get_cs(cs_type)
10:     if truncation_type = ‘none’ then
11:         if cs ≠ cscopy then
12:             return false
13:         end if
14:     else if truncation_type = ‘msb_first’ then
15:         for x = 0 to s do
16:             if cs[x] mod 2^{n−r} ≠ cscopy[x] then
17:                 return false
18:             end if
19:         end for
20:     else
21:         θ = s^2 2^{n−1}
22:         θc = s^3 2^{n−r}
23:         for x = 0 to s − 1 do
24:             if |(cs[x] − cscopy[x]) mod 2^{2n−r}| > θ then
25:                 return false
26:             end if
27:         end for
28:         if |(cs[s] − cscopy[s]) mod 2^{2(n−r)}| > θc then
29:             return false
30:         end if
31:     end if
32: end for
33: return true
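The per-element comparisons in Algorithm 14 can be sketched as below. Several points are assumptions made for illustration: the modular difference in the LSB-first branch is interpreted as a signed (wrap-around) value before its magnitude is taken, the threshold `theta` is passed in by the caller rather than computed, the separately thresholded corner element is omitted, and the function names are hypothetical.

```python
def signed_mod_diff(x, y, width):
    """x - y reduced modulo 2**width, mapped to [-2**(width-1), 2**(width-1))."""
    m = 1 << width
    d = (x - y) % m
    return d - m if d >= m // 2 else d


def checksums_match(cs, cs_copy, trunc_type, n, r, theta=0):
    """Compare stored checksums cs with recomputed checksums cs_copy."""
    if trunc_type == "none":
        return cs == cs_copy  # exact equality required
    if trunc_type == "msb_first":
        m = 1 << (n - r)  # only the retained low-order bits are compared
        return all(c % m == cc for c, cc in zip(cs, cs_copy))
    width = 2 * n - r  # LSB-first: tolerate small discrepancies
    return all(abs(signed_mod_diff(c, cc, width)) <= theta
               for c, cc in zip(cs, cs_copy))


exact_ok = checksums_match([10, 14], [10, 14], "none", 4, 2)
msb_ok = checksums_match([10, 14], [2, 2], "msb_first", 4, 2)
msb_bad = checksums_match([10, 14], [2, 3], "msb_first", 4, 2)
lsb_within = checksums_match([63], [1], "lsb_first", 4, 2, theta=3)
lsb_beyond = checksums_match([63], [1], "lsb_first", 4, 2, theta=1)
```

The last two calls show the intended effect of the tolerance: with a 6-bit modulus, 63 and 1 differ by only 2 once wrap-around is taken into account, so the pair passes for a threshold of 3 and fails for a threshold of 1.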
[Figure 6.25 comprises four panels: (a) MSB-first, permanent faults; (b) MSB-first, transient
faults; (c) LSB-first, permanent faults; (d) LSB-first, transient faults. Each panel plots the
area-scaled proportion of results against truncation width r (bits), from 0 to 28, with one
series for each of s = 2, 4, 8, 16 and 32.]
Figure 6.25: Area-scaled masked fault proportions for RP-ABFT-protected accelerator,
with total resource usage figures used to scale for s and linear scaling used
for n.
[Figure 6.26 comprises four panels: (a) MSB-first, permanent faults; (b) MSB-first, transient
faults; (c) LSB-first, permanent faults; (d) LSB-first, transient faults. Each panel plots the
area-scaled proportion of results (0 to 1) against truncation width r (bits), from 0 to 28,
with one series for each of s = 2, 4, 8, 16 and 32.]
Figure 6.26: Area-scaled located fault proportions for RP-ABFT-protected accelerator,
with total resource usage figures used to scale for s and linear scaling used
for n.
7 Conclusion
The work described in this thesis addresses current hardware reliability concerns. As the
experimental results herein have shown, by employing low-overhead fault tolerance engineers
can gain highly robust datapaths by sacrificing some throughput and, most importantly, only
limited additional area. Keeping area low is critical both for power efficiency and for
limiting the growth in fault occurrence that accompanies increased physical size. As reliability
becomes of greater concern for a wider variety of applications and settings due to increases
in variability, degradation and fault susceptibility caused by continued process scaling,
reliability mechanisms that necessitate the introduction of only limited overheads, such as
algorithm-based fault tolerance (ABFT) and its derivatives, are likely to prove popular.
While the majority of the applicational focus within the technical content of this thesis
has been on a single mathematical operation, matrix multiplication, the techniques developed
are applicable to a wide range of other operators with relatively minor implementational
changes. Generality has been demonstrated through the exploration of fault observability
for additional operators (matrix addition and matrix-vector addition, the latter
representative of the behaviour of linear filters), and the ABFT-hardening of other linear
algebra-heavy applications, including those featuring lower- and upper-triangular (LU)
decompositions or discrete Fourier transforms (DFTs), remains possible.
7.1 Summary
A benchmark hardware accelerator using ABFT for runtime datapath error detection was
introduced as a case study for demonstrating the ability to achieve high fault detectability
without incurring significant area (and consequently power) or performance overheads.
Near-100% detectability was achieved while an area penalty of just 7.87% was incurred.
Throughput was reduced by 31.3%. The area overhead figure, in particular, compares extremely
favourably with that which would be incurred through using modular redundancy
to achieve similar levels of resilience, with triple modular redundancy (TMR) necessitating
in excess of 200% additional area, for example.
Full fault detection and repair was achieved within the same accelerator, with two
mechanisms, data-shifting logic and dynamic partial reconfiguration (DPR), implemented
to allow operands to be routed around resources identified as faulty during operation.
Area overheads for a complete, hardened application of 12.4% and 10.1% were encountered
for data-shifting logic and DPR, respectively; the latter represents a 50.7% overhead
reduction over the former. In the DPR case, area overhead was reduced at the expense of
partial bitstream storage: 5.10MB of memory was required. Single-fault locatability of
96.7% was achieved in return for a 19.7% throughput penalty during fault-free operation,
increasing to an average of 33.8% in order to correct a single fault.
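The 50.7% figure is not the relative difference between the two totals of 12.4% and 10.1%. One reading that is consistent with the numbers quoted, offered here as an assumption rather than as the stated derivation, is that the reduction is measured on the area added for repair beyond the 7.87% detection-only overhead. The short calculation below uses only the rounded percentages given in this summary; the variable names are illustrative.

```python
detect_only = 7.87   # ABFT detection-only area overhead (%)
shift_total = 12.4   # detection plus data-shifting repair (%)
dpr_total = 10.1     # detection plus DPR-based repair (%)

# Area attributable to the repair mechanism alone (assumed interpretation).
shift_extra = shift_total - detect_only
dpr_extra = dpr_total - detect_only

repair_reduction = 100 * (shift_extra - dpr_extra) / shift_extra
total_reduction = 100 * (shift_total - dpr_total) / shift_total

print(f"repair-only reduction: {repair_reduction:.1f}%")
print(f"total-overhead reduction: {total_reduction:.1f}%")
```

Under this assumption the repair-only reduction comes out within about a tenth of a percentage point of the reported value, a gap that rounding of the inputs could account for, whereas the reduction in total overhead is considerably smaller.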
A simulation framework was developed to allow the fault observabilities of several
ABFT-protected operators to be established. Because the simulation software is efficient,
hundreds of millions of simulations could be completed within hours, allowing accurate
insights into fault observability to be gained quickly. Object-oriented and user-friendly,
the framework is extensible; more operators and fault patterns could be added with relative
ease. Analysis of the results obtained demonstrated ABFT to be an effective error
detection mechanism across the three operators considered.
The final technical work presented in this thesis covered the development of
reduced-precision algorithm-based fault tolerance (RP-ABFT), a derivative of ABFT in which
selective bit-width reduction is made in order to trade fault detectability for reductions
in the area and performance overheads incurred. RP-ABFT using checksum logic truncation from
the least-significant bit (LSB) first was found to be effective in achieving this: 16.7%
of the area and 8.27% of the throughput overheads caused by the introduction of ABFT
protection were recovered by allowing low-magnitude errors (below 1% of the maximum
absolute output value) to propagate while retaining robustness against faults likely to
cause high-magnitude errors. Findings indicate that RP-ABFT with LSB-first truncation
represents a useful addition to the reliability ‘toolbox’.
7.2 Future Work
While matrix multiplication represented a solid case study for the implementation of a
complete hardware fault tolerance solution presented in Chapters 3 and 4, the generalisation
of the techniques developed therein to other linear algebraic operators is considered
to be an important avenue for further work. Refinements to the accelerator design,
particularly facilitating more comprehensive pipelining of the checksum generation and
verification logic to allow timing model-inferred maximum operating frequency (fmax) recovery,
also remain possible.
Several areas of future work have been determined based upon outcomes of Chapter 5’s
fault observability experiments. The first concerns a potential design tool capable of creating
hardware for the acceleration of a range of mathematical operations, and combinations
thereof, for linear algebra and signal processing applications. Basing such a tool upon
an established, and preferably cross-platform, high-level development framework such as
Open Computing Language (OpenCL) [79], which is supported by both major field-programmable
gate array (FPGA) vendors [80] [81], would allow for verification across a wider array of
usage scenarios and facilitate greater awareness of the methods developed. The data
underpinning Chapter 5’s results will be fundamental to the selection of design parameters
made by both the tool itself and its users. Additional mathematical operations suitable
for ABFT hardening, particularly the DFT, will also be explored.
Finally, regarding Chapter 6’s discussion of RP-ABFT, developments are planned for
its integration with both a high-level development framework for more accessible
area-performance-reliability tuning and online arithmetic operators to achieve bounded
overclocking error in ‘reliably unreliable’ circuitry.
Bibliography
[1] C. Stroud, S. Konala, P. Chen, and M. Abramovici, “Built-in Self-test of Logic Blocks
in FPGAs (Finally, a Free Lunch: BIST without Overhead!),” in IEEE VLSI Test
Symposium, 1996.
[2] J. S. J. Wong, P. Sedcole, and P. Y. K. Cheung, “A Transition Probability-based Delay
Measurement Method for Arbitrary Circuits on FPGAs,” in International Conference
on Field-programmable Technology (FPT), 2008.
[3] M. Abramovici, C. Stroud, C. Hamilton, S. Wijesuriya, and V. Verma, “Using Roving
STARs for On-line Testing and Diagnosis of FPGAs in Fault-tolerant Applications,”
in IEEE International Test Conference (ITC), 1999.
[4] J. M. Levine, E. Stott, G. A. Constantinides, and P. Y. K. Cheung, “Online Measure-
ment of Timing in Circuits: for Health Monitoring and Dynamic Voltage & Frequency
Scaling,” in IEEE International Symposium on Field-Programmable Custom Comput-
ing Machines (FCCM), 2012.
[5] E. Stott and P. Y. K. Cheung, “Improving FPGA Reliability with Wear-levelling,”
in International Conference on Field-programmable Logic and Applications (FPL),
2011.
[6] J. Lach, W. H. Mangione-Smith, and M. Potkonjak, “Low-overhead Fault-tolerant
FPGA Systems,” IEEE Transactions on VLSI Systems, vol. 6, no. 2, 1998.
[7] F. Hanchek and S. Dutt, “Node Covering-based Defect and Fault Tolerance Methods
for Increased Yield in FPGAs,” in International Conference on VLSI Design, 1996.
[8] J. M. Emmert and D. K. Bhatia, “A Fault-tolerant Technique for FPGAs,” Journal
of Electronic Testing: Theory and Applications (JETTA), vol. 16, no. 6, 2000.
[9] R. F. DeMara and K. Zhang, “Autonomous FPGA Fault-handling through Compet-
itive Runtime Reconfiguration,” in NASA/DoD Conference on Evolvable Hardware
(EH), 2005.
[10] S. Srinivasan, P. Mangalagiri, Y. Xie, N. Vijaykrishnan, and K. Sarpatwari, “FLAW:
FPGA Lifetime Awareness,” in ACM/IEEE Design Automation Conference, 2006.
[11] S. Zafar, Y. Kim, V. Narayanan, C. Cabral, V. Paruchuri, B. Doris, J. Stathis,
A. Callegari, and M. Chudzik, “A Comparative Study of NBTI and PBTI (Charge
Trapping) in SiO2/HfO2 Stacks with FUSI, TiN, Re Gates,” in IEEE Symposium on
VLSI Technology, 2006.
[12] E. Stott, P. Sedcole, and P. Y. K. Cheung, “Fault Tolerance and Reliability in Field-
programmable Gate Arrays,” IET Computers & Digital Techniques, vol. 4, no. 3,
2010.
[13] S. Kiamehr, A. Amouri, and M. B. Tahoori, “Investigation of NBTI- and PBTI-
induced Aging in Different LUT Implementations,” in International Conference on
Field-Programmable Technology (FPT), 2011.
[14] E. Stott, J. S. J. Wong, and P. Y. K. Cheung, “Degradation Analysis and Mitigation in
FPGAs,” in International Conference on Field-programmable Logic and Applications
(FPL), 2010.
[15] Altera, “Cyclone III FPGAs.”
http://www.altera.com/products/fpga/cyclone-series/cyclone-iii/overview.html.
[16] C. Stroud, E. Lee, S. Konala, and M. Abramovici, “Using ILA Testing for BIST in
FPGAs,” in IEEE International Test Conference (ITC), 1996.
[17] A. Alaghi, M. Sadoughi Yarandi, and Z. Navabi, “An Optimum ORA BIST for Mul-
tiple Fault FPGA Look-up Table Testing,” in IEEE Asian Test Symposium (ATS),
2006.
[18] C. Stroud, S. Wijesuriya, C. Hamilton, and M. Abramovici, “Built-in Self-test of
FPGA Interconnect,” in IEEE International Test Conference (ITC), 1998.
[19] I. G. Harris and R. Tessier, “Testing and Diagnosis of Interconnect Faults in Cluster-
based FPGA Architectures,” IEEE Transactions on Computer-aided Design of Inte-
grated Circuits and Systems, vol. 21, no. 11, 2002.
[20] C.-L. Hsu and T.-H. Chen, “Built-in Self-test Design for Fault Detection and Fault
Diagnosis in SRAM-based FPGAs,” IEEE Transactions on Instrumentation and Mea-
surement, vol. 58, no. 7, 2009.
[21] J. Liu and S. Simmons, “BIST Diagnosis of Interconnect Fault Locations in FPGAs,”
in Canadian Conference on Electrical and Computer Engineering, 2003.
[22] P. Girard, O. Heron, S. Pravossoudovitch, and M. Renovell, “Defect Analysis for
Delay-fault BIST in FPGAs,” in IEEE On-line Testing Symposium (IOLTS), 2003.
[23] J. S. J. Wong, P. Sedcole, and P. Y. K. Cheung, “Self-measurement of Combinatorial
Circuit Delays in FPGAs,” ACM Transactions on Reconfigurable Technology and
Systems (TRETS), vol. 2, no. 2, 2009.
[24] I. G. Harris, P. R. Menon, and R. Tessier, “BIST-based Delay Path Testing in FPGA
Architectures,” in IEEE International Test Conference (ITC), 2001.
[25] C.-C. Wang, J.-J. Liou, Y.-L. Peng, C.-T. Huang, and C.-W. Wu, “A BIST Scheme
for FPGA Interconnect Delay Faults,” in IEEE VLSI Test Symposium, 2005.
[26] Y.-L. Peng, C.-W. Wu, J.-J. Liou, and C.-T. Huang, “BIST-based Diagnosis Scheme
for Field-programmable Gate Array Interconnect Delay Faults,” IET Computers &
Digital Techniques, vol. 1, no. 6, 2007.
[27] M. A. Lusco, J. L. Dailey, and C. E. Stroud, “Built-in Self-test for Multipliers in
Altera Cyclone II Field-programmable Gate Arrays,” in Southeastern Symposium on
System Theory (SSST), 2011.
[28] M. D. Pulukuri and C. E. Stroud, “Built-in Self-test of Digital Signal Processors in
Virtex-4 FPGAs,” in Southeastern Symposium on System Theory (SSST), 2009.
[29] Z. Zhang, Z. Wen, and L. Chen, “BIST Approach for Testing Embedded Memory
Blocks in System-on-Chips,” in IEEE Circuits and Systems International Conference
on Testing and Diagnosis (ICTD), 2009.
[30] P. Sedcole, J. S. J. Wong, and P. Y. K. Cheung, “Characterisation of FPGA Clock
Variability,” in IEEE Computer Society Symposium on VLSI (ISVLSI), 2008.
[31] C. E. Stroud and N. S. Da Cunha, “Built-in Self-test of Programmable Clock Buffers
in Virtex-4, Virtex-5 and Virtex-6 FPGAs,” in Southeastern Symposium on System
Theory (SSST), 2011.
[32] S. Sunter and A. Roy, “Adaptive Parametric BIST of High-speed Parallel I/Os via
Standard Boundary Scan,” in IEEE International Test Conference (ITC), 2011.
[33] M. Abramovici, C. Stroud, B. Skaggs, and J. Emmert, “Improving On-line BIST-
based Diagnosis for Roving STARs,” in IEEE International On-line Testing Work-
shop, 2000.
[34] S. Dutt, V. Verma, and V. Suthar, “Built-in Self-test of FPGAs with Provable Diag-
nosabilities and High Diagnostic Coverage with Application to Online Testing,” IEEE
Transactions on Computer-aided Design of Integrated Circuits and Systems, vol. 27,
no. 2, 2008.
[35] V. Suthar and S. Dutt, “Efficient On-line Interconnect Testing in FPGAs with Prov-
able Detectability for Multiple Faults,” in Design, Automation and Test in Europe
(DATE), 2006.
[36] N. R. Shnidman, W. H. Mangione-Smith, and M. Potkonjak, “Fault Scanner for
Reconfigurable Logic,” in Conference on Advanced Research in VLSI, 1997.
[37] M. Abramovici and C. Stroud, “BIST-based Delay-fault Testing in FPGAs,” in IEEE
International On-line Testing Workshop, 2002.
[38] S. D’Angelo, C. Metra, S. Pastore, A. Pogutz, and G. R. Sechi, “Fault-tolerant Voting
Mechanism and Recovery Scheme for TMR FPGA-based Systems,” in IEEE Inter-
national Symposium on Defect and Fault Tolerance in VLSI Systems, 1998.
[39] M. A. Sullivan, H. H. Loomis, and A. A. Ross, “Employment of Reduced-precision
Redundancy for Fault-tolerant FPGA Applications,” in International Symposium on
Field-Programmable Custom Computing Machines (FCCM), 2009.
[40] B. Pratt, M. Fuller, M. Rice, and M. Wirthlin, “Reduced-precision Redundancy for
Reliable FPGA Communications Systems in High-radiation Environments,” IEEE
Transactions on Aerospace and Electronic Systems, vol. 49, no. 1, 2013.
[41] R. Glein, B. Schmidt, F. Rittner, J. Teich, and D. Ziener, “A Self-adaptive SEU Mit-
igation System for FPGAs with an Internal Block RAM Radiation Particle Sensor,”
in International Symposium on Field-Programmable Custom Computing Machines
(FCCM), 2014.
[42] F. G. de Lima Kastensmidt, G. Neuberger, R. F. Hentschke, L. Carro, and R. Reis,
“Designing Fault-tolerant Techniques for SRAM-based FPGAs,” IEEE Design & Test
of Computers, vol. 21, no. 6, 2004.
[43] A. L. Burress and P. K. Lala, “On-line Testable Logic Design for FPGA Implemen-
tation,” in IEEE International Test Conference (ITC), 1997.
[44] R. Karri and N. Mukherjee, “Versatile BIST: an Integrated Approach to On-line/Off-
line BIST,” in IEEE International Test Conference (ITC), 1998.
[45] P. Nigh and W. Maly, “A Self-testing ALU using Built-in Current Sensing,” in IEEE
Custom Integrated Circuits Conference, 1989.
[46] M. Nicolaidis, “On-line Testing for VLSI,” in IEEE International Test Conference
(ITC), 1997.
[47] D. Mange, M. Goeke, D. Madon, A. Stauffer, G. Tempesti, S. Durand, P. Marchal, and
P. Nussbaum, “Embryonics: a New Family of Coarse-grained FPGA with Self-repair
and Self-reproducing Properties,” in IEEE International Symposium on Circuits and
Systems (ISCAS), 1996.
[48] C. Ortega-Sanchez, A. Tyrrell, D. Mange, A. Stauffer, and G. Tempesti, “Reliability
Analysis of a Self-repairing Embryonic Machine,” in Euromicro Conference, 2000.
[49] V. Lakamraju and R. Tessier, “Tolerating Operational Faults in Cluster-based FP-
GAs,” in ACM/SIGDA International Symposium on Field Programmable Gate Arrays
(FPGA), 2000.
[50] J. Narasimhan, K. Nakajima, C. Rim, and A. T. Dahbura, “Yield Enhancement
of Programmable ASIC Arrays by Reconfiguration of Circuit Placements,” IEEE
Transactions on CAD of Integrated Circuits and Systems (TCAD), vol. 13, no. 8,
1994.
[51] A. Mathur and C. L. Liu, “Timing-driven Placement Reconfiguration for Fault Tol-
erance and Yield Enhancement in FPGAs,” in European Design and Test Conference
(DATE), 1996.
[52] B. Girau, P. Marchal, P. Nussbaum, A. Tisserand, and H. F. Restrepo, “Evolvable
Platform for Array Processing: a One-chip Approach,” in International Conference
on Microelectronics for Neural, Fuzzy and Bio-Inspired Systems (MicroNeuro), 1999.
[53] A. Miele and P. di Torino, “A Software Framework for Dynamic Self-repair in Em-
bedded SoCs Exploiting Reconfigurable Devices,” in IEEE International Conference
on Automation, Quality and Testing, Robotics (AQTR), 2010.
[54] K.-H. Huang and J. A. Abraham, “Algorithm-Based Fault Tolerance for Matrix Op-
erations,” IEEE Transactions on Computers, vol. C-33, no. 6, 1984.
[55] S.-J. Wang and N. K. Jha, “Algorithm-based Fault Tolerance for FFT Networks,”
IEEE Transactions on Computers, vol. 43, no. 7, 1994.
[56] A. Jacobs, G. Cieslewski, and A. D. George, “Overhead and Reliability Analysis of
Algorithm-based Fault Tolerance in FPGA Systems,” in International Conference on
Field-programmable Logic and Applications (FPL), 2012.
[57] J. Rexford and N. K. Jha, “Algorithm-based Fault Tolerance for Floating-point Op-
erations in Massively Parallel Systems,” in International Symposium on Circuits and
Systems (ISCAS), vol. 2, 1992.
[58] C. Braun, S. Halder, and H. J. Wunderlich, “A-ABFT: Autonomous Algorithm-based
Fault Tolerance for Matrix Multiplications on Graphics Processing Units,” in Inter-
national Conference on Dependable Systems and Networks (DSN), 2014.
[59] J.-Y. Jou and J. A. Abraham, “Fault-tolerant Matrix Arithmetic and Signal Process-
ing on Highly Concurrent Computing Structures,” Proceedings of the IEEE, vol. 74,
no. 5, 1986.
[60] J. J. Davis and P. Y. K. Cheung, “Datapath Fault Tolerance for Parallel Accelera-
tors,” in International Conference on Field-programmable Technology (FPT), 2013.
[61] Xilinx, “Zynq-7000 All Programmable SoC.”
http://www.xilinx.com/products/silicon-devices/soc/zynq-7000.html.
[62] Xilinx, “Zynq-7000 AP SoC (XC7Z010 and XC7Z020) Data Sheet.”
http://www.xilinx.com/support/documentation/data_sheets/ds187-XC7Z010-XC7Z020-Data-Sheet.pdf.
[63] Xilinx, “LogiCORE IP XPS Central DMA Controller (v2.03a).”
http://www.xilinx.com/support/documentation/ip_documentation/xps_central_dma.pdf.
[64] Xilinx, “LogiCORE IP AXI Block RAM (BRAM) Controller (v1.03a).”
http://www.xilinx.com/support/documentation/ip_documentation/axi_bram_ctrl/v1_03_a/ds777_axi_bram_ctrl.pdf.
[65] Xilinx, “LogiCORE AXI Interconnect IP (v1.01.a).”
http://www.xilinx.com/support/documentation/ip_documentation/ds768_axi_interconnect.pdf.
[66] Xilinx, “LogiCORE IP Interrupt Control (v2.01a).”
http://www.xilinx.com/support/documentation/ip_documentation/interrupt_control.pdf.
[67] Avnet, “Introduction to the Zynq-7000 Extensible Processing Platform.”
http://www.em.avnet.com/en-us/design/trainingandevents/Documents/X-FEST%202012%20PRESENTATIONS/xfest12_pdf_zynq_intro_v1_1_april29.pdf.
[68] Xilinx, “7 Series DSP48E1 Slice, UG479.”
http://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf.
[69] Python, “16.6. multiprocessing – Process-based ‘Threading’ Interface.”
http://docs.python.org/2/library/multiprocessing.html.
[70] AMD, “AMD Server Processors.” http://www.amd.com/en-us/products/server.
[71] J. J. Davis and P. Y. K. Cheung, “Reducing Overheads for Fault-tolerant Datapaths
with Dynamic Partial Reconfiguration,” in IEEE International Symposium on Field-
programmable Custom Computing Machines (FCCM), 2014.
[72] J. J. Davis and P. Y. K. Cheung, “Achieving Low-overhead Fault Tolerance for Paral-
lel Accelerators with Dynamic Partial Reconfiguration,” in International Conference
on Field-programmable Logic and Applications (FPL), 2014.
[73] Xilinx, “Partial Reconfiguration of a Hardware Accelerator on Zynq-7000
All-programmable SoC Devices.”
http://www.xilinx.com/support/documentation/application_notes/xapp1159-partial-reconfig-hw-accelerator-zynq-7000.pdf.
[74] Avnet, “Avnet Product Brief – ZedBoard.”
http://zedboard.org/sites/default/files/product_briefs/PB-AES-Z7EV-7Z020_G-v12.pdf.
[75] J. J. Davis and P. Y. K. Cheung, “Reduced-precision Algorithm-based Fault Toler-
ance for FPGA-implemented Accelerators,” in International Workshop on Applied
Reconfigurable Computing (ARC), 2016.
[76] Python, “5. Built-in Types.” http://docs.python.org/2/library/stdtypes.html.
[77] K. Shi, D. Boland, E. Stott, S. Bayliss, and G. Constantinides, “Datapath Synthe-
sis for Overclocking: Online Arithmetic for Latency-accuracy Tradeoffs,” in Design
Automation Conference (DAC), 2014.
[78] K. Shi, D. Boland, and G. Constantinides, “Efficient FPGA Implementation of
Digit-parallel Online Arithmetic Operators,” in International Conference on Field-
Programmable Technology (FPT), 2014.
[79] Khronos Group, “OpenCL – Khronos Group.” http://www.khronos.org/opencl.
[80] Altera, “Altera SDK for OpenCL.”
http://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.html.
[81] Xilinx, “Xilinx SDAccel.”
http://www.xilinx.com/publications/prod_mktg/sdnet/sdaccel-wp.pdf.
