Achieving Low-overhead Fault Tolerance for Parallel Accelerators with Dynamic Partial Reconfiguration by Davis, J & Cheung, PYK
Achieving Low-overhead Fault Tolerance for Parallel
Accelerators with Dynamic Partial Reconfiguration
James J. Davis and Peter Y. K. Cheung
Department of Electrical and Electronic Engineering
Imperial College London
London, SW7 2AZ, United Kingdom
E-mail: {james.davis06, p.cheung}@imperial.ac.uk
Abstract—While allowing for the fabrication of increasingly
complex and efficient circuitry, transistor shrinkage and count-
per-device expansion have major downsides: chiefly increased
variation, degradation and fault susceptibility. For this reason,
design-time consideration of fault tolerance will have to be given
to increasing numbers of electronic systems in the future to ensure
yields, reliabilities and lifetimes remain acceptably high. Many
commonly implemented operators are suited to modification
resulting in datapath error detection capabilities with far lower
area requirements. FPGAs are uniquely placed to allow further
area savings to be made when incorporating fault avoidance
mechanisms thanks to their dynamic reconfigurability.
In this paper, we examine the practicalities and costs in-
volved in implementing hardware-software fault tolerance on
a test platform: a parallel matrix multiplication accelerator in
hardware, with controller in software, running on a Xilinx Zynq
system-on-chip. A combination of ‘bolt-on’ error detection logic
and software-triggered routing reconfiguration serves to provide
low-overhead datapath fault tolerance at runtime. Rapid yet
accurate fault diagnoses along with low hardware (area), software
(configuration storage) and performance penalties are achieved.
I. INTRODUCTION
While the structure of FPGAs makes them highly suitable
for the realisation of high-performance hardware, typically
through parallelisation and pipelining, their high transistor
counts and reliance upon RAM for configuration storage make
them particularly sensitive to the effects of variation and
degradation. Their runtime reconfigurability, however, allows
the application of unique strategies for fault correction. Here,
we demonstrate the ability to reduce algorithmic parallelisation
at runtime in order to maintain accurate operation, using both
algorithm-based fault tolerance (ABFT) and dynamic partial
reconfiguration (DPR) techniques to do so while keeping the
overheads incurred as a result low.
The key contributions of this work are:
• the first implementation of ABFT-protected hardware
using DPR for recovery in the presence of failure, and
• a quantitative analysis of the overheads—of resources,
memory and performance—incurred through the in-
corporation of that fault tolerance strategy into a
benchmark hardware accelerator.
II. BACKGROUND
A. Runtime Fault-tolerant Methods
A wide array of fault detection and correction techniques
exist that are specific both to FPGAs and to ASICs in general.
A comprehensive analysis of fault-tolerant methods specific
to FPGAs was published recently [1], and those of particular
relevance to this research are reviewed here. Table I places this
work within a side-by-side comparison of competing families
of fault tolerance strategies.
Modular redundancy, particularly triple (TMR), remains
popular. While design vulnerability to faults is low for those
protected by TMR, its application comes with a high cost—
over 200% area overhead—and its ability to mask faults
breaks down when they occur in more than one replicated
module simultaneously. Fault detection is possible with singly
replicated modules, as in duplicate-with-compare (DWC), but
correction is not. Concurrent error detection (CED) schemes
generally have lower overheads but suffer from confounding:
often, many faults produce the same error codes.
The reconfigurability of FPGAs presents many interesting
options for runtime fault tolerance. DPR can be used to
allow portions of an FPGA to be tested while the remainder
continues to function as normal [2]. While high (up to 96%
for logic [3] and 99% for interconnect [4]) fault coverage has
been achieved, such ‘roving’ schemes introduce limitations
and complications to the operating configuration: the avail-
able resources are reduced in number, path-lengthening forces
frequency penalties to be incurred (2.5–15.1% decreases in
frequency have been reported [2]) and care must be taken to
avoid glitches when resource substitution takes place. Detec-
tion latencies are limited by the rate of chip scanning.
Given the impracticalities of performing on-chip place-
and-route, repair strategies tend to involve the storage of
precompiled alternative configurations [5] or design-time reser-
vation of spare resources [6] that allow the reorganisation of
logic to avoid faults while maintaining identical operation.
Where recompilation at runtime is required, time and power
requirements can be reduced by containing repairs within
subsections of an FPGA [7]. Many faults cannot be tolerated,
however, and spare resource overheads are often high.
B. Algorithm-based Fault Tolerance
The application of fault tolerance at a level above tran-
sistors, gates or small circuits—that is, at an algorithmic
level—makes it possible to reduce the impacts, in terms of
resource usage, performance or both, of those mechanisms
while ensuring that reliability remains high. Many linear
algebra operations, common in FPGA applications, can be
protected with such algorithm-based techniques: examples
include matrix operations [8] and Fourier transformations [9].
Recently published work [10] focussed on design vulnerability
reduction using ABFT applied to a matrix multiplier on an
FPGA. Our previous work [11], meanwhile, used ABFT for
error detection in the same operator while adding additional,
fixed logic for dynamic resource reallocation to facilitate fault
avoidance at runtime. In this work, we call upon ABFT once
again for its low-overhead error detection properties but make
use of DPR for the purposes of fault avoidance.
TABLE I. COMPARISON OF RUNTIME FAULT-TOLERANT METHODS
Method        Detection            Correction         Fault types  Area overhead          Performance overhead  Limitations
Scrubbing     –                    Reconfiguration    Transient    –                      –                     (Must be paired with detection scheme)
DWC           Redundancy           –                  All          High (> 100%)          Low                   No fault locatability
TMR           Redundancy           Masking            All          High (> 200%)          Low                   Becomes DWC after single failure
CED           Parity (or similar)  –                  All          Low                    Low                   Low fault detectability
Re-execution  Time redundancy      Re-execution       Transient    Low–moderate           High (> 100%)         Low–moderate detection latency
Roving        Test vectors         (Reconfiguration)  All          Low–moderate           Moderate              Forces path-lengthening, low detection latency
ABFT          Algorithmic          (Algorithmic)      All          Low–moderate           Moderate              Algorithm-specific
This work     Algorithmic          Reconfiguration    All          Low–moderate (≈ 10%)   Moderate (≈ 25%)      Algorithm-specific
Matrix multiplication was chosen as a case study in this
work since it is used in many hardware-accelerated applica-
tions and because the adaptation of its operation for ABFT is
straightforward. In order to protect the operation Q = AB,
two new matrices—A′ and B′—are formed such that
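\[
A' = \begin{bmatrix}
a_{1,1} & \cdots & a_{1,N} \\
\vdots & \ddots & \vdots \\
a_{N,1} & \cdots & a_{N,N} \\
\sum_{n=1}^{N} a_{n,1} & \cdots & \sum_{n=1}^{N} a_{n,N}
\end{bmatrix},
\quad
B' = \begin{bmatrix}
b_{1,1} & \cdots & b_{1,N} & \sum_{n=1}^{N} b_{1,n} \\
\vdots & \ddots & \vdots & \vdots \\
b_{N,1} & \cdots & b_{N,N} & \sum_{n=1}^{N} b_{N,n}
\end{bmatrix}
\]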
A′ is an expanded version of A: its (N + 1)th row
comprises checksums of its previous rows’ values on a column-
by-column basis. Conversely, the (N + 1)th column of B′
comprises checksums of its previous columns’ values on a
row-by-row basis. A′B′ yields
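\[
Q' = \begin{bmatrix}
q_{1,1} & \cdots & q_{1,N} & \sum_{n=1}^{N} q_{1,n} \\
\vdots & \ddots & \vdots & \vdots \\
q_{N,1} & \cdots & q_{N,N} & \sum_{n=1}^{N} q_{N,n} \\
\sum_{n=1}^{N} q_{n,1} & \cdots & \sum_{n=1}^{N} q_{n,N} & \sum_{n=1}^{N} q'_{n,N+1} = \sum_{n=1}^{N} q'_{N+1,n}
\end{bmatrix}
\]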
The (N+1)th row and column of Q′ comprise column and
row checksums, respectively, of the same format as those in
A′ and B′. Post-multiplication verification of the checksums
contained within Q′ allows the result obtained to be confirmed
as accurate to within a high degree of confidence. Consider
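\[
Q = AB =
\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}
\times
\begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}
=
\begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}
\]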
which, once input checksumming has been added, becomes
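\[
Q' = A'B' =
\begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 4 & 6 \end{pmatrix}
\times
\begin{pmatrix} 5 & 6 & 11 \\ 7 & 8 & 15 \end{pmatrix}
=
\begin{pmatrix} 19 & 22 & 41 \\ 43 & 50 & 93 \\ 62 & 72 & 134 \end{pmatrix}
\]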
Throughout this paper, A, B and Q will have dimensions
N × N; however, this is merely for simplification and not a
requirement imposed upon the hardware by this method.
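The checksum encoding and verification described above can be sketched in software as follows. This is a minimal illustration only, not the hardware implementation: the function names add_checksums, matmul and verify are our own, and unbounded Python integers stand in for the fixed-width datapath.

```python
def add_checksums(a, b):
    """Return (A', B'): A gains a row of column sums, B gains a column of row sums."""
    a_prime = [row[:] for row in a] + [[sum(col) for col in zip(*a)]]
    b_prime = [row + [sum(row)] for row in b]
    return a_prime, b_prime


def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]


def verify(q_prime):
    """Return (row_ok, col_ok): one flag per row checksum and per column checksum."""
    n_rows = len(q_prime) - 1        # data rows; the last row holds column checksums
    n_cols = len(q_prime[0]) - 1     # data columns; the last column holds row checksums
    row_ok = [sum(row[:n_cols]) == row[n_cols] for row in q_prime]
    col_ok = [sum(q_prime[i][j] for i in range(n_rows)) == q_prime[n_rows][j]
              for j in range(n_cols + 1)]
    return row_ok, col_ok


a_prime, b_prime = add_checksums([[1, 2], [3, 4]], [[5, 6], [7, 8]])
q_prime = matmul(a_prime, b_prime)
print(q_prime)          # [[19, 22, 41], [43, 50, 93], [62, 72, 134]]
print(verify(q_prime))  # every flag is True for a fault-free result
```

The worked example reproduces the Q′ given above; corrupting any single element of q_prime causes at least one row flag and one column flag to become False.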
III. IMPLEMENTATION
A. System Overview
The fault-tolerant system developed, a high-level overview
of which is presented in Figure 1, was implemented wholly
upon a Xilinx Zynq-7000 XC7Z020 system-on-chip (SoC).
Our hardened matrix multiplication accelerator sits on the
programmable logic (PL) portion of the device. A region of
the accelerator—represented by a dashed rectangle—is recon-
figurable. Wrapped around the accelerator are block RAM
(BRAM) and direct memory access (DMA) controllers to
facilitate data transfers between the accelerator and off-chip
dynamic RAM (DRAM). On the SoC’s processor subsystem
(PS) side, a single ARM core is used to trigger memory
transfers, accelerator runs and necessary reconfigurations via
a software driver. The processor is also used for execution
time measurement and result verification. While used here,
an embedded processor is not a requirement: a soft core or
fully customised logic implemented upon any dynamically
reconfigurable FPGA could be used in its place, although each
would require an increase in design effort.
[Block diagram showing: off-chip DRAM; ARM core and DRAM controller; PS and PL regions; interrupt controller; configuration port; AXI4-Lite and AXI4 interfaces; DMA controller; two BRAM controllers; accelerator.]
Fig. 1. System block diagram
B. Matrix Multiplication Datapath
The accelerator’s datapath is shown in Figure 2. At its
heart are N + 1 parallel multiply-accumulators (MACs), one
per column of Q′, allowing matrix multiplication to be
reduced from an O(N^3)-time operation to an O(N^2)-time one.
Wide (DN- and D(N + 1)-bit, for D-bit matrix element data)
BRAMs are used to allow complete matrix row access on a
cycle-by-cycle basis. Checksum generation and verification logic, explained
in Section III-C, and reconfigurable routing, represented by
dashed squares and explained in Section III-E, are added to
harden the datapath.
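The effect of dedicating one MAC to each column of Q′ can be illustrated with a simple behavioural model. The sketch below is an assumption-laden simplification: the name parallel_mac_multiply is hypothetical, and pipeline latency, BRAM timing and checksum logic are all ignored. It only counts accumulation steps in which every MAC advances together.

```python
def parallel_mac_multiply(a_prime, b_prime):
    """Behavioural model: one MAC per column of Q', all stepping in lockstep.

    Returns (Q', steps), where steps counts the lockstep accumulation steps.
    """
    macs = len(b_prime[0])                 # N + 1 MACs, one per column of Q'
    q_prime, steps = [], 0
    for a_row in a_prime:                  # one buffered row of A' at a time
        acc = [0] * macs                   # accumulators cleared for each output row
        for a_elem, b_row in zip(a_row, b_prime):
            # A complete row of B' is consumed per step; every MAC updates at once.
            acc = [s + a_elem * b for s, b in zip(acc, b_row)]
            steps += 1
        q_prime.append(acc)
    return q_prime, steps


q_prime, steps = parallel_mac_multiply([[1, 2], [3, 4], [4, 6]],
                                       [[5, 6, 11], [7, 8, 15]])
print(q_prime, steps)   # Q' from Section II-B, in (N + 1) * N = 6 steps for N = 2
```

Under these assumptions the step count grows as (N + 1)N, which is the O(N^2) behaviour referred to above; a single MAC would need (N + 1)^2 N steps for the same result.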
While data width D, matrix dimensionality N , multiplier
and RAM styles and multiplier and accumulator pipeline
depths are customisable, for this work only N was varied. 32-
bit fixed-point data was used here, along with DSP multipliers
and BRAMs for memories. 15-stage multipliers and single-
stage accumulators were experimentally revealed to be ideal.
C. Error Detection
Before a multiplication begins, rows of B are fetched
in turn in order to precompute their checksums. This is
accomplished by the adder, D-bit register and row checksum
RAM shown in Figure 3. Once complete, the first row of A
is buffered into the DN-bit register, following which rows
of B′ are fed to the MACs for computation. Checksum
generation for A, accomplished by the adder and column checksum
RAM, occurs on-the-fly as its columns are accessed. This
process repeats until all rows of A have been consumed. Once
complete rows of Q′ have been written into the output BRAM,
they are fetched in turn in order to be verified in a similar
manner by the logic shown in Figure 4.
[Datapath diagram showing: input BRAM (2N × DN); checksum generation logic producing A′ (D bits) and B′ (D(N + 1) bits); N + 1 parallel MACs; Q′ (D(N + 1) bits); output BRAM ((N + 1) × D(N + 1)); checksum verification logic.]
Fig. 2. Fault-tolerant datapath
[Logic diagram showing: input BRAM (DN bits); N:1 multiplexer; adder; 2:1 multiplexers; column checksum RAM (N × D); row checksum RAM (N × D); outputs A′ (D bits) and B′ (D(N + 1) bits).]
Fig. 3. Checksum generation logic
[Logic diagram showing: Q′ input (D(N + 1) bits); (N + 1):1 multiplexer; adders and comparators; column checksum RAM ((N + 1) × D); outputs of N + 1 row checksum OK flags and N + 1 column checksum OK flags.]
Fig. 4. Checksum verification logic
D. Fault Location
Had the multiplication in Section II-B resulted in
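\[
Q' =
\begin{pmatrix} \mathbf{21} & 22 & \mathit{41} \\ \mathbf{45} & 50 & \mathit{93} \\ \mathbf{63} & 72 & \mathit{134} \end{pmatrix},
\quad
\begin{pmatrix} 19 & \mathbf{23} & \mathit{41} \\ 43 & \mathbf{51} & \mathit{93} \\ 62 & \mathbf{73} & \mathit{134} \end{pmatrix}
\quad \text{or} \quad
\begin{pmatrix} 19 & 22 & \mathbf{43} \\ 43 & 50 & \mathbf{95} \\ 62 & 72 & \mathbf{135} \end{pmatrix}
\]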
instead, the positioning of incorrect checksum values would
have revealed location information regarding the MACs that
caused the errors. Each of these three cases is consistent
with the least-significant bit of a single MAC’s register
experiencing a stuck-at-one fault. Elements that have been
calculated incorrectly are shown in bold, while italics mark
error-indicating checksum values. Note that column checksum
mismatches relate one-to-one with faulty MACs, since each MAC is
responsible for computing exactly one output column’s elements.
Simultaneous faults occurring either within an individual MAC
or across multiple MACs would yield equally informative
results: a single column checksum mismatch in the former
case and multiple in the latter.
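The location principle can be demonstrated with a small software model. The fault model below is an assumption chosen for illustration: the accumulator register starts at zero and its least-significant output bit reads as one at every step. The helper names are hypothetical. Under that assumption the model reproduces the three faulty Q′ matrices shown above.

```python
A_PRIME = [[1, 2], [3, 4], [4, 6]]
B_PRIME = [[5, 6, 11], [7, 8, 15]]


def mac(a_row, b_col, stuck_mask=0):
    """One MAC: bits in stuck_mask always read as 1 at the register output."""
    acc = 0
    for a, b in zip(a_row, b_col):
        acc = (acc | stuck_mask) + a * b   # faulty read, then accumulate
    return acc | stuck_mask                # faulty read of the final value


def multiply_with_fault(a_prime, b_prime, faulty_col=None, stuck_mask=1):
    """Compute Q' with one MAC per column; MAC faulty_col (0-based) is faulty."""
    cols = list(zip(*b_prime))
    return [[mac(row, col, stuck_mask if j == faulty_col else 0)
             for j, col in enumerate(cols)] for row in a_prime]


def locate_faulty_macs(q_prime):
    """Return the 0-based columns whose column checksum does not match."""
    n = len(q_prime) - 1
    return [j for j in range(len(q_prime[0]))
            if sum(q_prime[i][j] for i in range(n)) != q_prime[n][j]]


for col in range(3):
    q = multiply_with_fault(A_PRIME, B_PRIME, faulty_col=col)
    print(q, locate_faulty_macs(q))
```

In each case the only mismatching column checksum is that of the faulty MAC, whereas row checksum mismatches appear in every row and therefore carry no location information.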
E. Fault Avoidance
While our previously published matrix multiplication accel-
erator made use of additional logic to dynamically reallocate
data to the MACs in order to avoid faults, here we achieve the
same end result with partial routing reconfiguration. During
accelerator runs in which at least one error is detected, fault
location data is sent back to the controlling driver in order to
facilitate corrective action. Based upon the locations of faults
observed and, conversely, the locations of remaining healthy
MACs, one or more rounds of routing reconfiguration followed
by multiplier reruns, together called ‘corrective runs,’ can be
performed in order to re-establish accurate operation.
At compile-time, routing configurations representing differ-
ent amounts of data-shifting are synthesised along with the rest
of the design, which remains static. Nets B′ and Q′—shown
in Figure 2—are broken and routed via a single reconfigurable
partition, which then dictates the data connections on both the
input and output sides of the accelerator’s datapath. Figure
5 shows the configurations available for the multiplier when
N = 2. Circled numbers represent MACs, while the input
and output halves of the reconfigurable partition are shown as
dashed rectangles. In each case, the output shifting arrange-
ment mirrors that on the input side.
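[Diagram with three panels (0-place shift, 1-place shift, 2-place shift), each showing MACs 1 to 3 and the input and output routing of the reconfigurable partition.]
Fig. 5. Routing configurations available when N = 2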
When routing reconfiguration is required, the driver initi-
ates a partial bitstream transfer, via the processor configuration
access port, from DRAM to the FPGA fabric. In order to lower
the total number of configurations required, only configurations
with equal-place shifting per MAC are generated. The number
of configurations stored for each accelerator is therefore N+1.
Our driver currently supports two levels of safety for error
correction. When operating in the safer mode, all incorrectly
computed columns of Q′ are recalculated, after which check-
sum verification is repeated to confirm successful correction. In
the less safe mode, the ABFT mechanism is essentially turned
off: the (N+1)th MAC becomes a usable spare and the output
is assumed to be accurate once corrective runs complete. As a
consequence of this, faults that affect only the (N+1)th MAC
are ignored in the less safe mode. The choice made between
these modes in a real application would be based upon the
likelihood of additional faults developing in different MACs
during the time it takes to complete a corrective run. Note that
re-transfer of input data and input checksum regeneration are
not required in either mode.
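The difference between the two modes, in terms of which columns of Q′ must be recomputed, can be summarised by the following sketch. It is our own paraphrase of the behaviour described above rather than an excerpt from the driver; the function name and the 0-based indexing (column N being the checksum column) are assumptions.

```python
def columns_to_recompute(n, faulty_macs, safer=True):
    """Columns of Q' (0-based; column n is the checksum column) needing a corrective run."""
    faulty = set(faulty_macs)
    if safer:
        # Safer mode: every incorrectly computed column is recalculated so that
        # checksum verification can be repeated afterwards.
        return sorted(faulty)
    # Less safe mode: the checksum column is not needed, so a fault in the
    # (N+1)th MAC alone requires no corrective run.
    return sorted(faulty - {n})


print(columns_to_recompute(2, {2}, safer=True))    # [2]
print(columns_to_recompute(2, {2}, safer=False))   # []
```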
Figure 6 demonstrates the application of routing reconfig-
uration in order to avoid a single faulty MAC—labelled 2—
when N = 2. Intuitively, one corrective run is required to
overwrite the second column’s elements. A single-place shift
allows MAC 3 to perform the recalculation required.
[Diagram with two panels (Step 1, Step 2), each showing MACs 1 to 3.]
Fig. 6. Steps for avoidance of a single faulty MAC when N = 2
In cases of multiple faults, differing amounts of data-
shifting are required. This is exemplified in Figure 7, in which
six different combinations of double-fault locations are shown
for N = 4. In the three leftmost cases, one single-place shift is
required, while in the three rightmost cases, one double-place
shift is required. Curved arrows represent the reallocation of
resources necessary during a corrective run.
[Diagram with six panels, each showing MACs 1 to 5: three labelled "1-place shift required" and three labelled "2-place shift required".]
Fig. 7. Example double-fault locations requiring differing routing
reconfigurations when N = 4
Intuition may suggest that the number of corrective runs
required is only dependent upon the ratio of faulty to healthy
MACs. When N ∈ {2, 4}, this is indeed true, but for N ≥
8 the situation is more complicated since there are cases in
which a configuration with an equal-place shift per MAC can
no longer match all faulty MACs to remaining healthy ones.
In Figure 8, where N = 8, six combinations of quadruple-
fault locations are shown. In the three leftmost cases, only a
single corrective run is required; in the three rightmost cases,
however, two are needed: resource reallocations which cannot
be performed in the first run are represented by dashed lines.
[Diagram with six panels, each showing MACs 1 to 9: three labelled "1 corrective run needed" and three labelled "2 corrective runs needed".]
Fig. 8. Example quadruple-fault locations requiring differing numbers of
routing reconfigurations when N = 8
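One way of choosing shifts for corrective runs is sketched below. Several aspects are assumptions made for illustration and are not taken from the driver: shifts are modelled as circular over the N + 1 MACs, so that an s-place shift sends column c to MAC (c + s) mod (N + 1); MACs are indexed from zero; and a greedy rule picks, for each run, the shift that repairs the most outstanding columns.

```python
def plan_corrective_runs(n, faulty_macs):
    """Return a list of (shift, repaired_columns) pairs, one per corrective run."""
    macs = n + 1
    faulty = set(faulty_macs)
    if len(faulty) >= macs:
        raise ValueError("no healthy MAC remains")
    pending = set(faulty)            # columns still holding incorrect results
    plan = []
    while pending:
        best_shift, best_fixed = None, set()
        for shift in range(1, macs):
            # Columns that this shift would route to a healthy MAC.
            fixed = {c for c in pending if (c + shift) % macs not in faulty}
            if len(fixed) > len(best_fixed):
                best_shift, best_fixed = shift, fixed
        plan.append((best_shift, sorted(best_fixed)))
        pending -= best_fixed
    return plan


print(plan_corrective_runs(2, {1}))            # single fault, N = 2
print(plan_corrective_runs(4, {0, 2}))         # double fault, N = 4
print(plan_corrective_runs(8, {0, 1, 3, 6}))   # quadruple fault, N = 8
```

Under these assumptions, every double fault for N = 4 is repaired in one run, whereas the quadruple fault {0, 1, 3, 6} for N = 8 needs two: every non-zero shift maps at least one faulty column onto another faulty MAC.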
IV. EXPERIMENTS AND RESULTS
A. Hardware Overheads
All designs were synthesised with version 14.7 of Xil-
inx’s ISE toolchain. Table II contains the raw resource usage
figures obtained for all implementations—including the fault-
intolerant, fault-tolerant via additional logic from our previous
work and fault-tolerant via DPR versions of the accelerator—
per resource type. Percentages of the total number of each of
these resources for the target device are also included, along
with a mean of the individual proportions to give an indication
of the overall resource utilisation. The timing model fmax of
each design is also given. Figure 9 presents a visual summary
of the combined resource usage data.
TABLE II. RAW RESOURCE USAGE AND fmax FOR VARYING N
N   Fault avoidance  Registers      LUTs           BRAMs        DSPs         Total resources  fmax (MHz)
2   None             210 (0.197%)   239 (0.449%)   2 (0.714%)   6 (2.73%)    1.02%            90.204
2   Logic¹           597 (0.561%)   665 (1.25%)    5 (1.79%)    9 (4.09%)    1.92%            88.992
2   DPR²             406 (0.382%)   711 (1.34%)    5 (1.79%)    9 (4.09%)    1.90%            95.712
4   None             406 (0.382%)   441 (0.829%)   8 (2.86%)    12 (5.45%)   2.38%            80.103
4   Logic            945 (0.888%)   855 (1.61%)    11 (3.93%)   15 (6.82%)   3.31%            88.168
4   DPR              620 (0.583%)   898 (1.69%)    11 (3.93%)   15 (6.82%)   3.25%            82.102
8   None             794 (0.746%)   604 (1.14%)    16 (5.71%)   24 (10.9%)   4.63%            76.941
8   Logic            1625 (1.53%)   2037 (3.83%)   19 (6.79%)   27 (12.3%)   6.10%            77.042
8   DPR              1036 (0.974%)  1375 (2.58%)   19 (6.79%)   27 (12.3%)   5.65%            85.918
16  None             1566 (1.47%)   613 (1.15%)    30 (10.7%)   48 (21.8%)   8.79%            50.495
16  Logic            2970 (2.79%)   1674 (3.15%)   33 (11.8%)   51 (23.2%)   10.2%            56.850
16  DPR              1874 (1.76%)   2273 (4.27%)   33 (11.8%)   51 (23.2%)   10.3%            52.062
32  None             3105 (2.92%)   2115 (3.98%)   58 (20.7%)   96 (43.6%)   17.8%            58.156
32  Logic            5643 (5.30%)   4203 (7.90%)   61 (21.8%)   99 (45.0%)   20.0%            55.857
32  DPR              3675 (3.45%)   4363 (8.20%)   61 (21.8%)   99 (45.0%)   19.6%            53.101
[Plot of combined resource usage: x-axis N (2, 4, 8, 16, 32); y-axis % increase; series Logic and DPR.]
Fig. 9. Combined resource usage overhead for varying N versus fault-
intolerant hardware
With the exception of N = 16, it is clear from Figure
9 that the DPR-shifting accelerator betters its logic-shifting
counterpart for overall resource usage across the range of N
tested. Since BRAM and DSP usage are identical between
the two versions, register and LUT counts are accountable
for all differences in utilisation. As expected, LUT overhead
for the DPR design decreases proportionally as N increases
thanks to the elimination of the circular shifters present in the
logic-shifting version. Conversely, register overhead tends to
increase slightly; this is due to pipelining registers inserted
into the lengthened paths that pass through the reconfigurable
region, the number of which increases linearly with N. For our
largest-tested design, N = 32, the DPR-shifting accelerator
achieved an overall area overhead of 10.1%, which is 17.7% lower
than that of its logic-shifting equivalent. Between these two
fault-tolerant designs, LUT overhead decreased by 77.5% while
register overhead increased by 7.7%. fmax changes are not
well correlated, likely due to the stochastic nature of the
placement and routing tools used, although decreases between
the logic- and DPR-shifting designs due to path-lengthening
are seen for larger N.
¹ Previous approach [11]
² Proposed approach
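The combined overheads plotted in Figure 9 follow directly from the "Total resources" column of Table II. The short calculation below shows how; the names used are ours, and because the tabulated values are rounded to three significant figures, derived percentages may differ in the final digit from those quoted in the text.

```python
# N: (fault-intolerant, logic-shifting, DPR-shifting) mean utilisation in %, from Table II
TOTAL_RESOURCES = {
    2: (1.02, 1.92, 1.90),
    4: (2.38, 3.31, 3.25),
    8: (4.63, 6.10, 5.65),
    16: (8.79, 10.2, 10.3),
    32: (17.8, 20.0, 19.6),
}


def overhead(base, protected):
    """Percentage increase of a protected design over the fault-intolerant one."""
    return 100.0 * (protected - base) / base


for n, (base, logic, dpr) in TOTAL_RESOURCES.items():
    print(f"N={n:2d}  logic +{overhead(base, logic):.1f}%  DPR +{overhead(base, dpr):.1f}%")
```

For N = 32 this gives a DPR overhead of approximately 10.1%, and N = 16 is the only size for which the DPR design uses more resources than the logic-shifting one.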
B. Software Overheads
From a software perspective, the primary overhead of
our fault tolerance strategy is partial bitstream storage. Since
accelerator data and bitstream transfers, as well as accelerator
runs, are interrupt-driven, their hits on CPU performance are
negligible. Table III summarises the DRAM storage require-
ments for each value of N tested. The size of each partial
bitstream is given along with the total storage requirement for
that value of N . The memory occupation is also expressed,
for each N , as a proportion of the DRAM available (512MB)
on the development board used.
TABLE III. BITSTREAM STORAGE REQUIREMENTS FOR VARYING N
N    Bitstream size, each (kB)   Bitstream size, total (kB)   Proportion of DRAM
2    15.2                        45.7                         0.00871%
4    29.4                        147                          0.0281%
8    43.6                        393                          0.0749%
16   87.1                        1480                         0.282%
32   158                         5220                         0.995%
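The totals in Table III are consistent with N + 1 partial bitstreams being stored per accelerator, as the following check illustrates. The calculation assumes 1 kB = 1024 bytes and 512 MB = 512 × 1024 kB; since the per-bitstream sizes are rounded, the computed totals agree with the table only to within rounding.

```python
EACH_KB = {2: 15.2, 4: 29.4, 8: 43.6, 16: 87.1, 32: 158}   # from Table III
DRAM_KB = 512 * 1024                                        # 512 MB of DRAM, in kB


def storage(n):
    """Return (total kB, percentage of DRAM) for the N + 1 stored configurations."""
    total_kb = (n + 1) * EACH_KB[n]
    return total_kb, 100.0 * total_kb / DRAM_KB


for n in EACH_KB:
    total_kb, percent = storage(n)
    print(f"N={n:2d}  total = {total_kb:.1f} kB  ({percent:.3g}% of DRAM)")
```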
C. Performance Impacts
Testing was performed on the hardware in order to measure
its impact upon performance under a number of conditions.
Table IV summarises the results of all of these performance
tests. Each test was completed 10,000 times; the mean of
these runs is given in all cases. Prior to each test, uniformly
distributed random input data was generated to form A and B.
Execution times were measured using a cycle-accurate ARM
timer peripheral. In all cases, the FPGA fabric was clocked
at 50MHz. Included in Table IV are execution times for the
fault-intolerant multiplier, the fault-tolerant via data-shifting
accelerator and our DPR-enabled version running in both of its
operating modes. Where appropriate, latency increases relative
to the equivalently sized fault-intolerant design are given for
comparison. Execution times are given for the occurrences
of single and double MAC failures: the former for both
the data-shifting and DPR hardware, and the latter for the
DPR version only. Permanent faults were emulated through the
targeted inversion of a single accumulator output bit within
either one or two MACs per run, with fault locations also
randomly chosen. Plots of the latency increases over the fault-
intolerant hardware under fault-free, singly and doubly faulty
conditions are given in Figure 10.
TABLE IV. RAW PERFORMANCE FIGURES FOR VARYING N (EXECUTION TIME IN µs)
N   Fault avoidance   Fault-free       Single failure   Double failure
2   None              254              –                –
2   Logic             272 (+7.09%)     300 (+18.1%)     –
2   DPR (less safe)   280 (+10.2%)     366 (+44.1%)     451 (+77.6%)
2   DPR (more safe)   280 (+10.2%)     392 (+54.3%)     477 (+87.8%)
4   None              314              –                –
4   Logic             339 (+7.96%)     398 (+26.8%)     –
4   DPR (less safe)   351 (+11.8%)     448 (+42.7%)     544 (+73.2%)
4   DPR (more safe)   351 (+11.8%)     486 (+54.8%)     581 (+85.0%)
8   None              348              –                –
8   Logic             546 (+56.9%)     712 (+105%)      –
8   DPR (less safe)   422 (+21.3%)     557 (+60.1%)     690 (+98.3%)
8   DPR (more safe)   422 (+21.3%)     631 (+81.3%)     764 (+120%)
16  None              497              –                –
16  Logic             1350 (+172%)     1910 (+284%)     –
16  DPR (less safe)   710 (+42.9%)     982 (+97.6%)     1254 (+152%)
16  DPR (more safe)   710 (+42.9%)     1195 (+140%)     1467 (+195%)
32  None              3100             –                –
32  Logic             4510 (+45.5%)    6600 (+113%)     –
32  DPR (less safe)   3860 (+24.5%)    4680 (+51.0%)    5490 (+77.1%)
32  DPR (more safe)   3860 (+24.5%)    5440 (+75.5%)    6260 (+102%)
The results show that, for all N > 4, the DPR-shifting
accelerator outperforms the logic-shifting version under
normal, fault-free operation as well as in the presence of
a single failure. The lower penalties seen during fault-free
operation are due to the moving of the checksum verification
logic from the input to the output side of the output BRAM,
allowing return data transfers to start (and end) sooner than
they had previously, while gains under single failure mode are
realised for larger N as reconfiguration times proportionately
fall. The relationship between the performance plots for the
DPR-shifting version working in its two modes demonstrates
the near-fixed performance cost paid by operating more safely.
The trend-reversal seen on all plots after N = 16 can be
attributed to data transfer throttling: once N passes 16, memory
copies begin to dominate accelerator execution for proportional
runtime. Performance impacts arising from the use of partial
reconfiguration are negligible due to the bitstreams’ small size
and infrequent application per accelerator run. For our
largest-tested design, N = 32, the DPR-shifting accelerator incurred
a 24.5% latency penalty under fault-free operation, which is 46.1%
lower than that of its logic-shifting equivalent.
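The headline latency figures can be re-derived from the fault-free column of Table IV, as shown in the short calculation below. The dictionary and function names are our own; the execution times are the tabulated means in microseconds.

```python
# N: (fault-intolerant, logic-shifting, DPR-shifting) fault-free execution time in µs
FAULT_FREE_US = {
    2: (254, 272, 280),
    4: (314, 339, 351),
    8: (348, 546, 422),
    16: (497, 1350, 710),
    32: (3100, 4510, 3860),
}


def penalty(base, protected):
    """Percentage latency increase over the fault-intolerant design."""
    return 100.0 * (protected - base) / base


base, logic, dpr = FAULT_FREE_US[32]
dpr_penalty = penalty(base, dpr)
logic_penalty = penalty(base, logic)
reduction = 100.0 * (1 - dpr_penalty / logic_penalty)
print(f"DPR +{dpr_penalty:.1f}%, logic +{logic_penalty:.1f}%, reduction {reduction:.1f}%")

# Sizes at which the DPR design is faster than the logic-shifting one when fault-free
print([n for n, (_, l, d) in FAULT_FREE_US.items() if d < l])
```

For N = 32 this yields penalties of 24.5% and 45.5% respectively, a relative reduction of 46.1%, and confirms that the DPR design is the faster of the two for N = 8, 16 and 32.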
D. Fault Observability
In order to assess the fault observability of our chosen
detection method, fault injection simulations were performed
to ascertain the hardware’s ability to correctly detect and locate
the faults. Detectable faults are those that result in one or
more checksum mismatches—in one or more rows, columns or
both—while those that are locatable cause mismatches within
the columns corresponding to the MACs they have affected.
For each combination of N and data width ∈
{2, 4, 8, 16, 32}, the following test steps were completed
400,000 times for both single and double faulty MAC em-
ulation. The results of this testing are presented in Figure 11.
1) Uniformly distributed random input data was gener-
ated for each element of A and B, with checksums
calculated to form A′ and B′ .
2) During each accumulation step of the subsequent
multiplication, a single bit in either one (for single
fault injection) or two (for double) columns, also
randomly selected, was held high to emulate either
one or two stuck-at-one MAC register output bits.
3) Once complete, output checksums were verified:
where all were found to be correct, the fault was
recorded as undetected; where either all (for single
faults) or all but one (for double) column checksums
were found to be correct, it was unlocated.
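The three steps above can be approximated in software as follows. This is a rough sketch rather than the simulation actually used: it assumes wrap-around (modulo 2^width) arithmetic throughout, models the fault as a register output bit that always reads as one, and uses hypothetical function names. Its output is therefore indicative only and should not be expected to reproduce Figure 11 exactly.

```python
import random


def inject_trial(n, width, rng, faults=1):
    """Run one trial; return (detected, located) for `faults` faulty MACs."""
    mask = (1 << width) - 1
    a = [[rng.randrange(1 << width) for _ in range(n)] for _ in range(n)]
    b = [[rng.randrange(1 << width) for _ in range(n)] for _ in range(n)]
    a_p = a + [[sum(col) & mask for col in zip(*a)]]        # step 1: form A'
    b_p = [row + [sum(row) & mask] for row in b]            # step 1: form B'
    # Step 2: pick the faulty columns (MACs) and the stuck bit in each.
    stuck = {col: 1 << rng.randrange(width)
             for col in rng.sample(range(n + 1), faults)}
    q = []
    for a_row in a_p:
        out = []
        for j in range(n + 1):
            bit = stuck.get(j, 0)
            acc = 0
            for k in range(n):
                acc = ((acc | bit) + a_row[k] * b_p[k][j]) & mask
            out.append(acc | bit)
        q.append(out)
    # Step 3: verify row and column checksums.
    row_bad = [i for i in range(n + 1) if sum(q[i][:n]) & mask != q[i][n]]
    col_bad = [j for j in range(n + 1)
               if sum(q[i][j] for i in range(n)) & mask != q[n][j]]
    detected = bool(row_bad or col_bad)
    located = set(col_bad) == set(stuck)   # every faulty MAC's column mismatches
    return detected, located


def observability(n, width, trials, faults=1, seed=0):
    """Return (undetected fraction, unlocated fraction) over `trials` runs."""
    rng = random.Random(seed)
    undetected = unlocated = 0
    for _ in range(trials):
        detected, located = inject_trial(n, width, rng, faults)
        undetected += not detected
        unlocated += not located
    return undetected / trials, unlocated / trials


print(observability(n=4, width=8, trials=1000))
```

Because a healthy MAC can never cause its own column checksum to mismatch, an undetected fault is necessarily also unlocated, so the second fraction is never smaller than the first.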
[Three plots of latency increase (%) against N (2, 4, 8, 16, 32). Fault-free: series Logic, DPR (less safe), DPR (more safe). Single failure: series Logic, DPR (less safe), DPR (more safe). Double failure: series DPR (less safe), DPR (more safe).]
Fig. 10. Performance impact for varying N versus fault-intolerant hardware
[Four plots of percentage against data width (2, 4, 8, 16, 32 bits), each with series N = 2, 4, 8, 16 and 32: undetected single faults, unlocated single faults, undetected double faults, unlocated double faults.]
Fig. 11. Fault observability for varying N, data width
For single fault injection, in all cases except for N = 2,
data width = 2, the proportion of undetected faults dropped
off with both N and data width. For N ≥ 16, undetectable
fault proportions fell below 0.1% for all data widths and, for
N = 32, undetectable faults ceased to be encountered. As
expected, proportions of unlocatable faults were higher than
those that were undetectable due to the lack of redundancy in
checksums used for location. In all cases except for N = 2,
data width = 2, however, the proportion of unlocated faults
observed dropped with increasing data width for each value
of N. For larger N, the locatability of faults is largely
independent of N itself.
Largely similar trends were seen for double fault injection
testing. The rates of both undetected and unlocated faults
were all lower, however, for each combination of N and data
width. This is expected of undetected faults since the likelihood
of errors being masked in multiple columns simultaneously
decreases as the number of affected columns increases. The
proportions of unlocatable double faults encountered were
again significantly higher than those which were undetectable
but, for all cases except for N = 2, data width = 2, dropped
off with increasing data width for all N .
V. CONCLUSION
In this paper, we explored the combined application of
algorithm-based fault tolerance and dynamic partial reconfigu-
ration to a benchmark hardware accelerator—a parallel matrix
multiplier—with the goal of achieving a robust, low-overhead
design capable of detecting faults within itself and taking
corrective action in order to re-establish accurate operation us-
ing runtime routing reconfiguration. Our largest implemented
ABFT- and DPR-protected design, for the multiplication of
two 32×32 matrices, was found to consume 10.1% more area
and incur a 24.5% execution time penalty over its equivalent,
unprotected design during fault-free operation.
Our future work will focus upon refinements to the
accelerator design, including facilitating more comprehensive
pipelining and the application of error detection at lower levels
of precision, as well as the generalisation of techniques
developed here to allow the protection of other operators.
Efforts will also be focussed upon methods of finer-grained fault
detection, likely to be performed during periods of temporary
downtime on known-faulty functional units, as well as the
exploration of the potential application of remote compilation
for fault correction.
REFERENCES
[1] E. Stott, P. Sedcole, and P. Y. K. Cheung, “Fault tolerance and relia-
bility in field-programmable gate arrays,” IET Computers and Digital
Techniques, vol. 4, no. 3, 2010.
[2] M. Abramovici, C. Stroud, C. Hamilton, S. Wijesuriya, and V. Verma,
“Using roving STARs for on-line testing and diagnosis of FPGAs in
fault-tolerant applications,” in International Test Conference, 1999.
[3] S. Dutt, V. Verma, and V. Suthar, “Built-in-self-test of FPGAs with
provable diagnosabilities and high diagnostic coverage with application
to online testing,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 27, no. 2, 2008.
[4] V. Suthar and S. Dutt, “Efficient on-line interconnect testing in FPGAs
with provable detectability for multiple faults,” in Design, Automation
and Test in Europe (DATE), vol. 1, 2006.
[5] J. Lach, W. H. Mangione-Smith, and M. Potkonjak, “Low overhead
fault-tolerant FPGA systems,” IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 6, no. 2, 1998.
[6] F. Hanchek and S. Dutt, “Node-covering based defect and fault tolerance
methods for increased yield in FPGAs,” in International Conference on
VLSI Design, 1996.
[7] V. Lakamraju and R. Tessier, “Tolerating operational faults in cluster-
based FPGAs,” in International Symposium on Field Programmable
Gate Arrays (FPGA), 2000.
[8] K.-H. Huang and J. A. Abraham, “Algorithm-based fault tolerance for
matrix operations,” IEEE Transactions on Computers, vol. C-33, no. 6,
1984.
[9] S.-J. Wang and N. K. Jha, “Algorithm-based fault tolerance for FFT
networks,” IEEE Transactions on Computers, vol. 43, no. 7, 1994.
[10] A. Jacobs, G. Cieslewski, and A. D. George, “Overhead and reliability
analysis of algorithm-based fault tolerance in FPGA systems,” in In-
ternational Conference on Field Programmable Logic and Applications
(FPL), 2012.
[11] J. J. Davis and P. Y. K. Cheung, “Datapath fault tolerance for paral-
lel accelerators,” in International Conference on Field-Programmable
Technology (FPT), 2013.
