Reduced-precision Algorithm-based Fault Tolerance for FPGA-implemented Accelerators by Davis, JJ & Cheung, PYK
James J. Davis and Peter Y. K. Cheung
Imperial College London, London, SW7 2AZ, United Kingdom
{james.davis06, p.cheung}@imperial.ac.uk
Abstract. As the threat of fault susceptibility caused by mechanisms including variation and degradation increases, engineers must give growing consideration to error detection and correction. While common fault tolerance strategies frequently incur significant overheads in area, performance and/or power consumption, options exist that buck these trends. In particular, algorithm-based fault tolerance embodies a proven family of low-overhead error mitigation techniques that can be built upon to create self-verifying circuitry.
In this paper, we present our research into the application of algorithm-based fault tolerance (ABFT) in FPGA-implemented accelerators at reduced levels of precision. This allows for the introduction of a previously unexplored tradeoff: sacrificing the observability of faults associated with low-magnitude errors for gains in area, performance and efficiency by reducing the bit-widths of logic used for error detection. We describe the implementation of a novel checksum truncation technique, analysing its effects upon overheads and allowed error. Our findings include that bit-width reduction of ABFT circuitry within a fault-tolerant accelerator used for multiplying pairs of 32 × 32 matrices reduced the incurred area overhead by 16.7% and recovered 8.27% of timing model fmax. These came at the cost of introducing average and maximum absolute output errors of 0.430% and 0.927%, respectively, of the maximum absolute output value under transient fault injection.
1 Introduction
ABFT relies upon the augmentation of data with additional information (checksums formed from that data) to provide post-operation verification of results with low overheads compared to alternatives including modular redundancy. While previous fixed-point ABFT-related work has assumed all data and checksums to be n-bit integers (i.e. modulo-$2^n$), it is possible to break this relationship and consider data and checksum precision independently. By making informed decisions regarding exactly which information to discard when forming and manipulating checksums, the incurred overheads can be reduced at the cost of tolerating some data error. The methods introduced in this work lend themselves to FPGAs thanks to their efficient simultaneous implementation of multiple arbitrary-precision datapaths. Here, as a case study for the investigation into reduced-precision ABFT, we use hardware-accelerated matrix multiplication: a benchmark for which the ABFT operation is straightforward yet that is representative of commonly hardware-accelerated operations.
The novel contributions of this work are: (1) the first consideration of distinct data and checksum bit-widths within ABFT-protected operations, which we call reduced-precision or RP-ABFT, (2) an implementation of circuitry incorporating RP-ABFT for resilience against hardware faults, (3) analysis of the costs and benefits of applying RP-ABFT at various levels of precision and (4) insight into the fault tolerability of RP-ABFT.
2 Application-level Fault Tolerance
Tailoring fault tolerance to particular applications can facilitate drastic overhead
reduction versus general-purpose methods. ABFT represents a methodology for
achieving such reduction while maintaining high fault detectability. A subclass of
linear algebra operations exists that can be protected by ABFT; amongst these
are matrix operations (multiplication, addition, LU decomposition, etc.) [4] and
Fourier transformations [7]. Linear filtering operations can be protected when
considered in state-space form [4]. These operations are also highly suited to
hardware acceleration thanks to their inherent parallelism. Beyond low area
overhead, ABFT has two further key advantages: (1) its application requires no
fundamental changes to the datapaths used for performing mathematical operations
and (2) output and error-indicating data are produced simultaneously.
While ABFT has traditionally been used to protect fixed-point operations,
the methods are compatible with floating-point arithmetic as well. Of particular
relevance to this work are the errors introduced by floating-point operations,
which necessitate error bounding to distinguish them from those caused by other
mechanisms [6]. Recent work [1] sought to lower the required bounds in a GPU-
accelerated floating-point benchmark by analysing input data a priori.
ABFT-protected accelerators implemented in FPGAs have been the focus of
several recent publications. Jacobs et al. implemented algorithmic protection of
several matrix multiplication architectures [5]. ABFT was called upon for error
detection of the same operator in more recent work, with resource reallocation
performed using additional logic [2] and dynamic partial reconfiguration [3] in
order to avoid faulty components at runtime. Area overheads for accelerators
capable of multiplying pairs of 32 × 32 matrices were found to be 17.3% and
10.1%, respectively, therein. While error correction is not the focus of this work,
previously published fault avoidance strategies [2] [3] are directly compatible.
3 Principles of ABFT
The mechanics of ABFT checksumming are described here [4]. Any $m \times n$ data matrix $D$ can be supplemented with an additional row of column-wise checksums to produce an $(m+1) \times n$ column checksum-encoded matrix $D_c$. The transformation, achieved with generation matrix $G_c$, is

$$D_c = G_c D = \begin{pmatrix} I_{m \times m} \\ 1_{1 \times m} \end{pmatrix} D.$$

Note that $I$ is the identity matrix and $1$ a vector of ones. Similarly, row-wise checksums can be added within an additional column to form an $m \times (n+1)$ row checksum-encoded matrix $D_r$ with generation matrix $G_r$ by performing

$$D_r = D G_r = D \begin{pmatrix} I_{n \times n} & 1_{n \times 1} \end{pmatrix}.$$

An $(m+1) \times (n+1)$ full checksum-encoded matrix $D_f$ can be produced by performing $D_f = G_c D G_r$.
Following storage, transmission or computation that preserves the form of checksum-encoded matrices, data integrity can be verified by producing a discrepancy vector $\delta$. Column- and row-wise discrepancy vectors can be produced, using verification vectors $v_c$ and $v_r$, by performing

$$\delta_c = v_c D_c = \begin{pmatrix} 1_{1 \times m} & -1 \end{pmatrix} D_c \quad \text{and} \quad \delta_r = D_r v_r = D_r \begin{pmatrix} 1_{n \times 1} \\ -1 \end{pmatrix}.$$

Non-zero elements indicate the presence, locations and magnitudes of errors within checksum-encoded matrices. A full checksum-encoded matrix $D_f$ can be verified by independently producing both $\delta_c$ and $\delta_r$.
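As a minimal software sketch of the encoding and verification just described (not the hardware implementation), the following Python models matrices as lists of rows. The function names are illustrative assumptions rather than anything defined in this paper, and the 2 × 3 example data is arbitrary.

```python
def column_encode(d):
    """Append a row of column-wise checksums: D_c = G_c D."""
    return [row[:] for row in d] + [[sum(col) for col in zip(*d)]]


def row_encode(d):
    """Append a column of row-wise checksums: D_r = D G_r."""
    return [row[:] + [sum(row)] for row in d]


def full_encode(d):
    """Apply both encodings: D_f = G_c D G_r."""
    return row_encode(column_encode(d))


def column_discrepancy(dc):
    """delta_c = v_c D_c: sum of the data rows minus the checksum row."""
    return [sum(col[:-1]) - col[-1] for col in zip(*dc)]


def row_discrepancy(dr):
    """delta_r = D_r v_r: sum of the data columns minus the checksum column."""
    return [sum(row[:-1]) - row[-1] for row in dr]


d = [[1, 2, 3], [4, 5, 6]]       # arbitrary 2 x 3 example data (assumed)
df = full_encode(d)
clean = (column_discrepancy(df), row_discrepancy(df))

df[1][2] += 7                    # emulate a single corrupted data element
faulty = (column_discrepancy(df), row_discrepancy(df))
```

In this sketch, the non-zero entries of the two discrepancy vectors share the column and row indices of the corrupted element, and their value equals the injected error.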
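Consider $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$. For simplicity and clarity, we assume the data matrices used to always be square with dimensions $s \times s$, although this is not a requirement. Since matrix multiplication is a checksum-preserving operation, $C = AB$ can be protected by forming $A_c$ and $B_r$ as explained above and then performing $C_f = A_c B_r$. The transformations and subsequent computation are shown in (1). Column-wise checksums occupy the final row and row-wise checksums the final column; the result's corner element is both a column- and a row-wise checksum. Note that the data present in (1)'s unprotected result is preserved in its protected result. The protected result can be verified by calculating discrepancy vectors $\delta_c$ and $\delta_r$, as shown in (2).

$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix} \rightarrow \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 4 & 6 \end{pmatrix} \begin{pmatrix} 5 & 6 & 11 \\ 7 & 8 & 15 \end{pmatrix} = \begin{pmatrix} 19 & 22 & 41 \\ 43 & 50 & 93 \\ 62 & 72 & 134 \end{pmatrix}. \tag{1}$$

$$\begin{pmatrix} 1 & 1 & -1 \end{pmatrix} \begin{pmatrix} 19 & 22 & 41 \\ 43 & 50 & 93 \\ 62 & 72 & 134 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 \end{pmatrix}, \qquad \begin{pmatrix} 19 & 22 & 41 \\ 43 & 50 & 93 \\ 62 & 72 & 134 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ -1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}. \tag{2}$$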
4 Principles of RP-ABFT
To reduce overheads while maintaining sensitivity to faults that cause high-
magnitude errors, truncation can be performed from the least significant bits
(LSBs) of data elements ‘upwards’ when forming and manipulating checksums.
In this paper, all input data elements are n-bit signed integers and we call the number of bits of precision removed from each during checksum generation the truncation width, represented by r. Output data elements are always 2n-bit. We label input and output data elements within ABFT-protected matrix multiplication as $d_{in}$ and $d_{out}$, respectively. $cs_{in}$ and $cs_{out}$ are input and output checksums, while corner checksum $cs_{out,c}$ is special, being formed exclusively from $cs_{in}$ elements. We use $\vee(\cdot)$ to represent maximum absolute value, while $\epsilon(\cdot)$ is the maximum absolute error introduced by truncation.
$\vee(d_{in})$ is $2^{n-1}$. The r-bit truncation of a $d_{in}$ element, performed with bitwise shifts as $(d_{in} \gg r) \ll r$, is represented as $\lfloor d_{in} \rfloor_r$ since rounding, for both positive and negative values, is towards negative infinity. Note that $\vee(\lfloor d_{in} \rfloor_r) = \vee(d_{in})$; the maximum negative value, for which truncation by any $0 \le r < n$ will have no effect, also represents the maximum absolute value. $\epsilon(\lfloor d_{in} \rfloor_r) = 2^r - 1$. Each input checksum element, $cs_{in}$, is formed from s independently truncated $d_{in}$ elements. $\vee(cs_{in})$ and $\epsilon(cs_{in})$ are therefore simply $s 2^{n-1}$ and $s(2^r - 1)$, respectively.
Output checksum elements, $cs_{out}$, are composed of s multiplied pairs of $d_{in}$ and $cs_{in}$. Since the $d_{in}$ element used within each multiplication is not truncated, it does not introduce error: this comes purely from each $cs_{in}$, so $\epsilon(cs_{out}) = s \vee(d_{in}) \epsilon(cs_{in}) = s^2 (2^r - 1) 2^{n-1}$. The corner output checksum element, $cs_{out,c}$, is formed of s multiplied pairs of $cs_{in}$ elements. Unlike for each $cs_{out}$ element, therefore, error can be introduced by both of the multiplicands within each product. Consequently,

$$\epsilon(cs_{out,c}) = s\left(\vee(cs_{in})\epsilon(cs_{in}) + \epsilon(cs_{in})\vee(cs_{in}) + \epsilon(cs_{in})^2\right) = s^3 (2^r - 1)(2^n + 2^r - 1).$$
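For readers wishing to follow the algebra, the corner bound can be expanded step by step by substituting $\vee(cs_{in}) = s2^{n-1}$ and $\epsilon(cs_{in}) = s(2^r - 1)$ from above. The numerical instance uses n = 32, r = 8 and s = 32 purely as an illustration.

```latex
\begin{align*}
\epsilon(cs_{out,c})
  &= s\left(2\vee(cs_{in})\,\epsilon(cs_{in}) + \epsilon(cs_{in})^2\right) \\
  &= s\left(2 \cdot s2^{n-1} \cdot s(2^r - 1) + s^2(2^r - 1)^2\right) \\
  &= s^3(2^r - 1)\left(2^n + 2^r - 1\right). \\[6pt]
\text{For } n = 32,\ r = 8,\ s = 32:\quad
\epsilon(cs_{out}) &= 2^{10} \cdot 255 \cdot 2^{31} = 255 \cdot 2^{41}, \\
\epsilon(cs_{out,c}) &= 2^{15} \cdot 255 \cdot \left(2^{32} + 255\right).
\end{align*}
```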
5 Implementation
The datapath of the fault-tolerant matrix multiplication accelerator used in this work is shown in Fig. 1. At its core lie s + 1 identical multiply-accumulators (MACs), each responsible for calculating the values of elements within exactly one column of output matrix $C_f$. All data is signed fixed-point, with n-bit input elements and 2n-bit outputs. Wide RAMs (ns-bit input and 2n(s + 1)-bit output) prevent starvation, allowing complete matrix rows to be accessed on a cycle-by-cycle basis. When r = 0, the paths for $A_c$ and $B_r$ are $n + \log_2 s$ bits per element: this prevents overflow within the input checksums, allowing output checksums to be valid up to the required 2n bits.
[Figure 1: block diagram of the datapath. The input RAM (2s × ns) feeds the checksum generator, which supplies $A_c$ and $B_r$, at $n + \max(\log_2 s - r, 0)$ bits per element, to the s + 1 multiply-accumulators. Their outputs form $C_f$, 2n(s + 1) bits wide, which is stored in the output RAM ((s + 1) × 2n(s + 1)) and passed to the checksum verifier.]
Fig. 1. Datapath
Checksum generation and verification logic, shown in Fig. 2, serves to perform the ABFT procedures described in Sect. 3. Rows of B are first fetched in turn such that the checksums in $B_r$ can be calculated. The adder, narrow register and $cs_r$ RAM are used for this purpose. Multiplication proceeds thereafter: A's first row is stored in the wide register, then the rows of $B_r$ are presented in turn to the MACs for computation. Calculation of $A_c$, using the adder and $cs_c$ RAM as an accumulator, occurs on the fly as its columns are consumed. These steps are repeated until all rows of A have been accessed. On the output side, complete rows of $C_f$ are verified immediately after being stored, in a similar manner to generation; results are fed into shift registers for later analysis. Note that, when r = 0, the right-shifters do not exist and, since no output error is tolerated, the comparison logic shown in the dashed rectangle reduces to just two comparators.
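[Figure 2: block diagram of the checksum generation and verification logic. On the generation side, elements from the input RAM pass through an s : 1 multiplexer and a right-shifter (≫ r) to an adder with $cs_c$ and $cs_r$ RAMs, each s × (n − r + $\log_2 s$) bits, producing $A_c$ and $B_r$. On the verification side, elements from the output RAM pass through an (s + 1) : 1 multiplexer and a right-shifter (≫ r) to adders, subtractors and a $cs_c$ RAM of (s + 1) × (2n − r) bits; absolute differences are compared against thresholds θ and $θ_c$ to produce the row and column checksum OK flags.]
Fig. 2. Checksum generation and verification logic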
When r > 0, output checksum error must be tolerated up to the levels theorised in Sect. 4. Clearly, there is no reason to actually perform the left-shifting shown in the explanation of the truncation procedure; for this reason, error thresholds $\theta$ and $\theta_c$ for $cs_{out}$ and $cs_{out,c}$ elements, respectively, need to be based upon, not equal to, $\epsilon(cs_{out})$ and $\epsilon(cs_{out,c})$. $cs_{out}$ elements have their widths reduced by r bits due to the right-shifter in Fig. 2's checksum generation logic; as a result,

$$\theta = \frac{\epsilon(cs_{out})}{2^r} = \frac{s^2(2^r - 1)2^{n-1}}{2^r} \approx s^2 2^{n-1}.$$

$cs_{out,c}$ elements, however, are subject to magnitude reduction by both right-shifters, so

$$\theta_c = \frac{\epsilon(cs_{out,c})}{2^{2r}} = \frac{s^3(2^r - 1)(2^n + 2^r - 1)}{2^{2r}} \approx s^3 2^{n-r}.$$

Note that the per-element paths for $A_c$ and $B_r$ are each $n + \max(\log_2 s - r, 0)$-bit to optimally fit the single largest element.
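The threshold expressions and path widths above can be evaluated numerically as follows. This is a sketch only: the function names are illustrative, the thresholds are kept as floating-point values because the text does not state how the hardware rounds them, and s is assumed to be a power of two.

```python
def thresholds(n, r, s):
    """Return (theta, theta_c) derived from the Sect. 4 error bounds."""
    eps_out = s**2 * (2**r - 1) * 2**(n - 1)
    eps_corner = s**3 * (2**r - 1) * (2**n + 2**r - 1)
    return eps_out / 2**r, eps_corner / 2**(2 * r)


def approx_thresholds(n, r, s):
    """The approximations s^2 2^(n-1) and s^3 2^(n-r)."""
    return float(s**2 * 2**(n - 1)), s**3 * 2.0**(n - r)


def path_width(n, r, s):
    """Per-element width of the A_c and B_r paths in bits."""
    log2_s = (s - 1).bit_length()      # equals log2(s) for powers of two
    return n + max(log2_s - r, 0)


n, s = 32, 32
exact = thresholds(n, 8, s)
approx = approx_thresholds(n, 8, s)
ratios = [e / a for e, a in zip(exact, approx)]
widths = [path_width(n, r, s) for r in (0, 4, 8)]
```

For r = 8 the exact values sit within about 0.4% of the approximations, and the path width falls from 37 bits at r = 0 to 32 bits once r reaches $\log_2 s$.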
All hardware shown in Figs. 1 and 2 was implemented in the programmable
logic portion of a Xilinx Zynq-7000 XC7Z020 system-on-chip. Supporting hard-
ware, formed of Xilinx IP cores, included BRAM and direct memory access
controllers for facilitating data transfer between BRAM and off-chip dynamic
RAM. One of the XC7Z020’s two hard ARM CPU cores was used as a controller
to trigger memory transfers and accelerator runs. The CPU is not integral to
the functionality of the developed hardware.
6 Area and Performance Overheads
Designs were compiled using Xilinx Vivado 2014.4 for each combination of s ∈ {2, 4, 8, 16, 32} and r ∈ {0, 4, 8, 12, 16, 20, 24}. A set of baseline designs without ABFT protection was also produced, and n was 32 in all cases. Figure 3 shows the total area overhead, with area calculated as the mean µ(LUT (%), FF (%), BRAM (%), DSP (%)), versus the equivalently sized design without ABFT. The matrix size s was limited to 32 by the FPGA targeted.
[Figure 3: two panels, "Total resources" (Δ total resources (%)) and "fmax" (Δ fmax (%)), each plotted against truncation width r (bits) for s = 2, 4, 8, 16 and 32.]
Fig. 3. Resource usage and fmax vs unprotected design
Area overhead initially increased for r > 0 in all cases other than s = 32. This
was primarily due to the introduction of the subtractors shown in Fig. 2. Gains
were realised in the s = 16 case for r ≥ 8, r ≥ 16 in the s = 8 case and r ≥ 20
in the remaining two. The maximum area gain, again for s = 32 and r = 28,
was 23.8%. It should be noted that BRAM and DSP usage are independent of
r since truncation affects only the checksum generation and verification logic,
which is devoid of multipliers and contains only small, distributed memories.
The reported timing model fmax of each design was also recorded. Changes versus the equivalently sized unprotected designs are captured in Fig. 3. To overcome the effects of CAD noise, linear regressions are included for each plot, shown as dashed lines. Note that the value of r chosen does not affect the (clock cycle) latency of a design versus its standard ABFT equivalent, allowing fmax to be used for performance comparison directly. The fmax reductions are significant to begin with: for s = 32 and r = 0, fmax dropped by 40.8%. Designs with r > 0 exhibited relatively small performance improvements: for s = 32, the impact on frequency fell by 7.23%. Although trends for smaller s are actually negative, those for larger s are positive. This is a result of the lack of severe output truncation and the introduction of additional logic. Nevertheless, frequency gains were realised for larger designs, with s = 32 showing increasing gains for r ≠ 28.
7 Fault Observability
Functional simulations were performed to assess the fault observability of the
proposed designs across the range of implementation variables used in Sect. 6
under the presence of both permanent and transient faults. The fault model
applied was that of individually targeted stuck-at-one accumulator bits. Such
faults were chosen since they are representative of a range of phenomena under
different conditions, e.g. worn transistors or bridged interconnects in the case
of permanent faults and register or memory upsets in the case of transients.
Accumulator outputs were manipulated since these components lie at the ends
of the datapaths of interest, reducing the probability of logical masking and
representing somewhat of a worst-case operating scenario. For each combination
of implementation variables, the following steps were repeated 1,024,000 times:
1. Generate two s × s matrices, A and B, and populate their elements with n-bit signed integer data selected randomly from a uniform distribution.
2. Add checksumming to A and B to form $A_c$ and $B_r$ as explained in Sect. 3.
3. Perform $C_f = A_c B_r$, element-wise modulo-$2^{2n}$.
4. Perform $C'_f = A_c B_r$, element-wise modulo-$2^{2n}$, with fault emulation:
   – For a permanent fault, select a (column, bit) combination from a uniform distribution. During all accumulation steps, force this bit high.
   – For a transient fault, select a (row, column, step, bit) combination from a uniform distribution. Force this bit high during computation.
5. If comparison of data and checksums within $C_f$ and $C'_f$ reveals that the fault was missed, record the maximum absolute error of $C'_f$'s data elements.
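The steps above can be sketched in software for the transient-fault case. The Python below is an illustration under stated assumptions rather than the simulation flow used in this work: it covers only r = 0 (no truncation or thresholds), treats a fault as missed when both discrepancy vectors are zero modulo $2^{2n}$, forces the bit high immediately after the selected accumulation step, and uses small illustrative values of s = 4 and n = 8 with 500 trials. All function names are hypothetical.

```python
import random


def mac(a_c, b_r, row, col, mask, fault=None):
    """Accumulate element (row, col) of C_f modulo 2^(2n).

    fault, if given, is a (step, bit) pair: that accumulator bit is forced
    high immediately after the given accumulation step (assumed timing).
    """
    acc = 0
    for step in range(len(b_r)):
        acc = (acc + a_c[row][step] * b_r[step][col]) & mask
        if fault is not None and fault[0] == step:
            acc |= 1 << fault[1]
    return acc


def signed(value, bits):
    """Interpret an unsigned residue as a two's-complement integer."""
    return value - (1 << bits) if value >> (bits - 1) else value


def transient_trial(s, n, rng):
    """One pass of steps 1 to 5 for a transient fault with r = 0."""
    mask = (1 << (2 * n)) - 1
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    a = [[rng.randint(lo, hi) for _ in range(s)] for _ in range(s)]
    b = [[rng.randint(lo, hi) for _ in range(s)] for _ in range(s)]
    a_c = a + [[sum(col) for col in zip(*a)]]           # step 2
    b_r = [row + [sum(row)] for row in b]
    site = (rng.randrange(s + 1), rng.randrange(s + 1))  # (row, column)
    fault = (rng.randrange(s), rng.randrange(2 * n))     # (step, bit)

    size = range(s + 1)
    good = [[mac(a_c, b_r, i, j, mask) for j in size] for i in size]
    bad = [[mac(a_c, b_r, i, j, mask, fault if (i, j) == site else None)
            for j in size] for i in size]

    # Verification of the faulty result, modulo 2^(2n)
    delta_c = [(sum(col[:-1]) - col[-1]) & mask for col in zip(*bad)]
    delta_r = [(sum(row[:-1]) - row[-1]) & mask for row in bad]
    detected = any(delta_c) or any(delta_r)

    # Worst error over data elements only (checksums excluded)
    max_abs_error = max(
        abs(signed(bad[i][j], 2 * n) - signed(good[i][j], 2 * n))
        for i in range(s) for j in range(s))
    return detected, max_abs_error


rng = random.Random(0)
results = [transient_trial(4, 8, rng) for _ in range(500)]
missed_errors = [err for detected, err in results if not detected]
```

With r = 0, every missed fault in this model is one whose targeted bit was already high, so the recorded errors are all zero. Modelling r > 0 would additionally require the truncated checksums and the thresholds θ and $θ_c$.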
Figure 4 shows the means of errors encountered within results flagged as false negatives, i.e. those in which the fault was missed. Assuming that unmissed errors are able to be corrected, Fig. 4's results therefore represent the average expected worst-element errors introduced by RP-ABFT. They indicate that RP-ABFT allows only relatively small errors to propagate, particularly when r is small. It is around the first inflection seen in the permanent fault plots in Fig. 4 that the detection logic starts to become ineffective. Consider s = 32 in Sect. 4. Setting $\epsilon(cs_{out,c}) = 2^{63}$, i.e. $\vee(d_{out})$ for n = 32, reveals that at r ≈ 16 corner checksums cease to be effective. Similarly, setting $\epsilon(cs_{out}) = 2^{63}$ for the same s and n shows that all checksumming is rendered useless at r ≈ 22.
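The two crossover points quoted above can be recovered numerically by treating r as a real number and solving each Sect. 4 bound for the value $2^{63}$. The bisection routine and its name are illustrative assumptions; any root finder would serve.

```python
def eps_cs_out(n, r, s):
    """Error bound for ordinary output checksums (Sect. 4)."""
    return s**2 * (2.0**r - 1) * 2.0**(n - 1)


def eps_cs_out_corner(n, r, s):
    """Error bound for the corner output checksum (Sect. 4)."""
    return s**3 * (2.0**r - 1) * (2.0**n + 2.0**r - 1)


def crossover(bound, n, s, target, iterations=60):
    """Bisect for the real-valued r at which bound(n, r, s) reaches target."""
    lo, hi = 0.0, float(n)       # both bounds increase monotonically with r
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if bound(n, mid, s) < target:
            lo = mid
        else:
            hi = mid
    return hi


n = s = 32
target = 2.0 ** (2 * n - 1)      # v(d_out) = 2^63 for n = 32
r_corner = crossover(eps_cs_out_corner, n, s, target)
r_row_col = crossover(eps_cs_out, n, s, target)
```

Both roots lie marginally above the integers 16 and 22, consistent with the approximate values given in the text.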
[Figure 4: two panels, "Permanent faults" and "Transient faults", each plotting µ(MAE) against truncation width r (bits) for s = 2, 4, 8, 16 and 32.]
Fig. 4. Means of maximum absolute errors encountered within false negative results
8 Conclusion
In this paper, we introduced reduced-precision algorithm-based fault tolerance,
or RP-ABFT. RP-ABFT with LSB-first checksum truncation was theorised and
implemented in hardware using matrix multiplication as a case study. Our results
showed that meaningful overhead reduction can be achieved by sacrificing some
fault observability. Our future work on RP-ABFT will explore the false positive-to-false negative tradeoffs achievable through the manipulation of output error threshold values. We will also explore enhancements to the checksumming logic, particularly for performance, as well as output-only truncation to introduce additional tradeoff data points.
The authors acknowledge the support of the EPSRC-funded PRiME project
(http://www.prime-project.org); grant number EP/K034448/1.
References
1. Braun, C., et al.: A-ABFT: Autonomous Algorithm-based Fault Tolerance for Matrix Multiplications on Graphics Processing Units. In: International Conference on Dependable Systems and Networks (DSN) (2014)
2. Davis, J.J., et al.: Datapath Fault Tolerance for Parallel Accelerators. In: International Conference on Field-Programmable Technology (FPT) (2013)
3. Davis, J.J., et al.: Achieving Low-overhead Fault Tolerance for Parallel Accelerators with Dynamic Partial Reconfiguration. In: International Conference on Field-programmable Logic and Applications (FPL) (2014)
4. Huang, K.H., et al.: Algorithm-based Fault Tolerance for Matrix Operations. IEEE Transactions on Computers C-33(6) (1984)
5. Jacobs, A., et al.: Overhead and Reliability Analysis of Algorithm-based Fault Tolerance in FPGA systems. In: International Conference on Field Programmable Logic and Applications (FPL) (2012)
6. Rexford, J., et al.: Algorithm-based Fault Tolerance for Floating-point Operations in Massively Parallel Systems. In: International Symposium on Circuits and Systems (ISCAS). vol. 2 (1992)
7. Wang, S.J., et al.: Algorithm-based Fault Tolerance for FFT Networks. IEEE Transactions on Computers 43(7) (1994)
