Reducing overheads for fault-tolerant datapaths with dynamic partial reconfiguration by Davis, J & Cheung, PYK
Reducing Overheads for Fault-tolerant Datapaths with
Dynamic Partial Reconfiguration
James J. Davis and Peter Y. K. Cheung
Department of Electrical and Electronic Engineering
Imperial College London
London, SW7 2AZ, United Kingdom
E-mail: {james.davis06, p.cheung}@imperial.ac.uk
Abstract—As process scaling and transistor count inflation
continue, silicon chips are becoming increasingly susceptible
to faults. Although FPGAs are particularly susceptible to the
effects of such scaling, their runtime reconfigurability offers
unique opportunities for fault-tolerance. This work presents
an application combining algorithmic-level fault detection with
dynamic partial reconfiguration to allow faults manifested
within its datapath at runtime to be circumvented.
Keywords-Algorithm-based fault-tolerance, dynamic partial
reconfiguration, error recovery, matrix multiplication.
I. INTRODUCTION
By tailoring fault-tolerant hardware to the application it
protects, it is possible to reduce overheads without caus-
ing correspondingly large decreases in fault vulnerability.
Algorithm-based fault tolerance (ABFT), applicable to a
wide range of linear algebra operations [1], is an established
method for achieving such ends. Originally conceived
for many-core applications, ABFT has been implemented
in FPGAs to realise hardened matrix multiplication
designs [2]. Our recent work [3] used the same operator
as a case study for the implementation of a complete
fault tolerance system, using ABFT for error detection and
additional logic for subsequent fault avoidance.
Dynamic partial reconfiguration (DPR) has previously been
explored as a means of facilitating runtime fault avoidance,
that is, exploiting reconfigurability to work around faulty
components. Such non-application-specific schemes, however,
suffer from either high detection latency or low fault
tolerability [4].
Here, we revisit ABFT-protected matrix multiplication,
using DPR in place of additional logic to route around faults.
II. ERROR DETECTION AND FAULT AVOIDANCE
Central to this work is a hardware matrix multiplier,
unrolled to perform computations on entire rows in parallel,
implemented on a Xilinx Zynq system-on-chip. The check-
sum generation and verification required for ABFT [3] is
performed by ‘bolt-on’ logic added to the input and output
sides of the datapath, while the datapath itself is expanded
by one multiply-accumulator (MAC) to mirror the matrices’
expansion. What is most attractive about ABFT in this case
is that the error detection circuitry represents a one-off fixed
cost; the proportional area overhead incurred through its
addition decreases as the problem size increases.
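The checksum scheme referred to above can be illustrated in software. The following is a minimal sketch of checksum-augmented matrix multiplication in the style of [1], not a model of our hardware: the function names are hypothetical, and integer data are assumed so that checksums can be compared exactly (floating-point data would need a tolerance).

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def with_column_checksums(a):
    """Append one row holding the sum of each column of a."""
    return [list(row) for row in a] + [[sum(col) for col in zip(*a)]]


def with_row_checksums(b):
    """Append one column holding the sum of each row of b."""
    return [list(row) + [sum(row)] for row in b]


def abft_multiply(a, b):
    """Multiply the expanded inputs, giving a product with one extra
    checksum row and one extra checksum column."""
    return matmul(with_column_checksums(a), with_row_checksums(b))


def checksums_match(c_full):
    """True if every row and column agrees with its checksum element."""
    rows_ok = all(sum(row[:-1]) == row[-1] for row in c_full)
    cols_ok = all(sum(col[:-1]) == col[-1] for col in zip(*c_full))
    return rows_ok and cols_ok


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = abft_multiply(a, b)
print(c_full)                   # [[19, 22, 41], [43, 50, 93], [62, 72, 134]]
print(checksums_match(c_full))  # True
```

The extra output column in this sketch corresponds to the one additional MAC in the datapath: an n-column result requires n + 1 column computations, so the relative cost of that extra unit falls as n grows.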
Faults that occur within the datapath can be attributed
to particular MACs because each is used to calculate
the elements of exactly one output matrix column.
When multiple faults occur, the resulting combination of
checksum mismatches identifies all of the faulty MACs at once.
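As an illustration of this column-to-MAC attribution, the sketch below checks each column of a checksum-augmented product against its final (checksum) element and reports the columns that disagree. It is an assumption-laden software analogue rather than our detection circuit: `locate_faulty_macs` is a hypothetical name, data are integers, and MAC j is assumed to have produced every element of column j, including the checksum element. Errors that happen to cancel within a column would go unnoticed by this simple test.

```python
def locate_faulty_macs(c_full):
    """Return indices of columns whose data do not sum to their checksum.

    Assumes column j, including its final checksum element, was produced
    entirely by MAC j, so a mismatch in column j implicates MAC j.
    """
    return [j for j, col in enumerate(zip(*c_full))
            if sum(col[:-1]) != col[-1]]


# Fault-free checksum-augmented product of [[1, 2], [3, 4]] and [[5, 6], [7, 8]].
golden = [[19, 22, 41], [43, 50, 93], [62, 72, 134]]

# Inject an error into one element of column 1.
single = [row[:] for row in golden]
single[0][1] += 4

# Additionally inject an error into column 2 (the checksum column).
double = [row[:] for row in single]
double[1][2] -= 1

print(locate_faulty_macs(golden))  # []
print(locate_faulty_macs(single))  # [1]
print(locate_faulty_macs(double))  # [1, 2]
```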
Since identical parallel MACs are used to perform com-
plete multiplications, the mapping of those units to the out-
put matrix columns they represent is inconsequential to the
results. For this reason, once one or more are diagnosed as
faulty, the remaining healthy units can be substituted for
them, multiple times per operation if necessary. Such action
reduces parallelism, maintaining accurate computation at
the expense of increased runtime.
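The trade-off between parallelism and runtime can be sketched as a simple scheduling problem. The assignment policy below is purely hypothetical (the text above does not prescribe one): healthy MACs first compute their own columns, then the columns orphaned by faulty MACs are handed to healthy units in further passes. The name `schedule_columns` and the pass-based model are assumptions made for illustration.

```python
def schedule_columns(num_macs, faulty):
    """Assign each output column to a healthy MAC, one pass at a time.

    Returns a list of passes; each pass maps a healthy MAC index to the
    output column it computes in that pass. Hypothetical policy: healthy
    MACs compute their own columns first, then cover orphaned columns.
    """
    faulty = {m for m in faulty if 0 <= m < num_macs}
    healthy = [m for m in range(num_macs) if m not in faulty]
    if not healthy:
        raise ValueError("no healthy MACs remain")

    # Pass 1: every healthy MAC computes the column it normally owns.
    passes = [{m: m for m in healthy}]

    # Later passes: columns owned by faulty MACs are shared out among
    # the healthy ones, as many per pass as there are healthy MACs.
    orphaned = sorted(faulty)
    for start in range(0, len(orphaned), len(healthy)):
        chunk = orphaned[start:start + len(healthy)]
        passes.append(dict(zip(healthy, chunk)))
    return passes


print(schedule_columns(4, []))         # [{0: 0, 1: 1, 2: 2, 3: 3}]
print(schedule_columns(4, [1]))        # [{0: 0, 2: 2, 3: 3}, {0: 1}]
print(len(schedule_columns(4, [0, 1, 2])))  # 4
```

Under this model the number of passes, and hence the runtime, grows as healthy units are lost, while every column is still computed exactly once by a healthy MAC.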
Fed with fault location information obtained by the ABFT
hardware, our system uses DPR to apply the partial bitstreams
needed to route data around faulty MACs, each representing
a different routing configuration for the input and output
sides of the datapath. Since relatively few nets are
affected, the partial bitstreams are small and consequently
take only tens to a few hundred µs to apply.
III. CONCLUSION
Our experiments have shown the combination of ABFT
and DPR to be effective in creating robust hardware with
low overheads. When compared with our previous work, the
replacement of dedicated rerouting logic with DPR resulted
in an area overhead reduction from 17.3% to 9.01% for our
largest tested design. Future work will focus on applying
our ABFT-based approach at differing levels of precision
and extending it to additional operators.
REFERENCES
[1] K.-H. Huang and J. Abraham, “Algorithm-based fault tolerance
for matrix operations,” IEEE Transactions on Computers, vol.
C-33, no. 6, pp. 518–528, 1984.
[2] A. Jacobs, G. Cieslewski, and A. George, “Overhead and
reliability analysis of algorithm-based fault tolerance in FPGA
systems,” in International Conference on Field Programmable
Logic and Applications (FPL), 2012, pp. 300–306.
[3] J. Davis and P. Cheung, “Datapath fault tolerance for parallel
accelerators,” in International Conference on Field-Programmable
Technology (FPT), 2013, pp. 366–369.
[4] E. Stott, P. Sedcole, and P. Cheung, “Fault tolerance and
reliability in field-programmable gate arrays,” IET Computers &
Digital Techniques, vol. 4, no. 3, pp. 196–210, 2010.
