slides

Achieving Low-overhead Fault Tolerance for Parallel Accelerators with Dynamic Partial Reconfiguration

Abstract

© 2014 Technical University of Munich (TUM).While allowing for the fabrication of increasingly complex and efficient circuitry, transistor shrinkage and count-per-device expansion have major downsides: chiefly increased variation, degradation and fault susceptibility. For this reason, design-time consideration of fault tolerance will have to be given to increasing numbers of electronic systems in the future to ensure yields, reliabilities and lifetimes remain acceptably high. Many commonly implemented operators are suited to modification resulting in datapath error detection capabilities with low area overheads. FPGAs are uniquely placed to allow further area savings to be made when incorporating fault avoidance mechanisms thanks to their dynamic reconfigurability. In this paper, we examine the practicalities and costs involved in implementing hardware-software fault tolerance on a test platform: a parallel matrix multiplication accelerator in hardware, with controller in software, running on a Xilinx Zynq system-on-chip. A combination of bolt-on error detection logic and software-triggered routing reconfiguration serve to provide low-overhead datapath fault tolerance at runtime. Rapid yet accurate fault diagnoses along with low hardware (area), software (configuration storage) and performance penalties are achieved

    Similar works