To maximize the performance and energy efficiency of Spiking Neural Network
(SNN) processing on resource-constrained embedded systems, specialized hardware
accelerators/chips are employed. However, these SNN chips may suffer from
permanent faults which can affect the functionality of weight memory and neuron
behavior, thereby causing potentially significant accuracy degradation and
system malfunctioning. Such permanent faults may come from manufacturing
defects during the fabrication process, and/or from device/transistor damage
(e.g., due to wear-out) during run-time operation. Yet the impact of
permanent faults in SNN chips and the respective mitigation techniques have not
been thoroughly investigated. Toward this, we propose RescueSNN, a novel
methodology to mitigate permanent faults in the compute engine of SNN chips
without requiring additional retraining, thereby significantly cutting down the
design time and retraining costs, while maintaining the throughput and quality.
The key ideas of our RescueSNN methodology are (1) analyzing the
characteristics of SNN under permanent faults; (2) leveraging this analysis to
improve the SNN fault-tolerance through effective fault-aware mapping (FAM);
and (3) devising lightweight hardware enhancements to support FAM. Our FAM
technique leverages the fault map of the SNN compute engine for (i) minimizing
weight corruption when mapping weight bits on the faulty memory cells, and (ii)
selectively employing faulty neurons that do not cause significant accuracy
degradation to maintain accuracy and throughput, while considering the SNN
operations and processing dataflow. The experimental results show that
RescueSNN improves accuracy by up to 80% while keeping the throughput
reduction below 25% at a high fault rate (e.g., 0.5 of the potential fault
locations), compared to running SNNs on the faulty chip without mitigation.

Comment: Accepted for publication at Frontiers in Neuroscience - Section
Neuromorphic Engineering
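
To make the weight-mapping part of FAM more concrete, the following is a minimal sketch of one way a fault map could steer weight bits away from faulty memory cells. It is an illustration under stated assumptions, not the RescueSNN implementation: it assumes an 8-bit weight word, a per-word fault map given as a bitmask (1 = faulty cell), and a mapping restricted to circular rotations of the bit order. The names `rotation_cost` and `best_rotation` are hypothetical.

```python
# Illustrative sketch only: the rotation-based mapping, the 8-bit word width,
# and the function names are assumptions, not details taken from the paper.

def rotation_cost(fault_mask: int, shift: int, width: int = 8) -> int:
    """Worst-case weight error if weight bit i is stored in cell (i + shift) % width.

    fault_mask has bit c set when memory cell c is faulty. A faulty cell
    holding weight bit i can corrupt the weight by up to 2**i.
    """
    cost = 0
    for bit in range(width):
        cell = (bit + shift) % width
        if (fault_mask >> cell) & 1:
            cost += 1 << bit
    return cost


def best_rotation(fault_mask: int, width: int = 8) -> int:
    """Return the rotation that places faulty cells under the least significant bits."""
    return min(range(width), key=lambda s: rotation_cost(fault_mask, s, width))


if __name__ == "__main__":
    mask = 0b11000000  # cells 6 and 7 are faulty
    shift = best_rotation(mask)
    print(shift, rotation_cost(mask, shift), rotation_cost(mask, 0))
```

In this sketch, a word whose two highest cells are faulty is rotated so that only the two least significant weight bits land on them, which bounds the worst-case corruption far below that of the unrotated layout. A real design would also need to account for the fault type and for the hardware cost of undoing the rotation on read.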