Evaluation of Algorithm-Based Fault Tolerance for Machine Learning and Computer Vision under Neutron Radiation

Abstract

In the past decade, there has been a push to deploy commercial-off-the-shelf (COTS) avionics, driven in part by lower cost and the desire for more performance. Traditional radiation-hardened processors are expensive and provide only limited processing power. With smaller mission budgets and the need for more computational power, low-cost, high-performance COTS solutions become more attractive for these missions. Because of the computational capacity that COTS technology provides, machine-learning and computer-vision applications are now being deployed on modern space missions. However, COTS electronics are highly susceptible to radiation environments, so the reliability of the underlying computations becomes a concern. Matrix multiplication is the main computation behind decisions in machine-learning and computer-vision applications, making it a critical part of the application. Moreover, the large time and memory footprint of matrix multiplication in these applications makes them even more susceptible to single-event upsets. In this thesis, algorithm-based fault tolerance (ABFT) is investigated to mitigate silent data errors in machine learning and computer vision. ABFT is a methodology for data-error detection and correction that uses information redundancy held in data structures separate from the primary data. In matrix multiplication, ABFT consists of storing checksum data in vectors separate from the matrix and using them for error detection and correction. Fault injection into a matrix-multiplication kernel was performed prior to irradiation. The kernel was then irradiated with wide-spectrum neutrons at Los Alamos Neutron Science Center to observe the mitigation effects of ABFT. Fault injections targeting the general-purpose registers show a 48× reduction in data errors when ABFT is used, with a negligible change in run-time.
Cross-section results from irradiation show a 5.3× improvement in reliability with ABFT compared with no mitigation, at a >99.9999% confidence level. The results of this experiment demonstrate that ABFT is a viable solution for run-time error correction in matrix multiplication for machine-learning and computer-vision applications in future spacecraft.
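
The checksum idea described in the abstract can be illustrated with a minimal sketch. This is not the kernel evaluated in the thesis: the function names (`matmul`, `abft_checksums`, `abft_verify_and_correct`) and the small integer matrices are assumptions chosen for illustration, and integers are used so that checksums can be compared exactly (a floating-point kernel would need a tolerance). The expected row and column sums of C = A·B are computed from A and B alone and kept in vectors separate from C; a single corrupted element then shows up as exactly one mismatched row and one mismatched column, which locates it and gives the correction.

```python
def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def abft_checksums(a, b):
    """Expected row sums and column sums of A*B, computed without forming A*B."""
    k = len(b)
    # Row sums of C equal A times the vector of row sums of B.
    b_row_sums = [sum(row) for row in b]
    row_check = [sum(a[i][p] * b_row_sums[p] for p in range(k))
                 for i in range(len(a))]
    # Column sums of C equal the vector of column sums of A times B.
    a_col_sums = [sum(a[i][p] for i in range(len(a))) for p in range(k)]
    col_check = [sum(a_col_sums[p] * b[p][j] for p in range(k))
                 for j in range(len(b[0]))]
    return row_check, col_check


def abft_verify_and_correct(c, row_check, col_check):
    """Compare C against the separate checksum vectors; fix one bad element in place."""
    bad_rows = [i for i, row in enumerate(c) if sum(row) != row_check[i]]
    bad_cols = [j for j in range(len(c[0]))
                if sum(c[i][j] for i in range(len(c))) != col_check[j]]
    if not bad_rows and not bad_cols:
        return "ok"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The row-sum discrepancy is exactly the error in element (i, j).
        c[i][j] += row_check[i] - sum(c[i])
        return "corrected"
    # Multiple errors, or a fault in a checksum vector itself: recompute instead.
    return "uncorrectable"


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
row_check, col_check = abft_checksums(a, b)

c = matmul(a, b)
clean_status = abft_verify_and_correct(c, row_check, col_check)

c[1][0] ^= 1 << 4  # emulate a single bit flip in one element of the result
faulty_status = abft_verify_and_correct(c, row_check, col_check)
```

In this sketch the checksum vectors cost two extra matrix-vector products, which is small next to the matrix product itself; that overhead figure applies to the sketch only and says nothing about the measured run-time of the thesis kernel.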
