COMET: Communication-optimised multi-threaded error-detection technique

Abstract

© 2016 ACM. Relentless technology scaling has made transistors more vulnerable to soft, or transient, errors. To keep systems robust against these, current error detection techniques use different types of redundancy at the hardware or the software level. A consequence of these additional protection mechanisms is that these systems tend to become slower. In particular, software error-detection techniques degrade performance considerably, limiting their uptake. This paper focuses on software redundant multi-threading error detection, a compiler-based technique that makes use of redundant cores within a multi-core system to perform error checking. Implementations of this scheme feature two threads that execute almost the same code: the main thread runs the original code and the checker thread executes code to verify the correctness of the original. The main thread communicates the values that require checking to the checker thread to use in its comparisons. We identify a major performance bottleneck in existing schemes: poorly performing inter-core communication and the generated code associated with it. Our study shows this is a major performance impediment within existing techniques since the two threads require extremely fine-grained communication, on the order of every few instructions. We alleviate this bottleneck with a series of code generation optimisations at the compiler level. We propose COMET (Communication-Optimised Multi-threaded Error-detection Technique), which improves performance across the NAS parallel benchmarks by 31.4% (on average) compared to the state-of-the-art, without affecting fault-coverage

    Similar works