1 research outputs found

    Byzantine Anomaly Testing for Charm++: Providing Fault Tolerance and Survivability for Charm++ Empowered Clusters

    No full text
    Recently shifts in high-performance computing have increased the use of clusters built around cheap commodity processors. A typical cluster consists of individual nodes, containing one or several processors, connected together with a highbandwidth, low-latency interconnect. There are many benefits to using clusters for computation, but also some drawbacks, including a tendency to exhibit low Mean Time To Failure (MTTF) due to the sheer number of components involved. Recently, a number of fault-tolerance techniques have been proposed and developed to mitigate the inherent unreliability of clusters. These techniques, however, fail to address the issue of detecting non-obvious faults, particularly Byzantine faults. At present, effectively detecting Byzantine faults is an open problem. We describe the operation of ByzwATCh, a module for run-time detecting Byzantine hardware errors as part of the Charm++ parallel programming framework
    corecore