2 research outputs found

    Resilient gossip-inspired all-reduce algorithms for high-performance computing - Potential, limitations, and open questions

    We investigate the usefulness of gossip-based reduction algorithms in a high-performance computing (HPC) context. We compare them to state-of-the-art deterministic parallel reduction algorithms in terms of fault tolerance and resilience against silent data corruption (SDC) as well as in terms of performance and scalability. New gossip-based reduction algorithms are proposed, which significantly improve the state of the art in terms of resilience against SDC. Moreover, a new gossip-inspired reduction algorithm is proposed, which promises a much more competitive runtime performance in an HPC context than classical gossip-based algorithms, in particular for low accuracy requirements.
    This work has been partially funded by the Spanish Ministry of Science and Innovation [contract TIN2015-65316]; by the Government of Catalonia [contracts 2014-SGR-1051, 2014-SGR-1272]; by the RoMoL ERC Advanced Grant [grant number GA 321253]; and by the Vienna Science and Technology Fund (WWTF) through project ICT15-113.
    Peer reviewed. Postprint (author's final draft).
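
For readers unfamiliar with the "classical gossip-based algorithms" used as the baseline above, the following is a minimal sketch of push-sum averaging, a widely known gossip reduction. It is not the new algorithm proposed in the paper. The function name `push_sum_average`, the fully connected topology, the synchronous rounds, and the fixed seed are all assumptions made for illustration.

```python
import random

def push_sum_average(values, rounds=200, seed=0):
    """Simulate classical push-sum averaging (illustrative sketch only).

    Assumptions: fully connected network, synchronous rounds, no failures,
    and at least two nodes.
    """
    rng = random.Random(seed)
    n = len(values)
    s = [float(v) for v in values]  # "sum" mass held by each node
    w = [1.0] * n                   # "weight" mass held by each node
    for _ in range(rounds):
        next_s = [0.0] * n
        next_w = [0.0] * n
        for i in range(n):
            # Pick a uniformly random peer j != i.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            # Keep half of the local mass, push the other half to the peer.
            half_s, half_w = s[i] / 2, w[i] / 2
            next_s[i] += half_s
            next_w[i] += half_w
            next_s[j] += half_s
            next_w[j] += half_w
        s, w = next_s, next_w
    # Each node's local estimate of the global average.
    return [si / wi for si, wi in zip(s, w)]
```

Because every round only moves mass between nodes, the totals of `s` and `w` are conserved, and each ratio `s[i] / w[i]` approaches the global mean. A lost or corrupted message would break that conservation, which is the kind of vulnerability the abstract's resilience comparison concerns.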

    Improving Fault Tolerance and Accuracy of a Distributed Reduction Algorithm

    Abstract: Most existing algorithms for parallel or distributed reduction operations are not able to handle temporary or permanent link and node failures. Only recently, methods were proposed which are in principle capable of tolerating link and node failures as well as soft errors like bit flips or message loss. A particularly interesting example is the push-flow algorithm. However, on closer inspection, it turns out that in this method the failure recovery often implies severe performance drawbacks. Existing mechanisms for failure handling may lead to a fall-back to an early stage of the computation and consequently slow down convergence, or even prevent convergence if failures occur too frequently. Moreover, state-of-the-art fault-tolerant distributed reduction algorithms may experience accuracy problems even in failure-free systems. We present the push-cancel-flow (PCF) algorithm, a novel algorithmic enhancement of the push-flow algorithm. We show that the new push-cancel-flow algorithm exhibits superior accuracy, performance, and fault tolerance over all other existing distributed reduction methods. Moreover, we employ the novel PCF algorithm in the context of a fully distributed QR factorization process and illustrate that the improvements achieved at the reduction level directly translate to higher-level matrix operations, such as the considered QR factorization.
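
To make the flow-based idea concrete, here is a hedged sketch of a push-flow-style averaging scheme as it is commonly described: each node stores, per neighbor, the total flow it has sent, and a receiver mirrors that flow with the opposite sign, so a later successful message repairs an earlier lost one. The abstract does not spell out these details, so the update rule, the function name `push_flow_average`, the fully connected topology, the one-sender-per-step schedule, and the `loss` parameter are assumptions. This is not the PCF algorithm.

```python
import random

def push_flow_average(values, steps=5000, loss=0.2, seed=0):
    """Simulate a push-flow-style averaging scheme (illustrative sketch only).

    Assumptions: fully connected network, one random sender per step,
    each message independently lost with probability `loss`.
    """
    rng = random.Random(seed)
    n = len(values)
    # fv[i][j], fw[i][j]: value/weight flow node i believes it has sent to j.
    fv = [[0.0] * n for _ in range(n)]
    fw = [[0.0] * n for _ in range(n)]

    def estimate(i):
        # Current local mass = initial data minus all outgoing flows.
        return values[i] - sum(fv[i]), 1.0 - sum(fw[i])

    for _ in range(steps):
        i = rng.randrange(n)
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1
        ev, ew = estimate(i)
        # Sender adds half of its current mass to the flow toward j.
        fv[i][j] += ev / 2
        fw[i][j] += ew / 2
        if rng.random() >= loss:
            # Delivered: the receiver overwrites its flow with the negation.
            # The full flow is sent, not a delta, so earlier losses on this
            # edge are healed by this single message.
            fv[j][i] = -fv[i][j]
            fw[j][i] = -fw[i][j]

    result = []
    for i in range(n):
        ev, ew = estimate(i)
        result.append(ev / ew)
    return result
```

For example, `push_flow_average([1.0, 5.0, 9.0, 2.0], loss=0.2)` should return estimates close to the mean 4.25 despite the simulated message loss. The sketch covers only message loss; it does not model bit flips or node failures.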