762 research outputs found

    Epidemic failure detection and consensus for extreme parallelism

    Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum’s User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a failure detection and consensus algorithm. This paper presents three novel failure detection and consensus algorithms using gossiping. Stochastic pinging is used to quickly detect failures during the execution of the algorithm; failures are then disseminated to all the fault-free processes in the system, and consensus on the failures is detected using the three consensus techniques. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that the stochastic pinging detects all the failures in the system. In all the algorithms, the number of gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus. The third approach is a three-phase distributed failure detection and consensus algorithm and provides consistency guarantees even in very large and extreme-scale systems while at the same time being memory and bandwidth efficient.
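The combination of stochastic pinging and gossip dissemination described in this abstract can be illustrated with a toy simulation. The sketch below is not one of the paper's three algorithms: the function name `gossip_failure_consensus`, the push-only gossip rule, and the omniscient termination check (the simulator compares every local list against the true failure set) are all assumptions made for illustration. The paper's algorithms detect consensus in a distributed way, which this sketch does not attempt.

```python
import random


def gossip_failure_consensus(n, failed, seed=0, max_cycles=10_000):
    """Return the number of gossip cycles until every alive process
    knows the full set of failed processes (toy model, hypothetical API)."""
    rng = random.Random(seed)
    failed = set(failed)
    alive = [p for p in range(n) if p not in failed]
    # Each alive process keeps a local list of processes it believes failed.
    known = {p: set() for p in alive}
    for cycle in range(1, max_cycles + 1):
        for p in alive:
            target = rng.randrange(n)
            if target == p:
                continue
            if target in failed:
                # Stochastic ping: no reply, so p records the target as failed.
                known[p].add(target)
            else:
                # Gossip: push p's failure list to the randomly chosen alive peer.
                known[target] |= known[p]
        # Omniscient check used only by the simulator, not by the processes.
        if all(k == failed for k in known.values()):
            return cycle
    raise RuntimeError("no consensus within max_cycles")


if __name__ == "__main__":
    # Cycle counts for growing system sizes with four failed processes.
    for size in (64, 256, 1024, 4096):
        print(size, gossip_failure_consensus(size, failed=range(4)))
```

Running the loop at the bottom for increasing `n` gives a rough feel for how slowly the cycle count grows with system size in this simplified model.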

    Asynchronous epidemic algorithms for consistency in large-scale systems

    Achieving and detecting a globally consistent state is essential to many services in large and extreme-scale distributed systems, especially when the desired consistent state is critical for service operation. Centralised and deterministic approaches to synchronisation and distributed consistency are neither scalable nor fault-tolerant. Alternatively, epidemic-based paradigms are decentralised computations based on randomised communications. They are scalable, resilient, fault-tolerant, and converge to the desired target in logarithmic time with respect to system size. Thus, many distributed services have adopted epidemic protocols to achieve consensus and a consistent state, mainly due to scalability concerns. The convergence of epidemic protocols is stochastically guaranteed. However, the detection of convergence is probabilistic and non-explicit. In a real-world environment, systems are unreliable, and epidemic protocols may fail to converge to the desired state. Thus, achieving convergence by itself does not ensure a system-wide consistent state under dynamic conditions. The research work presented in this thesis introduces the Phase Transition Algorithm (PTA) to achieve a distributed consistent state based on the explicit detection of convergence. Each phase in PTA is a decentralised decision-making process that implements epidemic data aggregation, in which the detection of convergence implies achieving a global agreement. The phases in PTA can be cascaded to achieve higher certainty as desired. Following the PTA, two epidemic protocols, namely PTP and ECP, are proposed to achieve consensus, i.e. consistency in data dissemination and data aggregation. The protocols are examined through simulations, and experimental results have validated the protocols' ability to achieve and explicitly detect consensus among system nodes.
The research work has also studied epidemic data aggregation under node churn and network failures, and the analysis has identified three phases of the aggregation process. The investigations have shown that node churn affects each phase differently. The phase that is critical for the aggregation process has been studied further, which led to the proposal of new robust data aggregation protocols, REAP and REAP+. Each protocol has a different decentralised replication method, and both implement distributed failure detection and instantaneous mass restoration mechanisms. Simulations have validated the protocols, and results have shown the protocols' ability to converge, detect convergence, and produce competitive accuracy under various levels of node churn. Furthermore, distributed consistency in continuous systems is addressed in the research. The work has proposed a novel continuous epidemic protocol with an adaptive restart mechanism. The protocol restarts upon the detection of either system convergence or divergence. The protocol also introduces a seed selection method for the peak data distribution in decentralised approaches, which was previously a challenge requiring single-point initialisation and a leader-election step. The simulations validated the performance of the algorithm under static and dynamic conditions and confirmed that convergence and divergence detection accuracy can be tuned as desired. Finally, the research work shows that combining and integrating the proposed protocols enables extreme-scale distributed systems to achieve and detect globally consistent states even under realistic and dynamic conditions.
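For readers unfamiliar with epidemic data aggregation, the following minimal sketch shows the classic push-pull averaging scheme on which such protocols are commonly built. It is not PTA, PTP, ECP, REAP, or REAP+; the function name `push_pull_average` and the fault-free, fully connected setting are assumptions. The usage at the bottom starts from a peak distribution (one node holds 1, all others hold 0), so the converged average is 1/n and its reciprocal estimates the system size.

```python
import random


def push_pull_average(values, cycles, seed=0):
    """Run push-pull averaging for a fixed number of cycles and return
    the per-node estimates (toy model; requires at least two nodes)."""
    rng = random.Random(seed)
    x = [float(v) for v in values]
    n = len(x)
    for _ in range(cycles):
        for i in range(n):
            # Pick a peer j != i uniformly at random.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            # Both nodes adopt the pair average, which conserves the
            # total "mass" (the sum of all values) in the system.
            x[i] = x[j] = (x[i] + x[j]) / 2.0
    return x


if __name__ == "__main__":
    n = 1000
    peak = [1.0] + [0.0] * (n - 1)  # peak distribution: a single seed node
    estimates = push_pull_average(peak, cycles=40)
    print("size estimate at node 0:", round(1.0 / estimates[0]))
```

Because each exchange conserves the sum, losing a node in mid-exchange removes mass from the system, which is the kind of problem that replication and mass restoration mechanisms are meant to address.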

    Tamper-Resistant Peer-to-Peer Storage for File Integrity Checking.

    “... but there is no compromise between being honest and being crooked, even if it looks so easy, and even if it is so common ...” (Wolfgang Ambros, 1975)
    One of the activities of most successful intruders of a computer system is to modify data on the victim, either to hide their presence and destroy the evidence of the break-in, or to subvert the system completely and make it accessible for further abuse without triggering alarms. File integrity checking is one common method to mitigate the effects of successful intrusions by detecting the changes an intruder makes to files on a computer system. Historically, file integrity checking has been implemented using tools that operate locally on a single system, which imposes considerable restrictions regarding maintenance and scalability. Recent improvements for large-scale environments have introduced trusted central servers which provide secure fingerprint storage and logging facilities, but such centralism presents some new shortcomings.

    Toward Reliable and Efficient Message Passing Software for HPC Systems: Fault Tolerance and Vector Extension

    As the scale of High-performance Computing (HPC) systems continues to grow, researchers have devoted themselves to achieving the best performance when running long computing jobs on these systems. My research focuses on the reliability and efficiency of HPC software. First, as systems become larger, the mean time to failure (MTTF) of these HPC systems is negatively impacted and tends to decrease. Handling system failures becomes a prime challenge. My research aims to present a general design and implementation of an efficient runtime-level failure detection and propagation strategy targeting large-scale, dynamic systems that is able to detect both node and process failures. It uses multiple overlapping topologies to optimize detection and propagation, minimizing the incurred overheads and guaranteeing the scalability of the entire framework. Results from different machines and benchmarks, compared to related work, show that my design and implementation outperforms non-HPC solutions significantly and is competitive with specialized HPC solutions that can manage only MPI applications. Second, I endeavor to explore instruction-level parallelization to achieve optimal performance. Novel processors support long vector extensions, which enable researchers to exploit the potential peak performance of target architectures. Intel introduced Advanced Vector Extensions (AVX2 and AVX-512) instructions for the x86 Instruction Set Architecture (ISA). Arm introduced the Scalable Vector Extension (SVE) with a new set of A64 instructions. Both enable greater parallelism. My research utilizes long vector reduction instructions to improve the performance of MPI reduction operations. I also use the gather and scatter features to speed up the packing and unpacking operations in MPI. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architectures.

    Resilient gossip-inspired all-reduce algorithms for high-performance computing - Potential, limitations, and open questions

    We investigate the usefulness of gossip-based reduction algorithms in a high-performance computing (HPC) context. We compare them to state-of-the-art deterministic parallel reduction algorithms in terms of fault tolerance and resilience against silent data corruption (SDC) as well as in terms of performance and scalability. New gossip-based reduction algorithms are proposed, which significantly improve the state-of-the-art in terms of resilience against SDC. Moreover, a new gossip-inspired reduction algorithm is proposed, which promises a much more competitive runtime performance in an HPC context than classical gossip-based algorithms, in particular for low accuracy requirements. This work has been partially funded by the Spanish Ministry of Science and Innovation [contract TIN2015-65316]; by the Government of Catalonia [contracts 2014-SGR-1051, 2014-SGR-1272]; by the RoMoL ERC Advanced Grant [grant number GA 321253] and by the Vienna Science and Technology Fund (WWTF) through project ICT15-113. Peer Reviewed. Postprint (author's final draft).
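As background for what a classical gossip-based reduction looks like, here is a minimal sketch of the well-known push-sum scheme, in which every node ends up with an estimate of the global average (and hence of the sum, after multiplying by the node count). It is not one of the algorithms proposed in this paper; the function name `push_sum`, the synchronous round structure, and the fault-free setting are assumptions for illustration.

```python
import random


def push_sum(values, rounds, seed=0):
    """Estimate the global average at every node with push-sum gossip
    (toy model: synchronous rounds, no failures, no data corruption)."""
    rng = random.Random(seed)
    n = len(values)
    s = [float(v) for v in values]  # running sums
    w = [1.0] * n                   # running weights
    for _ in range(rounds):
        new_s = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            j = rng.randrange(n)  # random recipient (may be i itself)
            # Node i keeps half of its (sum, weight) pair and sends the
            # other half to j, so total sum and total weight are conserved.
            new_s[i] += s[i] / 2.0
            new_w[i] += w[i] / 2.0
            new_s[j] += s[i] / 2.0
            new_w[j] += w[i] / 2.0
        s, w = new_s, new_w
    # Each node's estimate of the average is its sum divided by its weight.
    return [si / wi for si, wi in zip(s, w)]


if __name__ == "__main__":
    data = list(range(1, 65))
    for rounds in (5, 20, 80):
        est = push_sum(data, rounds)
        worst = max(abs(e - sum(data) / len(data)) for e in est)
        print(rounds, "rounds, worst-case error:", worst)
```

The number of rounds directly trades runtime against accuracy, which is why low accuracy requirements are the favourable case for gossip-style reductions.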

    International BEAT-PCD consensus statement for infection prevention and control for primary ciliary dyskinesia in collaboration with ERN-LUNG PCD Core Network and patient representatives

    Keywords: Primary ciliary dyskinesia; Consensus; Infection prevention
    Introduction: In primary ciliary dyskinesia (PCD) impaired mucociliary clearance leads to recurrent airway infections and progressive lung destruction, and concern over chronic airway infection and patient-to-patient transmission is considerable. So far, there has been no defined consensus on how to control infection across centres caring for patients with PCD. Within the BEAT-PCD network, COST Action and ERS CRC together with the ERN-Lung PCD core a first initiative has now been taken towards creating such a consensus statement. Methods: A multidisciplinary international PCD expert panel was set up to create a consensus statement for infection prevention and control (IP&C) for PCD, covering diagnostic microbiology, infection prevention for specific pathogens considered indicated for treatment and segregation aspects. Using a modified Delphi process, consensus to a statement demanded at least 80% agreement within the PCD expert panel group. Patient organisation representatives were involved throughout the process. Results: We present a consensus statement on 20 IP&C statements for PCD including suggested actions for microbiological identification, indications for treatment of Pseudomonas aeruginosa, Burkholderia cepacia and nontuberculous mycobacteria and suggested segregation aspects aimed to minimise patient-to-patient transmission of infections whether in-hospital, in PCD clinics or wards, or out of hospital at meetings between people with PCD. The statement also includes segregation aspects adapted to the current coronavirus disease 2019 (COVID-19) pandemic. Conclusion: The first ever international consensus statement on IP&C intended specifically for PCD is presented and is targeted at clinicians managing paediatric and adult patients with PCD, microbiologists, patient organisations and not least the patients and their families.
    Funding: Ministerstvo Zdravotnictví Ceské Republiky (grant nr. NV-19-07-00210); European Reference Network for Rare Respiratory Diseases (Project ID No 739546); Børnelungefonden; European Cooperation in Science and Technology (COST Action BM 1407); Danmarks Lungeforening; European Respiratory Society (CRC