762 research outputs found
Epidemic failure detection and consensus for extreme parallelism
Future extreme-scale high-performance computing systems will be required
to work under frequent component failures. The MPI Forum’s User
Level Failure Mitigation proposal has introduced an operation,
MPI_Comm_shrink, to synchronize the alive processes on the list of failed
processes, so that applications can continue to execute even in the presence
of failures by adopting algorithm-based fault tolerance techniques. This
MPI_Comm_shrink operation requires a failure detection and consensus
algorithm. This paper presents three novel failure detection and consensus
algorithms using Gossiping. Stochastic pinging is used to quickly detect
failures during the execution of the algorithm; failures are then disseminated
to all the fault-free processes in the system, and consensus on the
failures is detected using the three consensus techniques. The proposed
algorithms were implemented and tested using the Extreme-scale Simulator.
The results show that the stochastic pinging detects all the failures in
the system. In all the algorithms, the number of Gossip cycles to achieve
global consensus scales logarithmically with system size. The second algorithm
also shows better scalability in terms of memory and network
bandwidth usage and a perfect synchronization in achieving global consensus.
The third approach is a three-phase distributed failure detection
and consensus algorithm and provides consistency guarantees even in very
large and extreme-scale systems while at the same time being memory and
bandwidth efficient.
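The detection and dissemination steps described in this abstract can be illustrated with a toy simulation. The sketch below is not one of the paper's three algorithms: it assumes synchronous cycles, lets each ping double as a push-gossip message, and uses an omniscient check by the simulator in place of distributed consensus detection. The function name `gossip_failure_consensus` and its parameters are hypothetical.

```python
import random


def gossip_failure_consensus(n, failed, seed=0, max_cycles=10_000):
    """Return the number of gossip cycles until every alive process
    knows the complete set of failed ranks (toy simulation)."""
    rng = random.Random(seed)
    failed = set(failed)
    alive = [p for p in range(n) if p not in failed]
    # known[p]: failures that process p has detected or heard about.
    known = {p: set() for p in alive}

    for cycle in range(1, max_cycles + 1):
        # Messages sent in a cycle carry the sender's state from the
        # start of that cycle (synchronous rounds).
        snapshot = {p: set(s) for p, s in known.items()}
        for p in alive:
            # Pick a uniformly random peer other than p.
            target = rng.randrange(n - 1)
            if target >= p:
                target += 1
            if target in failed:
                # The ping gets no reply: p detects the failure.
                known[p].add(target)
            else:
                # The ping doubles as a gossip message carrying p's list.
                known[target] |= snapshot[p]
        # Omniscient check by the simulator, not a distributed one.
        if all(known[p] == failed for p in alive):
            return cycle
    raise RuntimeError("no consensus within max_cycles")
```

Calling the function for growing `n` with a fixed failure set, for example `gossip_failure_consensus(1024, {1, 2, 3})`, gives a rough feel for how slowly the cycle count grows with system size in this simplified model.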
Asynchronous epidemic algorithms for consistency in large-scale systems
Achieving and detecting a globally consistent state is essential to many services in
large- and extreme-scale distributed systems, especially when the desired consistent state is
critical for service operation. Centralised and deterministic approaches for synchronisation and
distributed consistency are not scalable and not fault-tolerant. Alternatively, epidemic-based
paradigms are decentralised computations based on randomised communications. They are
scalable, resilient, fault-tolerant, and converge to the desired target in logarithmic time with
respect to system size. Thus, many distributed services have adopted epidemic protocols
to achieve the consensus and the consistent state, mainly due to scalability concerns. The
convergence of epidemic protocols is stochastically guaranteed. However, the detection of
the convergence is probabilistic and non-explicit. In a real-world environment, systems are
unreliable, and epidemic protocols may fail to converge to the desired state. Thus, achieving
convergence by itself does not ensure a system-wide consistent state under dynamic
conditions.
The research work presented in this thesis introduces the Phase Transition Algorithm
(PTA) to achieve a distributed consistent state based on the explicit detection of convergence.
Each phase in PTA is a decentralised decision-making process that implements epidemic data
aggregation, in which the detection of convergence implies achieving a global agreement. The
phases in PTA can be cascaded to achieve higher certainty as desired. Following the PTA,
two epidemic protocols, namely PTP and ECP, are proposed to acquire consensus, i.e. for
consistency in data dissemination and data aggregation. The protocols are examined
through simulations, and experimental results have validated the protocols' ability to achieve
and explicitly detect consensus among system nodes.
The research work has also studied epidemic data aggregation under node churn and
network failures; the analysis identified three phases of the aggregation process.
The investigations have shown a different impact of node churn on each phase. The phase
that is critical for the aggregation process has been studied further, which led to the proposal of
new robust data aggregation protocols, REAP and REAP+. Each protocol has a different
decentralised replication method, and both implement distributed failure detection and
instantaneous mass restoration mechanisms. Simulations have validated the protocols, and
results have shown the protocols' ability to converge, detect convergence, and produce competitive
accuracy under various levels of node churn.
Furthermore, distributed consistency in continuous systems is addressed in the research.
The work has proposed a novel continuous epidemic protocol with an adaptive restart
mechanism. The protocol restarts either upon the detection of system convergence or upon
the detection of divergence. Also, the protocol introduces a seed selection method for
the peak data distribution in decentralised approaches, which was a challenge that requires
single-point initialisation and a leader-election step. The simulations validated the performance
of the algorithm under static and dynamic conditions and confirmed that convergence and
divergence detection accuracy can be tuned as desired.
Finally, the research work shows that combining and integrating the proposed protocols
enables extreme-scale distributed systems to achieve and detect globally consistent states even
under realistic and dynamic conditions.
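Epidemic data aggregation with a convergence check, as discussed in this abstract, can be sketched as follows. This is plain push-pull averaging with a simple stopping heuristic, not PTA, PTP, ECP, or REAP: a run stops once no node's estimate changed by more than `epsilon` over a full round. The function name, `epsilon`, and the stopping rule are assumptions made for illustration, and the check is evaluated globally by the simulator.

```python
import random


def push_pull_average(values, epsilon=1e-6, seed=0, max_rounds=1000):
    """Run push-pull averaging until estimates stop changing.

    Returns (estimates, rounds). Each exchange replaces both peers'
    estimates with their mean, so the global sum is preserved.
    """
    rng = random.Random(seed)
    est = [float(v) for v in values]
    n = len(est)
    for rnd in range(1, max_rounds + 1):
        prev = list(est)
        for i in range(n):
            # Node i contacts a uniformly random peer j != i.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            mean = (est[i] + est[j]) / 2
            est[i] = est[j] = mean
        # Heuristic convergence test: largest change over the round.
        if max(abs(a - b) for a, b in zip(est, prev)) < epsilon:
            return est, rnd
    raise RuntimeError("no convergence within max_rounds")
```

The design point worth noting is that the stopping rule only observes that estimates have stopped moving; it does not by itself prove that all nodes agree, which is the gap between convergence and its explicit detection that the abstract describes.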
Tamper-Resistant Peer-to-Peer Storage for File Integrity Checking.
"... but there is no compromise between being honest and being crooked, even if it looks ever so easy, and even if it is ever so common ..." (Wolfgang Ambros, 1975; translated from Austrian German)
One of the activities of most successful intruders of a computer system is to modify data on the victim system, either to hide their presence and destroy the evidence of the break-in, or to subvert the system completely and make it accessible for further abuse without triggering alarms. File integrity checking is one common method to mitigate the effects of successful intrusions by detecting the changes an intruder makes to files on a computer system. Historically, file integrity checking has been implemented using tools that operate locally on a single system, which imposes considerable restrictions on maintenance and scalability. Recent improvements for large-scale environments have introduced trusted central servers that provide secure fingerprint storage and logging facilities, but such centralism presents some new shortcomings.
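The local, single-system form of file integrity checking mentioned in this abstract can be sketched in a few lines. The example below is an illustration only and is not taken from the work itself: the helper names `fingerprint`, `snapshot`, and `compare` are hypothetical, SHA-256 is an assumed choice of hash, and the baseline is held in memory, whereas a real deployment would need to store it somewhere an intruder cannot modify.

```python
import hashlib
from pathlib import Path


def fingerprint(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root):
    """Map each regular file under root to its fingerprint."""
    root = Path(root)
    return {
        str(p.relative_to(root)): fingerprint(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare(baseline, current):
    """Report files that were added, removed, or modified."""
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "modified": sorted(
            name
            for name in set(baseline) & set(current)
            if baseline[name] != current[name]
        ),
    }
```

A typical use is to call `snapshot` once on a known-good system, keep the result, and later pass it to `compare` together with a fresh `snapshot` of the same directory.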
Toward Reliable and Efficient Message Passing Software for HPC Systems: Fault Tolerance and Vector Extension
As the scale of high-performance computing (HPC) systems continues to grow, researchers have devoted themselves to achieving the best performance when running long computing jobs on these systems. My research focuses on the reliability and efficiency of HPC software.
First, as systems become larger, the mean time to failure (MTTF) of these HPC systems is negatively impacted and tends to decrease. Handling system failures becomes a prime challenge. My research aims to present a general design and implementation of an efficient runtime-level failure detection and propagation strategy targeting large-scale, dynamic systems that is able to detect both node and process failures. The design uses multiple overlapping topologies to optimize detection and propagation, minimizing the incurred overhead and guaranteeing the scalability of the entire framework. Results from different machines and benchmarks, compared with related work, show that my design and implementation significantly outperforms non-HPC solutions and is competitive with specialized HPC solutions that can manage only MPI applications.
Second, I endeavor to explore instruction-level parallelism to achieve optimal performance. Novel processors support long vector extensions, which enable researchers to exploit the potential peak performance of target architectures. Intel introduced Advanced Vector Extensions (AVX2 and AVX-512) instructions for the x86 Instruction Set Architecture (ISA). Arm introduced the Scalable Vector Extension (SVE) with a new set of A64 instructions. Both enable greater parallelism. My research utilizes long vector reduction instructions to improve the performance of MPI reduction operations. I also use gather and scatter features to speed up the packing and unpacking operations in MPI. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architectures.
Resilient gossip-inspired all-reduce algorithms for high-performance computing - Potential, limitations, and open questions
We investigate the usefulness of gossip-based reduction algorithms in a high-performance computing (HPC) context. We compare them to state-of-the-art deterministic parallel reduction algorithms in terms of fault tolerance and resilience against silent data corruption (SDC) as well as in terms of performance and scalability. New gossip-based reduction algorithms are proposed, which significantly improve the state-of-the-art in terms of resilience against SDC. Moreover, a new gossip-inspired reduction algorithm is proposed, which promises a much more competitive runtime performance in an HPC context than classical gossip-based algorithms, in particular for low accuracy requirements.
This work has been partially funded by the Spanish Ministry of Science and Innovation [contract TIN2015-65316]; by the Government of Catalonia [contracts 2014-SGR-1051, 2014-SGR-1272]; by the RoMoL ERC Advanced Grant [grant number GA 321253] and by the Vienna Science and Technology Fund (WWTF) through project ICT15-113.
Peer reviewed. Postprint (author's final draft).
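As background for the classical gossip-based reductions this abstract refers to, the sketch below simulates push-sum averaging, a well-known gossip reduction in which every node ends up with an estimate of the global average. It is not one of the new algorithms proposed in the paper and includes no fault or SDC handling; the function name `push_sum_average` and the fixed round count are assumptions for illustration.

```python
import random


def push_sum_average(values, rounds=200, seed=0):
    """Simulate push-sum gossip; return each node's average estimate.

    Every node holds a (sum, weight) pair. Per round it keeps half of
    the pair and pushes the other half to a uniformly random node.
    The ratio sum/weight at each node approaches the global average.
    """
    rng = random.Random(seed)
    n = len(values)
    s = [float(v) for v in values]  # running sums
    w = [1.0] * n                   # running weights
    for _ in range(rounds):
        new_s = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            j = rng.randrange(n)  # the target may be the sender itself
            half_s, half_w = s[i] / 2, w[i] / 2
            # Keep one half locally and push the other half to node j.
            new_s[i] += half_s
            new_w[i] += half_w
            new_s[j] += half_s
            new_w[j] += half_w
        s, w = new_s, new_w
    return [si / wi for si, wi in zip(s, w)]
```

Because total sum and total weight are conserved in every round, a lost or corrupted message breaks the invariant, which is one reason resilience is a central question for this family of algorithms. Running fewer rounds trades accuracy for time, matching the abstract's remark about low accuracy requirements.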
International BEAT-PCD consensus statement for infection prevention and control for primary ciliary dyskinesia in collaboration with ERN-LUNG PCD Core Network and patient representatives
Keywords: Primary ciliary dyskinesia; Consensus; Infection prevention
Introduction In primary ciliary dyskinesia (PCD), impaired mucociliary clearance leads to recurrent airway infections and progressive lung destruction, and concern over chronic airway infection and patient-to-patient transmission is considerable. So far, there has been no defined consensus on how to control infection across centres caring for patients with PCD. Within the BEAT-PCD network, COST Action and ERS CRC, together with the ERN-Lung PCD core, a first initiative has now been taken towards creating such a consensus statement.
Methods A multidisciplinary international PCD expert panel was set up to create a consensus statement for infection prevention and control (IP&C) for PCD, covering diagnostic microbiology, infection prevention for specific pathogens considered indicated for treatment, and segregation aspects. Using a modified Delphi process, consensus on a statement required at least 80% agreement within the PCD expert panel group. Patient organisation representatives were involved throughout the process.
Results We present a consensus statement on 20 IP&C statements for PCD including suggested actions for microbiological identification, indications for treatment of Pseudomonas aeruginosa, Burkholderia cepacia and nontuberculous mycobacteria and suggested segregation aspects aimed to minimise patient-to-patient transmission of infections whether in-hospital, in PCD clinics or wards, or out of hospital at meetings between people with PCD. The statement also includes segregation aspects adapted to the current coronavirus disease 2019 (COVID-19) pandemic.
Conclusion The first ever international consensus statement on IP&C intended specifically for PCD is presented and is targeted at clinicians managing paediatric and adult patients with PCD, microbiologists, patient organisations and not least the patients and their families.
Funding: Ministerstvo Zdravotnictví České Republiky (grant nr. NV-19-07-00210); European Reference Network for Rare Respiratory Diseases (Project ID No 739546); Børnelungefonden; European Cooperation in Science and Technology (COST Action BM 1407); Danmarks Lungeforening; European Respiratory Society (CRC