A Recovery Scheme for Cluster Federations Using Sender-based Message Logging
A cluster federation is a heterogeneous union of clusters, each containing a certain number of processes. An application running in such a computing environment is divided into communicating modules so that these modules can run on different clusters. To achieve fault tolerance, different clusters may employ different checkpointing schemes: for example, some may use coordinated schemes while others use communication-induced schemes. This can complicate the recovery process. In this paper, we address the complex problem of recovery in cluster computing environments. Unlike existing work in this area, the proposed approach handles both inter-cluster orphan messages and inter-cluster lost messages. We first propose an algorithm to determine a recovery line such that no inter-cluster orphan message exists between any pair of the cluster-level checkpoints belonging to the recovery line. The main feature of the proposed algorithm is that it can be executed simultaneously by all clusters in the cluster federation. Next, we apply the sender-based message logging idea to handle all inter-cluster lost messages effectively and ensure correctness of the computation.
Scalable group-based checkpoint/restart for large-scale message-passing systems
The ever-increasing number of processors used in parallel computers is making fault tolerance support in large-scale parallel systems more and more important. We discuss the inadequacies of existing system-level checkpointing solutions for message-passing applications as the system scales up. We analyze the coordination cost and blocking behavior of two current MPI implementations with checkpointing support. A group-based solution combining coordinated checkpointing and message logging is then proposed. Experimental results demonstrate that it performs and scales better than LAM/MPI and MPICH-VCL. To assist group formation, a method to analyze the communication behavior of the application is proposed. ©2008 IEEE.
Performance comparison of hierarchical checkpoint protocols grid computing
A grid infrastructure is a large set of nodes,
geographically distributed and connected by a communication network. In
this context, fault tolerance is a necessity imposed by the
distribution that poses a number of problems related to the
heterogeneity of hardware, operating systems, networks,
middleware, and applications, dynamic resources, scalability,
the lack of common memory, the lack of a common clock, and
asynchronous communication between processes. To improve the
robustness of supercomputing applications in the presence of
failures, many techniques have been developed to provide
resistance to these faults of the system. Fault tolerance is intended
to allow the system to provide service as specified in spite of
occurrences of faults. It appears as an indispensable element in
distributed systems. To meet this need, several techniques have
been proposed in the literature. We will study the protocols based
on rollback recovery. These protocols are classified into two
categories: coordinated checkpointing and rollback protocols and
log-based independent checkpointing protocols or message
logging protocols. However, the performance of a protocol
depends on the characteristics of the system, network and
applications running. Faced with the constraints of large-scale
environments, many algorithms in the literature have proved
inadequate. Given an application environment and a system, it is
not easy to identify the recovery protocol that is most appropriate
for a cluster or hierarchical environment, like grid computing.
While some protocols have been used successfully at small scale,
they are not suitable for use at large scale. Hence there is a need
to implement these protocols in a hierarchical fashion to compare
their performance in grid computing. In this paper, we propose
hierarchical versions of four well-known protocols. We have
implemented these protocols and compared their performance in
clusters and grid computing using the OMNeT++ simulator.
Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication
This paper presents FT-GAIA, a software-based fault-tolerant parallel and
distributed simulation middleware. FT-GAIA has been designed to reliably
handle Parallel And Distributed Simulation (PADS) models, which are needed to
properly simulate and analyze complex systems arising in any kind of scientific
or engineering field. PADS takes advantage of multiple execution units running on
multicore processors, clusters of workstations or HPC systems. However, large
computing systems, such as HPC systems that include hundreds of thousands of
computing nodes, have to handle frequent failures of some components. To cope
with this issue, FT-GAIA transparently replicates simulation entities and
distributes them on multiple execution nodes. This allows the simulation to
tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some
protection against Byzantine failures, since interaction messages among the
simulated entities are replicated as well, so that the receiving entity can
identify and discard corrupted messages. Results from an analytical model and
from an experimental evaluation show that FT-GAIA provides a high degree of
fault tolerance, at the cost of a moderate increase in the computational load
of the execution units. Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731
Author Index
Author Index: CIT Vol. 19 (2011), No 1–
Fault Tolerance for High-Performance Applications Using Structured Parallelism Models
In recent years, parallel computing has increasingly exploited the high-level models of structured parallel programming, of which algorithmic skeletons are an example. This trend has been motivated by the properties of structured parallelism models, which can be used to derive several (static and dynamic) optimizations at various implementation levels. In this thesis we study the properties of structured parallel models that are useful for providing fault tolerance support oriented towards High-Performance applications. This issue has traditionally been addressed in two ways: (i) in the context of unstructured parallelism models (e.g. MPI), whose computation model is essentially based on a distributed set of processes communicating through message passing, with an approach based on checkpointing and rollback recovery or software replication; (ii) in the context of high-level models, based on a specific parallelism model (e.g. data-flow) and/or an implementation model (e.g. master-slave), by introducing specific techniques based on the properties of the programming and computation models themselves. In this thesis we take a step towards a more abstract viewpoint and highlight the properties of structured parallel models that are relevant to fault tolerance. We consider two classes of parallel programs (namely task parallel and data parallel) and introduce fault tolerance support based on checkpointing and rollback recovery. The support is derived according to the high-level properties of the parallel models: we call this derivation specialization of fault tolerance techniques, highlighting the difference from classical solutions supporting structure-unaware computations. As a consequence of this specialization, the introduced fault tolerance techniques can be configured and optimized to meet specific needs at different implementation levels.
That is, the supports we present do not target a single computing platform or a specific class of platforms. Indeed, the specializations are the mechanism for targeting specific issues of the exploited environment and of the implemented applications, through proper choices of the protocols and their configurations.
On Data Dissemination for Large-Scale Complex Critical Infrastructures
Middleware plays a key role in achieving the mission of future large-scale complex critical infrastructures, envisioned as federations of several heterogeneous systems over the Internet. However, available approaches for data dissemination are still inadequate, since they are unable to scale and to jointly assure given QoS properties. In addition, the best-effort delivery strategy of the Internet and the occurrence of node failures further compromise the correct and timely delivery of data if the middleware is not equipped with means for tolerating such failures.
This paper presents a peer-to-peer approach for resilient and scalable data dissemination over large-scale complex critical infrastructures. The approach is based on the adoption of epidemic dissemination algorithms between peer groups, combined with the semi-active replication of group leaders to tolerate failures and assure the resilient delivery of data, despite the increasing scale and heterogeneity of the federated system. The effectiveness of the approach is shown by means of extensive simulation experiments, based on Stochastic Activity Networks.
Portable Checkpointing for Parallel Applications
High Performance Computing (HPC) systems represent the peak of modern computational capability. As
ever-increasing demands for computational power have fuelled the demand for ever-larger computing systems,
modern HPC systems have grown to incorporate hundreds, thousands or as many as 130,000 processors. At these
scales, the huge number of individual components in a single system makes the probability that a single
component will fail quite high, with today's large HPC systems featuring mean times between failures on the
order of hours or a few days. As many modern computational tasks require days or months to complete, fault
tolerance becomes critical to HPC system design.
The past three decades have seen significant amounts of research on parallel system fault tolerance. However,
as most of it has been either theoretical or has focused on low-level solutions that are embedded into a
particular operating system or type of hardware, this work has had little impact on real HPC systems. This
thesis attempts to address this lack of impact by describing a high-level approach for implementing
checkpoint/restart functionality that decouples the fault tolerance solution from the details of the
operating system, system libraries and the hardware and instead connects it to the APIs implemented by the
above components. The resulting solution enables applications that use these APIs to become
self-checkpointing and self-restarting regardless of the software/hardware platform that may implement
the APIs.
The particular focus of this thesis is on the problem of checkpoint/restart of parallel applications. It
presents two theoretical checkpointing protocols, one for the message passing communication model and one for
the shared memory model. The former is the first protocol to be compatible with application-level
checkpointing of individual processes, while the latter is the first protocol that is compatible with
arbitrary shared memory models, APIs, implementations and consistency protocols. These checkpointing
protocols are used to implement checkpointing systems for applications that use the MPI and OpenMP parallel
APIs, respectively, and are first in providing checkpoint/restart to arbitrary implementations of these
popular APIs. Both checkpointing systems are extensively evaluated on multiple software/hardware platforms
and are shown to feature low overheads.