Checkpointing as a Service in Heterogeneous Cloud Environments
A non-invasive, cloud-agnostic approach is demonstrated for extending
existing cloud platforms to include checkpoint-restart capability. Most cloud
platforms currently rely on each application to provide its own fault
tolerance. A uniform mechanism within the cloud itself serves two purposes: (a)
direct support for long-running jobs, which would otherwise require a custom
fault-tolerant mechanism for each application; and (b) the administrative
capability to manage an over-subscribed cloud by temporarily swapping out jobs
when higher priority jobs arrive. An advantage of this uniform approach is that
it also supports parallel and distributed computations, over both TCP and
InfiniBand, thus allowing traditional HPC applications to take advantage of an
existing cloud infrastructure. Additionally, an integrated health-monitoring
mechanism detects when long-running jobs either fail or incur exceptionally low
performance, perhaps due to resource starvation, and proactively suspends the
job. The cloud-agnostic feature is demonstrated by applying the implementation
to two very different cloud platforms: Snooze and OpenStack. The use of a
cloud-agnostic architecture also enables, for the first time, migration of
applications from one cloud platform to another.
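
The paper's health-monitoring interface is not spelled out in the abstract;
the sketch below is a minimal Python illustration of the idea of proactively
suspending a failed or low-performing job. The CloudJob class and its
progress_rate, checkpoint, and suspend methods are hypothetical stand-ins
for whatever glue a given platform (Snooze, OpenStack) actually exposes.

    import time

    LOW_PERF_FRACTION = 0.5   # assumed threshold: flag jobs below half speed
    POLL_INTERVAL_S = 60

    class CloudJob:
        """Hypothetical handle to a long-running cloud job; stands in
        for platform-specific glue (e.g., for Snooze or OpenStack)."""

        def is_alive(self) -> bool:
            raise NotImplementedError

        def progress_rate(self) -> float:
            """Observed progress per second (e.g., iterations completed)."""
            raise NotImplementedError

        def expected_rate(self) -> float:
            raise NotImplementedError

        def checkpoint(self) -> None:
            """Ask the checkpointing service to snapshot the job."""
            raise NotImplementedError

        def suspend(self) -> None:
            raise NotImplementedError

    def monitor(job: CloudJob) -> None:
        """Suspend a job that runs exceptionally slowly, perhaps due to
        resource starvation, after preserving its completed work."""
        while job.is_alive():
            if job.progress_rate() < LOW_PERF_FRACTION * job.expected_rate():
                job.checkpoint()   # save completed work first
                job.suspend()      # resources freed; job resumable later
                return
            time.sleep(POLL_INTERVAL_S)

The same checkpoint-then-suspend hook is what would let an administrator
temporarily swap out low-priority jobs on an over-subscribed cloud.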
Algorithmic Based Fault Tolerance Applied to High Performance Computing
We present a new approach to fault tolerance for High Performance Computing
systems. Our approach is based on a careful adaptation of the Algorithmic
Based Fault Tolerance technique (Huang and Abraham, 1984) to the needs of
parallel distributed computation. We obtain a strongly scalable mechanism
for fault tolerance that can also detect and correct errors (bit flips) on
the fly during a computation. To assess the viability of our approach, we
have developed a fault-tolerant matrix-matrix multiplication subroutine and
propose models to predict its running time. Our parallel fault-tolerant
matrix-matrix multiplication achieves 1.4 TFLOPS on 484 processors (the
jacquard.nersc.gov cluster) and returns a correct result even when a
process failure occurs. This represents 65% of the machine's peak
performance, with less than 12% overhead with respect to the fastest
failure-free implementation. We predict (and have observed) that, as the
processor count increases, the overhead of the fault tolerance drops
significantly.
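
As a single-node illustration of the underlying Huang and Abraham scheme
(the paper's contribution is adapting it to parallel, distributed
computation with process failures), the NumPy sketch below appends checksum
rows and columns so that one corrupted entry of the product can be located
and corrected on the fly; the error-injection hook is purely illustrative.

    import numpy as np

    def abft_matmul(A, B, inject_error_at=None):
        """Checksum-protected multiply of square matrices
        (Huang & Abraham, 1984); inject_error_at simulates a bit flip."""
        n = A.shape[0]
        e = np.ones((1, n))
        Ac = np.vstack([A, e @ A])        # extra row: column checksums of A
        Br = np.hstack([B, B @ e.T])      # extra column: row checksums of B
        Cf = Ac @ Br                      # checksums carry through the product
        if inject_error_at is not None:   # simulate a soft error in C
            Cf[inject_error_at] += 1.0
        C = Cf[:n, :n].copy()
        row_res = Cf[:n, n] - C.sum(axis=1)   # row-checksum residuals
        col_res = Cf[n, :n] - C.sum(axis=0)   # column-checksum residuals
        bad_r = np.flatnonzero(~np.isclose(row_res, 0.0))
        bad_c = np.flatnonzero(~np.isclose(col_res, 0.0))
        if bad_r.size == 1 and bad_c.size == 1:
            i, j = bad_r[0], bad_c[0]
            C[i, j] += row_res[i]         # locate and correct the bad entry
        return C

    rng = np.random.default_rng(0)
    A, B = rng.random((4, 4)), rng.random((4, 4))
    assert np.allclose(abft_matmul(A, B, inject_error_at=(1, 2)), A @ B)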
A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud
High Performance Computing (HPC) systems are widely used by scientists and researchers in both industry and university laboratories to solve advanced computational problems. Most of these problems are either data-intensive or computation-intensive and may take hours, days, or even weeks to complete; some traditional HPC computations run on 100,000 processors for weeks. Consequently, traditional HPC systems often require huge capital investments, and scientists and researchers sometimes have to wait in long queues to access these shared, expensive systems. Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some computation-intensive applications that are usually executed on traditional HPC systems can now be executed in the cloud, and the cloud pricing model eliminates the huge up-front investment. However, even for cloud-based HPC systems, fault tolerance remains a growing concern: the large number of virtual machines and electronic components, software complexity, and overall system reliability, availability, and serviceability (RAS) are all factors with which HPC systems in the cloud must contend. The reactive fault tolerance approach of checkpoint/restart, commonly used in HPC systems, does not scale well in the cloud because of resource sharing and the distributed nature of cloud networks. Hence, the need for reliable, fault-tolerant HPC systems is even greater in a cloud environment. In this thesis we present a proactive fault tolerance approach for HPC systems in the cloud that reduces wall-clock execution time, as well as dollar cost, in the presence of hardware failures. We develop a generic fault tolerance algorithm for HPC systems in the cloud, together with a cost model for executing computation-intensive applications on them. Our experimental results, obtained in a real cloud execution environment, show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared to the checkpoint and redundancy techniques used in traditional HPC systems.
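
The abstract leaves the generic algorithm unspecified; the fragment below
is a hedged Python sketch of the proactive pattern it describes: watch
failure-predictive health signals and migrate work away from a suspect node
before it fails, instead of rolling back afterwards. The node_health and
migrate_vm functions and the thresholds are illustrative assumptions, not
the thesis's actual interface.

    import time

    CPU_TEMP_LIMIT_C = 85.0   # assumed failure-predictive thresholds
    FAN_MIN_RPM = 1500

    def node_health(node: str) -> dict:
        """Hypothetical probe of hardware health indicators
        (e.g., IPMI sensor readings); not a real API."""
        raise NotImplementedError

    def migrate_vm(vm: str, src: str, dst: str) -> None:
        """Hypothetical live migration of the job's VM to a healthy node."""
        raise NotImplementedError

    def proactive_monitor(vm, src, spare_nodes, poll_s=30):
        """Move the job ahead of a predicted failure, avoiding the
        rollback and lost wall-clock time of reactive checkpoint/restart."""
        while spare_nodes:
            h = node_health(src)
            if h["cpu_temp_c"] > CPU_TEMP_LIMIT_C or h["fan_rpm"] < FAN_MIN_RPM:
                dst = spare_nodes.pop()
                migrate_vm(vm, src, dst)   # job keeps running on dst
                src = dst
            time.sleep(poll_s)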
Performance Evaluation of Checkpoint/Restart Techniques
Distributed applications running in a large cluster environment, such as on
cloud instances, achieve shorter execution times. However, an application
might suffer sudden termination due to unpredicted compute-node failures,
thus losing the whole computation. Checkpoint/restart is a fault tolerance
technique used to solve this problem. In this work we evaluated the
performance of two of the most commonly used checkpoint/restart techniques:
Distributed Multithreaded Checkpointing (DMTCP) and the Berkeley Lab
Checkpoint/Restart library (BLCR) integrated into the Open MPI framework.
We aimed to test their validity and evaluate their performance in both
local and Amazon Elastic Compute Cloud (EC2) environments, with EC2 serving
as a well-known proprietary cloud computing service. The results were
compared in terms of checkpoint and restart times, data scalability, and
compute-process scalability. The findings show that DMTCP outperforms BLCR
in the checkpoint and restart speed, data scalability, and compute-process
scalability experiments.
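
For concreteness, a checkpoint-time measurement of the kind reported here
might be scripted as below for DMTCP, assuming a standard installation with
dmtcp_launch, dmtcp_command, and dmtcp_restart on the PATH (BLCR would use
cr_checkpoint and cr_restart analogously); ./my_mpi_app is a placeholder
for the benchmarked application.

    import subprocess
    import time

    # Launch the application under DMTCP control; dmtcp_launch starts a
    # coordinator automatically if none is running.
    app = subprocess.Popen(["dmtcp_launch", "./my_mpi_app"])

    time.sleep(300)  # let the job make some progress first

    # Request a checkpoint of all processes and time it. Depending on the
    # DMTCP version, --checkpoint may return once the request is accepted,
    # so treat this as an approximate measurement.
    t0 = time.time()
    subprocess.run(["dmtcp_command", "--checkpoint"], check=True)
    print(f"checkpoint completed in {time.time() - t0:.2f}s")

    # After a failure, the job would be restarted from the images, e.g.:
    #   dmtcp_restart ckpt_*.dmtcp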
A Study of Checkpointing in Large Scale Training of Deep Neural Networks
Deep learning (DL) applications are increasingly being deployed on HPC
systems to leverage their massive parallelism and computing power for DL
model training. While significant effort has been put into facilitating
distributed training in DL frameworks, fault tolerance has been largely
ignored. In this work, we evaluate checkpoint-restart, a common fault
tolerance technique in HPC workloads. We perform experiments with three
state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and
TensorFlow). We evaluate the computational cost of checkpointing, file
formats and file sizes, the impact of scale, and deterministic
checkpointing. Our evaluation reveals critical differences among the
checkpoint mechanisms and exposes several bottlenecks in existing
checkpointing implementations. We provide discussion points that can aid
users in selecting a fault-tolerant framework for use in HPC, as well as
takeaway points that framework developers can use to improve checkpointing
of DL workloads in HPC.
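
As one concrete instance of the checkpoint cost being measured, the sketch
below times a standard PyTorch checkpoint (one of the three frameworks
studied). The file name and the point in the training loop where it is
called are assumptions; Chainer and TensorFlow expose different mechanisms
and formats, which is part of what the study compares.

    import time
    import torch

    def save_checkpoint(model, optimizer, epoch, path="ckpt.pt"):
        """Write a training checkpoint and return the time it took
        (the checkpoint-cost metric evaluated in the study)."""
        t0 = time.time()
        torch.save({"epoch": epoch,
                    "model": model.state_dict(),
                    "optim": optimizer.state_dict()}, path)
        return time.time() - t0

    def load_checkpoint(model, optimizer, path="ckpt.pt"):
        """Restore model and optimizer state after a failure."""
        state = torch.load(path)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optim"])
        return state["epoch"]

    # Inside a training loop, e.g. once per epoch:
    #     cost = save_checkpoint(model, optimizer, epoch)
    #     print(f"checkpoint took {cost:.2f}s")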