20 research outputs found
A fault-tolerance protocol for parallel applications with communication imbalance
The predicted failure rates of future supercomputers
loom over the groundbreaking research large machines are
expected to foster. Therefore, resilient extreme-scale applications
are an absolute necessity to effectively use the new generation
of supercomputers. Rollback-recovery techniques have been
traditionally used in HPC to provide resilience. Among those
techniques, message logging provides the appealing features of
saving energy, accelerating recovery, and having low performance
penalty. Its increased memory consumption is, however, an
important downside. This paper introduces memory-constrained
message logging (MCML), a general framework for decreasing the
memory footprint of message-logging protocols. In particular, we
demonstrate the effectiveness of MCML in keeping message
logging feasible for applications with substantial communication
imbalance. This type of application appears in many scientific
fields. We present experimental results with several parallel codes
running on up to 4,096 cores. Using those results and an analytical
model, we predict MCML can reduce execution time by up to 25%
and energy consumption by up to 15% at extreme scale.
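To illustrate the general idea of bounding a message log's memory footprint, here is a minimal Python sketch of a sender-side log with a byte budget. It is not the MCML protocol from the paper, whose details the abstract does not give. The class `SenderLog`, its methods, and the policy of asking the destination with the largest log to checkpoint are all assumptions made for this example.

```python
from collections import defaultdict


class SenderLog:
    """Hypothetical sender-side message log with a byte budget."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.logs = defaultdict(list)  # destination -> [(seq, payload)]
        self.used = 0                  # total bytes currently logged
        self.forced = []               # destinations asked to checkpoint

    def bytes_for(self, dest):
        """Bytes currently held in the log for one destination."""
        return sum(len(payload) for _, payload in self.logs[dest])

    def record(self, dest, seq, payload):
        """Log an outgoing message, then enforce the memory budget."""
        self.logs[dest].append((seq, payload))
        self.used += len(payload)
        # Assumed policy: while over budget, the destination holding the
        # most log bytes takes a (simulated) checkpoint, which makes the
        # messages already sent to it unnecessary for its recovery.
        while self.used > self.budget and any(self.logs.values()):
            heaviest = max(self.logs, key=self.bytes_for)
            self.forced.append(heaviest)
            last_seq = self.logs[heaviest][-1][0]
            self.receiver_checkpointed(heaviest, last_seq)

    def receiver_checkpointed(self, dest, upto_seq):
        """Discard messages the receiver's checkpoint already covers."""
        kept = [(s, p) for s, p in self.logs[dest] if s > upto_seq]
        freed = self.bytes_for(dest) - sum(len(p) for _, p in kept)
        self.logs[dest] = kept
        self.used -= freed

    def replay(self, dest):
        """Messages to resend if `dest` fails and rolls back."""
        return list(self.logs[dest])
```

In this sketch the trade-off is explicit: a smaller budget keeps memory low but forces checkpoints more often.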
Performance Evaluation of Checkpoint/Restart Techniques
Distributed applications running on a large cluster environment, such as
cloud instances, will have shorter execution times. However, the application
might suffer from sudden termination due to unpredicted computing node
failures, thus losing the whole computation. Checkpoint/restart is a fault
tolerance technique used to solve this problem. In this work we evaluated the
performance of two of the most commonly used checkpoint/restart techniques
(Distributed Multithreaded Checkpointing (DMTCP) and Berkeley Lab
Checkpoint/Restart library (BLCR) integrated into the OpenMPI framework). We
aimed to test their validity and evaluate their performance in both local and
Amazon Elastic Compute Cloud (EC2) environments. The experiments were conducted
on Amazon EC2 as a well-known proprietary cloud computing service provider.
Results obtained were reported and compared to evaluate checkpoint and restart
time values, data scalability and compute processes scalability. The findings
showed that DMTCP performs better than BLCR in the checkpoint and restart speed,
data scalability and compute processes scalability experiments.
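The checkpoint/restart idea can be sketched at application level in Python. DMTCP and BLCR checkpoint processes transparently at the system level, so this sketch does not use or imitate their interfaces. The function names, the pickle file format, and the `fail_at` parameter used to simulate a node failure are assumptions made for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over path atomically."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_sum(n, path, interval=100, fail_at=None):
    """Sum 0..n-1, checkpointing every `interval` iterations.

    If a checkpoint exists at `path`, the computation resumes from it
    instead of starting over. `fail_at` simulates a crash at that iteration.
    """
    state = load_checkpoint(path) or {"i": 0, "total": 0}
    while state["i"] < n:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

A run that fails at iteration 550 with `interval=100` loses only the work done since iteration 500. Calling `run_sum` again with the same path resumes from that checkpoint.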
Support for adaptivity in ARMCI using migratable objects
Many new paradigms of parallel programming have emerged that compete with and complement the standard and well-established MPI model. Most notable and successful among these are models that support some form of global address space. At the same time, approaches based on migratable objects (also called virtualized processes) have shown that resource management concerns can be separated effectively from the overall parallel programming effort. For example, Charm++ supports dynamic load balancing via an intelligent adaptive run-time system. It is also becoming clear that a multi-paradigm approach that allows modules written in one or more paradigms to coexist and cooperate will be necessary to tame the parallel programming challenge. ARMCI is a remote memory copy library that serves as a foundation of many global address space languages and libraries. This paper presents our preliminary work on integrating and supporting ARMCI with the adaptive run-time system of Charm++ as a part of our overall effort in the multi-paradigm approach.
Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing
Large applications executing on Grid or cluster architectures consisting of hundreds or thousands of computational nodes create problems with respect to reliability. The sources of the problems are node failures and the need for dynamic configuration over extensive runtime. This paper presents two fault-tolerance mechanisms called Theft-Induced Checkpointing and Systematic Event Logging. These are transparent protocols capable of overcoming problems associated with both benign faults, i.e., crash faults, and node or subnet volatility. Specifically, the protocols base the state of the execution on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous systems as well as multithreaded applications. By allowing recovery even under different numbers of processors, the approaches are especially suitable for applications with a need for adaptive or reactionary configuration control. The low-cost protocols offer the capability of controlling or bounding the overhead. A formal cost model is presented, followed by an experimental evaluation. It is shown that the overhead of the protocol is very small, and the maximum work lost by a crashed process is small and bounded. Index Terms: Grid computing, rollback recovery, checkpointing, event logging.
Extending Scojo-PECT by migration based on system-level checkpointing
In recent years, a significant amount of research has been done on job scheduling in the high-performance computing area. Parallel jobs have different running times and require different numbers of processors, so jobs need to be scheduled and packed to improve system utilization. Scojo-PECT is a job scheduler that provides service guarantees by using coarse-grain time sharing. However, Scojo-PECT does not provide process migration. We extend Scojo-PECT by migrating parallel jobs based on system-level checkpointing. We investigate different cases in the Scojo-PECT scheduling algorithm where migration based on system-level checkpointing can be used to improve resource utilization and reduce job response time. Our experimental results show a reduction in relative response times for medium jobs compared with the original Scojo-PECT scheduler, while long jobs do not suffer any disadvantage.
Composing resilience techniques: ABFT, periodic and incremental checkpointing
An electrochemical sensor is described for the determination of L-dopa (levodopa; 3,4-dihydroxyphenylalanine). An inkjet-printed carbon nanotube (IJPCNT) electrode was modified with manganese dioxide microspheres by drop-casting. The coating was characterized by field emission scanning electron microscopy, Fourier-transform infrared spectroscopy and X-ray powder diffraction. The sensor, best operated at a working voltage of 0.3 V, has a linear response in the 0.1 to 10 μM L-dopa concentration range, a 54 nM detection limit, and excellent reproducibility, repeatability and selectivity. The amperometric approach was applied to the determination of L-dopa in spiked biological fluids and displayed satisfactory accuracy and precision. Graphical abstract: Schematic representation of an amperometric method for the determination of L-dopa. It is based on the use of an inkjet-printed carbon nanotube (IJPCNT) electrode modified with manganese dioxide (MnO2).