420 research outputs found
CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance
In order to efficiently use the future generations of supercomputers, fault
tolerance and power consumption are two of the prime challenges anticipated by
the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has
been and still is the most widely used technique to deal with hard failures.
Application-level CR is the most effective CR technique in terms of overhead
efficiency but it takes a lot of implementation effort. This work presents the
implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic
Fault Tolerance), which serves two purposes. First, it provides an extendable
library that significantly eases the implementation of application-level
checkpointing. The most basic and frequently used checkpoint data types are
already part of CRAFT and can be directly used out of the box. The library can
be easily extended to add more data types. As means of overhead reduction, the
library offers a build-in asynchronous checkpointing mechanism and also
supports the Scalable Checkpoint/Restart (SCR) library for node level
checkpointing. Second, CRAFT provides an easier interface for User-Level
Failure Mitigation (ULFM) based dynamic process recovery, which significantly
reduces the complexity and effort of failure detection and communication
recovery mechanism. By utilizing both functionalities together, applications
can write application-level checkpoints and recover dynamically from process
failures with very limited programming effort. This work presents the design
and use of our library in detail. The associated overheads are thoroughly
analyzed using several benchmarks
09201 Abstracts Collection -- Self-Healing and Self-Adaptive Systems
From May 10th 2009 to May 15th 2009 the Dagstuhl Seminar 09201 ``Self-Healing and Self-Adaptive Systems\u27\u27 was held in Schloss Dagstuhl~--~Leibniz Center for Informatics.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar are put together in this paper. Links to extended abstracts or full papers are provided, if available.
A description of the seminar topics, goals and results in general can be found in
a separate document ``Executive Summary\u27\u27
Finding real bugs in big programs with incorrectness logic
Incorrectness Logic (IL) has recently been advanced as a logical theory for compositionally proving the presence of bugs—dual to Hoare Logic, which is used to compositionally prove their absence. Though IL was motivated in large part by the aim of providing a logical foundation for bug-catching program analyses, it has remained an open question: is IL useful only retrospectively (to explain existing analyses), or can it actually be useful in developing new analyses which can catch real bugs in big programs? In this work, we develop Pulse-X, a new, automatic program analysis for catching memory errors, based on ISL, a recent synthesis of IL and separation logic. Using Pulse-X, we have found 15 new real bugs in OpenSSL, which we have reported to OpenSSL maintainers and have since been fixed. In order not to be overwhelmed with potential but false error reports, we develop a compositional bug-reporting criterion based on a distinction between latent and manifest errors, which references the under-approximate ISL abstractions computed by Pulse-X, and we investigate the fix rate resulting from application of this criterion. Finally, to probe the potential practicality of our bug-finding method, we conduct a comparison to Infer, a widely used analyzer which has proven useful in industrial engineering practice
SimGrid MC: Verification Support for a Multi-API Simulation Platform
The original publication is available at www.springerlink.comInternational audienceSimGrid MC is a stateless model checker for distributed systems that is part of the SimGrid Simulation Framework. It verifies implementations of distributed algorithms, written in C and using any of several communication APIs provided by the simulator. Because the model checker is fully integrated in the simulator that programmers use to validate their implementations, they gain powerful verification capabilities without having to adapt their code. We describe the architecture of SimGrid MC, and show how it copes with the state space explosion problem. In particular, we argue that a generic Dynamic Partial Order Reductions algorithm is effective for handling the different communication APIs that are provided by SimGrid. As a case study, we verify an implementation of Chord, where SimGrid MC helped us discover an intricate bug in a matter of seconds
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016) Timisoara, Romania. February 8-11, 2016.The PhD Symposium was a very good opportunity for the young researchers to share information and knowledge, to
present their current research, and to discuss topics with other students in order to look for synergies and common research
topics. The idea was very successful and the assessment made by the PhD Student was very good. It also helped to
achieve one of the major goals of the NESUS Action: to establish an open European research network targeting sustainable
solutions for ultrascale computing aiming at cross fertilization among HPC, large scale distributed systems, and big
data management, training, contributing to glue disparate researchers working across different areas and provide a meeting
ground for researchers in these separate areas to exchange ideas, to identify synergies, and to pursue common activities in
research topics such as sustainable software solutions (applications and system software stack), data management, energy
efficiency, and resilience.European Cooperation in Science and Technology. COS
- …