CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

Authors: Hager Georg; Kreutzer Moritz; Shahzad Faisal; Thies Jonas; Wellein Gerhard; Zeiser Thomas
Publication date: 07/08/2017

To use future generations of supercomputers efficiently, the High Performance Computing (HPC) community anticipates fault tolerance and power consumption as two of the prime challenges. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most efficient CR technique in terms of overhead, but it requires considerable implementation effort. This work presents the implementation of our C++-based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be used out of the box. The library can be easily extended to add more data types. As a means of overhead reduction, the library offers a built-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node-level checkpointing. Second, CRAFT provides an easier interface for User-Level Failure Mitigation (ULFM) based dynamic process recovery, which significantly reduces the complexity and effort of the failure detection and communication recovery mechanisms. By using both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the design and use of our library in detail. The associated overheads are thoroughly analyzed using several benchmarks.

arXiv.org e-Print Archive

Institute of Transport Research:Publications

Operating System Support for Redundant Multithreading

Author: Döbel Björn
Publication date: 25/11/2014

Failing hardware is a fact, and trends in microprocessor design indicate that the fraction of hardware suffering from permanent and transient faults will continue to increase in future chip generations. Researchers have proposed various solutions to this issue, each with different downsides: Specialized hardware components make hardware more expensive in production and consume additional energy at runtime. Fault-tolerant algorithms and libraries enforce specific programming models on the developer. Compiler-based fault tolerance requires the source code for all applications to be available for recompilation. In this thesis I present ASTEROID, an operating system architecture that integrates applications with different reliability needs. ASTEROID is built on top of the L4/Fiasco.OC microkernel and extends the system with Romain, an operating system service that transparently replicates user applications. Romain supports single- and multi-threaded applications without requiring access to the application's source code. Romain replicates applications and their resources completely and thereby does not rely on hardware extensions, such as ECC-protected memory. In my thesis I describe how to efficiently implement replication as a form of redundant multithreading in software. I develop mechanisms to manage replica resources and to make multi-threaded programs behave deterministically for replication. I furthermore present an approach to handle applications that use shared-memory channels with other programs. My evaluation shows that Romain provides 100% error detection and more than 99.6% error correction for single-bit flips in memory and general-purpose registers. At the same time, Romain's execution time overhead is below 14% for single-threaded applications running in triple-modular redundant mode.
The last part of my thesis acknowledges that software-implemented fault tolerance methods often rely on the correct functioning of a certain set of hardware and software components, the Reliable Computing Base (RCB). I introduce the concept of the RCB and discuss what constitutes the RCB of the ASTEROID system and other fault tolerance mechanisms. Thereafter I show three case studies that evaluate approaches to protecting RCB components and thereby aim to achieve a software stack that is fully protected against hardware errors.

Technische Universität Dresden: Qucosa

CSP Hybrid Space Computing for STP-H5/ISEM on ISS

Authors: Beck Jaclyn; Coole James; Crum Gary; Flatley Tom; Gauvin Patrick; George Alan; MacKinnon James; Stewart Jacob; Stoddard Aaron; Timmons Elizabeth; Urriste Jonathan; Wilson Christopher; Wirthlin Mike; Wilson Alex
Publication venue: DigitalCommons@USU
Publication date: 12/08/2015

The Space Test Program (STP) at the Department of Defense (DoD) supports the development, evaluation, and advancement of new technologies needed for the future of spaceflight. STP-Houston provides opportunities for DoD and civilian space agencies to perform on-orbit research and technology demonstrations from the International Space Station (ISS). The STP-H5/ISEM (STP-Houston 5, ISS SpaceCube Experiment Mini) payload is scheduled for launch on the upcoming SpaceX 10 mission and will feature new technologies, including a hybrid space computer known as the CHREC Space Processor (CSP), developed by the NSF CHREC Center in close collaboration with the NASA SpaceCube Team. In this paper, we present the novel concepts behind CSP and the CSPv1 flight technologies on the ISEM mission. The ISEM-CSP system was subjected to environmental testing, including a thermal vacuum test, a vibration test, and two radiation tests; the encouraging results are presented. Primary objectives for ISEM-CSP are highlighted, which include processing, compression, and downlink of terrestrial-scene images for display on Earth, and monitoring of upset rates in various subsystems to provide environmental information for future missions. Secondary objectives are also presented, including experiments with features for fault-tolerant computing, reliable middleware services, FPGA partial reconfiguration, device virtualization, and dynamic synthesis.

DigitalCommons@USU

SMCV: a Methodology for Detecting Transient Faults in Multicore Clusters

Authors: De Giusti Armando Eduardo; Frati Fernando Emmanuel; Luquet Emilio; Montezanti Diego Miguel; Naiouf Marcelo; Rexachs del Rosario Dolores
Publication date: 22/05/2020

Performance improvements in current processors are achieved by increasing the integration scale. This brings a growing vulnerability to transient faults, whose impact increases on multicore clusters running large scientific parallel applications. The need to enhance the reliability of these systems, coupled with the high cost of rerunning an application from the beginning, motivates specific software strategies for the target systems. This paper introduces SMCV, a fully distributed technique that provides fault detection for message-passing parallel applications by validating the contents of the messages to be sent, preventing the transmission of errors to other processes and leveraging the intrinsic hardware redundancy of the multicore. SMCV achieves broad robustness against transient faults with reduced overhead, and strikes a trade-off between moderate detection latency and low additional workload.
Instituto de Investigación en Informátic

Servicio de Difusión de la Creación Intelectual

Secure and efficient application monitoring and replication

Authors: Coppens Bart; De Sutter Bjorn; Franz Michael; Homescu Andrei; Larsen Per; Volckaert Stijn; Voulimeneas Alexios
Publication venue: Usenix Assoc
Publication date: 01/01/2016

Memory corruption vulnerabilities remain a grave threat to systems software written in C/C++. Current best practices dictate compiling programs with exploit mitigations such as stack canaries, address space layout randomization, and control-flow integrity. However, adversaries quickly find ways to circumvent such mitigations, sometimes even before these mitigations are widely deployed. In this paper, we focus on an "orthogonal" defense that amplifies the effectiveness of traditional exploit mitigations. The key idea is to create multiple diversified replicas of a vulnerable program and then execute these replicas in lockstep on identical inputs while simultaneously monitoring their behavior. A malicious input that causes the diversified replicas to diverge in their behavior will be detected by the monitor; this allows discovery of previously unknown attacks such as zero-day exploits. So far, such multi-variant execution environments (MVEEs) have been held back by substantial runtime overheads. This paper presents a new design, ReMon, that is non-intrusive, secure, and highly efficient. Whereas previous schemes either monitor every system call or none at all, our system enforces cross-checking only for security-critical system calls while supporting more relaxed monitoring policies for system calls that are not security critical. We achieve this by splitting the monitoring and replication logic into an in-process component and a cross-process component. Our evaluation shows that ReMon offers the same level of security as conservative MVEEs and runs realistic server benchmarks at near-native speeds.

Ghent University Academic Bibliography

Spartan Daily, February 17, 1997

Author: San Jose State University School of Journalism and Mass Communications
Publication venue: SJSU ScholarWorks
Publication date: 17/02/1997

Volume 108, Issue 17
https://scholarworks.sjsu.edu/spartandaily/9094/thumbnail.jp

SJSU ScholarWorks

Taming parallelism in a multi-variant execution environment

Authors: Coppens Bart; De Bosschere Koen; De Sutter Bjorn; Franz Michael; Larsen Per; Volckaert Stijn
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2017

Ghent University Academic Bibliography