23,673 research outputs found

    CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

    Get PDF
    In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency but it takes a lot of implementation effort. This work presents the implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be directly used out of the box. The library can be easily extended to add more data types. As means of overhead reduction, the library offers a build-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node level checkpointing. Second, CRAFT provides an easier interface for User-Level Failure Mitigation (ULFM) based dynamic process recovery, which significantly reduces the complexity and effort of failure detection and communication recovery mechanism. By utilizing both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the design and use of our library in detail. The associated overheads are thoroughly analyzed using several benchmarks

    Technical Report on Deploying a highly secured OpenStack Cloud Infrastructure using BradStack as a Case Study

    Full text link
    Cloud computing has emerged as a popular paradigm and an attractive model for providing a reliable distributed computing model.it is increasing attracting huge attention both in academic research and industrial initiatives. Cloud deployments are paramount for institution and organizations of all scales. The availability of a flexible, free open source cloud platform designed with no propriety software and the ability of its integration with legacy systems and third-party applications are fundamental. Open stack is a free and opensource software released under the terms of Apache license with a fragmented and distributed architecture making it highly flexible. This project was initiated and aimed at designing a secured cloud infrastructure called BradStack, which is built on OpenStack in the Computing Laboratory at the University of Bradford. In this report, we present and discuss the steps required in deploying a secured BradStack Multi-node cloud infrastructure and conducting Penetration testing on OpenStack Services to validate the effectiveness of the security controls on the BradStack platform. This report serves as a practical guideline, focusing on security and practical infrastructure related issues. It also serves as a reference for institutions looking at the possibilities of implementing a secured cloud solution.Comment: 38 pages, 19 figures

    DeSyRe: on-Demand System Reliability

    No full text
    The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design, in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect and fault-free system, would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault-tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints

    Computational needs survey of NASA automation and robotics missions. Volume 1: Survey and results

    Get PDF
    NASA's operational use of advanced processor technology in space systems lags behind its commercial development by more than eight years. One of the factors contributing to this is that mission computing requirements are frequently unknown, unstated, misrepresented, or simply not available in a timely manner. NASA must provide clear common requirements to make better use of available technology, to cut development lead time on deployable architectures, and to increase the utilization of new technology. A preliminary set of advanced mission computational processing requirements of automation and robotics (A&R) systems are provided for use by NASA, industry, and academic communities. These results were obtained in an assessment of the computational needs of current projects throughout NASA. The high percent of responses indicated a general need for enhanced computational capabilities beyond the currently available 80386 and 68020 processor technology. Because of the need for faster processors and more memory, 90 percent of the polled automation projects have reduced or will reduce the scope of their implementation capabilities. The requirements are presented with respect to their targeted environment, identifying the applications required, system performance levels necessary to support them, and the degree to which they are met with typical programmatic constraints. Volume one includes the survey and results. Volume two contains the appendixes
    • …
    corecore