1,000 research outputs found

    CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

    Get PDF
    In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency but it takes a lot of implementation effort. This work presents the implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be directly used out of the box. The library can be easily extended to add more data types. As means of overhead reduction, the library offers a build-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node level checkpointing. Second, CRAFT provides an easier interface for User-Level Failure Mitigation (ULFM) based dynamic process recovery, which significantly reduces the complexity and effort of failure detection and communication recovery mechanism. By utilizing both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the design and use of our library in detail. The associated overheads are thoroughly analyzed using several benchmarks

    A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

    Full text link
    Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques.Comment: 11 page

    Systematic composition of distributed objects: Processes and sessions

    Get PDF
    We consider a system with the infrastructure for the creation and interconnection of large numbers of distributed persistent objects. This system is exemplified by the Internet: potentially, every appliance and document on the Internet has both persistent state and the ability to interact with large numbers of other appliances and documents on the Internet. This paper elucidates the characteristics of such a system, and proposes the compositional requirements of its corresponding infrastructure. We explore the problems of specifying, composing, reasoning about and implementing applications in such a system. A specific concern of our research is developing the infrastructure to support structuring distributed applications by using sequential, choice and parallel composition, in the anarchic environment where application compositions may be unforeseeable and interactions may be unknown prior to actually occurring. The structuring concepts discussed are relevant to a wide range of distributed applications; our implementation is illustrated with collaborative Java processes interacting over the Internet, but the methodology provided can be applied independent of specific platforms

    Condor services for the Global Grid:interoperability between Condor and OGSA

    Get PDF
    In order for existing grid middleware to remain viable it is important to investigate their potentialfor integration with emerging grid standards and architectural schemes. The Open Grid ServicesArchitecture (OGSA), developed by the Globus Alliance and based on standard XML-based webservices technology, was the first attempt to identify the architectural components required tomigrate towards standardized global grid service delivery. This paper presents an investigation intothe integration of Condor, a widely adopted and sophisticated high-throughput computing softwarepackage, and OGSA; with the aim of bringing Condor in line with advances in Grid computing andprovide the Grid community with a mature suite of high-throughput computing job and resourcemanagement services. This report identifies mappings between elements of the OGSA and Condorinfrastructures, potential areas of conflict, and defines a set of complementary architectural optionsby which individual Condor services can be exposed as OGSA Grid services, in order to achieve aseamless integration of Condor resources in a standardized grid environment

    Asynchronous Validity Resolution in Sequentially Consistent Shared Virtual Memory

    Get PDF
    Shared Virtual Memory (SVM) is an effort to provide a mechanism for a distributed system, such as a cluster, to execute shared memory parallel programs. Unfortunately, SVM has performance problems due to its underlying distributed architecture. Recent developments have increased performance of SVM by reducing communication. Unfortunately this performance gain was only possible by increasing programming complexity and by restricting the types of programs allowed to execute in the system. Validity resolution is the process of resolving the validity of a memory object such as a page. Current SVM systems use synchronous or deferred validity resolution techniques in which user processing is blocked during the validity resolution process. This is the case even when resolving validity of false shared variables. False-sharing occurs when two or more processes access unrelated variables stored within the same shared block of memory and at least one of the processes is writing. False sharing unnecessarily reduces overall performance of SVM systems?because user processing is blocked during validity resolution although no actual data dependencies exist. This thesis presents Asynchronous Validity Resolution (AVR), a new approach to SVM which reduces the performance losses associated with false sharing while maintaining the ease of programming found with regular shared memory parallel programming methodology. Asynchronous validity resolution allows concurrent user process execution and data validity resolution. AVR is evaluated by com-paring performance of an application suite using both an AVR sequentially con-sistent SVM system and a traditional sequentially consistent (SC) SVM system. The results show that AVR can increase performance over traditional sequentially consistent SVM for programs which exhibit false sharing. Although AVR outperforms regular SC by as much as 26%, performance of AVR is dependent on the number of false-sharing vs. true-sharing accesses, the number of pages in the program’s working set, the amount of user computation that completes per page request, and the internodal round-trip message time in the system. Overall, the results show that AVR could be an important member of the arsenal of tools available to parallel programmers
    • …
    corecore