6,268 research outputs found

    Automating Fault Tolerance in High-Performance Computational Biological Jobs Using Multi-Agent Approaches

    Get PDF
    Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more computing cores on which they execute fail. This places not only a cost on the maintenance of the job, but also a cost on the time taken for reinstating the job and the risk of losing data and execution accomplished by the job before it failed. Approaches which can proactively detect computing core failures and take action to relocate the computing core's job onto reliable cores can make a significant step towards automating fault tolerance. Method: This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. The approaches are investigated for single core failure scenarios that can occur in the execution of parallel reduction algorithms on computer clusters. A third approach is proposed that incorporates multi-agent technology both at the job and core level. Experiments are pursued in the context of genome searching, a popular computational biology application. Result: The key conclusion is that the approaches proposed are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical experiment in which the fault tolerance is studied, centralised and decentralised checkpointing approaches on an average add 90% to the actual time for executing the job. On the other hand, in the same experiment the multi-agent approaches add only 10% to the overall execution time.Comment: Computers in Biology and Medicin

    Rapid Recovery for Systems with Scarce Faults

    Full text link
    Our goal is to achieve a high degree of fault tolerance through the control of a safety critical systems. This reduces to solving a game between a malicious environment that injects failures and a controller who tries to establish a correct behavior. We suggest a new control objective for such systems that offers a better balance between complexity and precision: we seek systems that are k-resilient. In order to be k-resilient, a system needs to be able to rapidly recover from a small number, up to k, of local faults infinitely many times, provided that blocks of up to k faults are separated by short recovery periods in which no fault occurs. k-resilience is a simple but powerful abstraction from the precise distribution of local faults, but much more refined than the traditional objective to maximize the number of local faults. We argue why we believe this to be the right level of abstraction for safety critical systems when local faults are few and far between. We show that the computational complexity of constructing optimal control with respect to resilience is low and demonstrate the feasibility through an implementation and experimental results.Comment: In Proceedings GandALF 2012, arXiv:1210.202

    DOLFIN: Automated Finite Element Computing

    Get PDF
    We describe here a library aimed at automating the solution of partial differential equations using the finite element method. By employing novel techniques for automated code generation, the library combines a high level of expressiveness with efficient computation. Finite element variational forms may be expressed in near mathematical notation, from which low-level code is automatically generated, compiled and seamlessly integrated with efficient implementations of computational meshes and high-performance linear algebra. Easy-to-use object-oriented interfaces to the library are provided in the form of a C++ library and a Python module. This paper discusses the mathematical abstractions and methods used in the design of the library and its implementation. A number of examples are presented to demonstrate the use of the library in application code

    Parameterized Model-Checking for Timed-Systems with Conjunctive Guards (Extended Version)

    Full text link
    In this work we extend the Emerson and Kahlon's cutoff theorems for process skeletons with conjunctive guards to Parameterized Networks of Timed Automata, i.e. systems obtained by an \emph{apriori} unknown number of Timed Automata instantiated from a finite set U1,ā€¦,UnU_1, \dots, U_n of Timed Automata templates. In this way we aim at giving a tool to universally verify software systems where an unknown number of software components (i.e. processes) interact with continuous time temporal constraints. It is often the case, indeed, that distributed algorithms show an heterogeneous nature, combining dynamic aspects with real-time aspects. In the paper we will also show how to model check a protocol that uses special variables storing identifiers of the participating processes (i.e. PIDs) in Timed Automata with conjunctive guards. This is non-trivial, since solutions to the parameterized verification problem often relies on the processes to be symmetric, i.e. indistinguishable. On the other side, many popular distributed algorithms make use of PIDs and thus cannot directly apply those solutions
    • ā€¦
    corecore