Automating Fault Tolerance in High-Performance Computational Biological Jobs Using Multi-Agent Approaches

Author: Alexandrov Vassil
McKee Gerard
Varghese Blesson
Publication venue: 'Elsevier BV'
Publication date: 03/03/2014

Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more computing cores on which they execute fail. This places not only a cost on the maintenance of the job, but also a cost on the time taken for reinstating the job and the risk of losing data and execution accomplished by the job before it failed. Approaches which can proactively detect computing core failures and take action to relocate the computing core's job onto reliable cores can make a significant step towards automating fault tolerance. Method: This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. The approaches are investigated for single core failure scenarios that can occur in the execution of parallel reduction algorithms on computer clusters. A third approach is proposed that incorporates multi-agent technology at both the job and core level. Experiments are pursued in the context of genome searching, a popular computational biology application. Result: The key conclusion is that the approaches proposed are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical experiment in which the fault tolerance is studied, centralised and decentralised checkpointing approaches on average add 90% to the actual time for executing the job. On the other hand, in the same experiment the multi-agent approaches add only 10% to the overall execution time. Comment: Computers in Biology and Medicin
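
The overhead figures quoted in this abstract can be made concrete with a short calculation. The sketch below is illustrative only: the 60-minute baseline and the function name `time_with_overhead` are assumptions, not values from the paper. Only the 90% and 10% additions come from the abstract.

```python
def time_with_overhead(base_minutes: float, overhead_fraction: float) -> float:
    """Return the total run time when fault tolerance adds a fraction of the base time."""
    return base_minutes * (1.0 + overhead_fraction)


# Hypothetical baseline job time; the abstract does not state one.
BASE_MINUTES = 60.0

# Overheads reported in the abstract: checkpointing adds 90%, multi-agent adds 10%.
checkpointing_minutes = time_with_overhead(BASE_MINUTES, 0.90)
multi_agent_minutes = time_with_overhead(BASE_MINUTES, 0.10)

if __name__ == "__main__":
    print(f"checkpointing: {checkpointing_minutes:.1f} min")
    print(f"multi-agent:   {multi_agent_minutes:.1f} min")
```

Under that assumed baseline, the difference between the two totals equals 80% of the baseline time, whatever baseline is chosen.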

arXiv.org e-Print Archive

Queen's University Belfast Research Portal

University of St. Andrews - Pure

St Andrews Research Repository

Neurons and symbols: a manifesto

Author: Garcez A.
Publication date: 01/07/2010

We discuss the purpose of neural-symbolic integration including its principles, mechanisms and applications. We outline a cognitive computational model for neural-symbolic integration, position the model in the broader context of multi-agent systems, machine learning and automated reasoning, and list some of the challenges for the area of neural-symbolic computation to achieve the promise of effective integration of robust learning and expressive reasoning under uncertainty

City Research Online

Rapid Recovery for Systems with Scarce Faults

Author: Huang Chung-Hao
Peled Doron
Schewe Sven
Wang Farn
Publication venue: 'Open Publishing Association'
Publication date: 01/10/2012

Our goal is to achieve a high degree of fault tolerance through the control of safety-critical systems. This reduces to solving a game between a malicious environment that injects failures and a controller who tries to establish a correct behavior. We suggest a new control objective for such systems that offers a better balance between complexity and precision: we seek systems that are k-resilient. In order to be k-resilient, a system needs to be able to rapidly recover from a small number, up to k, of local faults infinitely many times, provided that blocks of up to k faults are separated by short recovery periods in which no fault occurs. k-resilience is a simple but powerful abstraction from the precise distribution of local faults, but much more refined than the traditional objective to maximize the number of local faults. We argue why we believe this to be the right level of abstraction for safety-critical systems when local faults are few and far between. We show that the computational complexity of constructing optimal control with respect to resilience is low and demonstrate the feasibility through an implementation and experimental results. Comment: In Proceedings GandALF 2012, arXiv:1210.202
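
The fault-distribution assumption behind k-resilience (blocks of at most k faults, separated by fault-free recovery periods) can be illustrated on a finite trace. The sketch below is not the paper's game-solving algorithm. It only checks the environment-side assumption, and the function name `respects_fault_assumption` and the explicit `recovery` length are hypothetical choices, since the abstract says only that recovery periods are "short".

```python
from typing import Sequence


def respects_fault_assumption(trace: Sequence[bool], k: int, recovery: int) -> bool:
    """Check that faults in a finite trace come in blocks of at most k faults.

    trace[i] is True when a local fault occurs at step i. A block ends once
    `recovery` consecutive fault-free steps have passed (assumed parameter).
    """
    faults_in_block = 0
    quiet_steps = 0
    for fault in trace:
        if fault:
            faults_in_block += 1
            quiet_steps = 0
            if faults_in_block > k:
                return False  # more than k faults without a full recovery period
        else:
            quiet_steps += 1
            if quiet_steps >= recovery:
                faults_in_block = 0  # recovery period completed, start a new block
    return True


if __name__ == "__main__":
    spaced = [True, True, False, False, False, True]
    crowded = [True, True, False, True]
    print(respects_fault_assumption(spaced, k=2, recovery=3))
    print(respects_fault_assumption(crowded, k=2, recovery=3))
```

A trace that passes this check is one on which a k-resilient controller, in the sense described above, would be expected to keep recovering.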

arXiv.org e-Print Archive

Directory of Open Access Journals

DOLFIN: Automated Finite Element Computing

Author: Logg Anders
Wells GN
Publication date: 01/01/2009

We describe here a library aimed at automating the solution of partial differential equations using the finite element method. By employing novel techniques for automated code generation, the library combines a high level of expressiveness with efficient computation. Finite element variational forms may be expressed in near mathematical notation, from which low-level code is automatically generated, compiled and seamlessly integrated with efficient implementations of computational meshes and high-performance linear algebra. Easy-to-use object-oriented interfaces to the library are provided in the form of a C++ library and a Python module. This paper discusses the mathematical abstractions and methods used in the design of the library and its implementation. A number of examples are presented to demonstrate the use of the library in application code

CiteSeerX

Chalmers Research

Apollo (Cambridge)

Chalmers Publication Library

Parameterized Model-Checking for Timed-Systems with Conjunctive Guards (Extended Version)

Author: A Bouajjani
A Emerson
B Aminof
EA Emerson
F Pagliarecci
K Apt
L Zuck
P Bouyer
P Godefroid
PA Abdulla
S Ben-David
SM German
T Ball
TT Johnson
Y Hanna
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 30/07/2014

In this work we extend Emerson and Kahlon's cutoff theorems for process skeletons with conjunctive guards to Parameterized Networks of Timed Automata, i.e. systems obtained by an a priori unknown number of Timed Automata instantiated from a finite set $U_1, \dots, U_n$ of Timed Automata templates. In this way we aim at giving a tool to universally verify software systems where an unknown number of software components (i.e. processes) interact with continuous time temporal constraints. It is often the case, indeed, that distributed algorithms show a heterogeneous nature, combining dynamic aspects with real-time aspects. In the paper we will also show how to model check a protocol that uses special variables storing identifiers of the participating processes (i.e. PIDs) in Timed Automata with conjunctive guards. This is non-trivial, since solutions to the parameterized verification problem often rely on the processes being symmetric, i.e. indistinguishable. On the other hand, many popular distributed algorithms make use of PIDs and thus cannot directly apply those solutions

arXiv.org e-Print Archive

Crossref

IRIS Università Politecnica delle Marche