
    Automating Fault Tolerance in High-Performance Computational Biological Jobs Using Multi-Agent Approaches

    Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more of the computing cores on which they execute fail. This imposes not only a cost for maintaining the job, but also a cost in the time taken to reinstate the job and the risk of losing data and computation accomplished by the job before it failed. Approaches which can proactively detect computing core failures and take action to relocate a failing core's job onto reliable cores can make a significant step towards automating fault tolerance. Method: This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. The approaches are investigated for single-core failure scenarios that can occur in the execution of parallel reduction algorithms on computer clusters. A third approach is proposed that incorporates multi-agent technology at both the job and core levels. Experiments are pursued in the context of genome searching, a popular computational biology application. Results: The key conclusion is that the proposed approaches are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical fault tolerance experiment, centralised and decentralised checkpointing approaches on average add 90% to the actual time for executing the job, whereas the multi-agent approaches add only 10% to the overall execution time. Comment: Computers in Biology and Medicine.
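    As a rough illustration of the core-level approach the abstract describes, the sketch below models an agent that sweeps the cores, treats an unhealthy core as about to fail, and relocates its job to a reliable idle core. All types and function names here are hypothetical, not taken from the paper.

```c
/*
 * Minimal sketch of a core-level monitoring agent, assuming a
 * hypothetical cluster API; illustrative only, not the paper's code.
 */
#include <stdbool.h>
#include <stdio.h>

#define NUM_CORES 64

typedef struct {
    int  id;
    bool healthy;  /* set by a heartbeat/sensor probe (stubbed here) */
    int  job_id;   /* -1 if the core is idle */
} Core;

/* Predict an imminent failure; a real agent would use heartbeats,
 * temperature sensors, etc. */
static bool core_predicted_to_fail(const Core *c) {
    return !c->healthy;
}

/* Relocate the job from a failing core to the first reliable idle core. */
static int relocate_job(Core cores[], int failing) {
    for (int i = 0; i < NUM_CORES; i++) {
        if (cores[i].healthy && cores[i].job_id < 0) {
            cores[i].job_id = cores[failing].job_id;
            cores[failing].job_id = -1;
            printf("agent: moved job %d from core %d to core %d\n",
                   cores[i].job_id, failing, i);
            return i;
        }
    }
    return -1;  /* no reliable core available */
}

/* One monitoring sweep over all cores. */
static void agent_sweep(Core cores[]) {
    for (int i = 0; i < NUM_CORES; i++)
        if (cores[i].job_id >= 0 && core_predicted_to_fail(&cores[i]))
            relocate_job(cores, i);
}

int main(void) {
    Core cores[NUM_CORES];
    for (int i = 0; i < NUM_CORES; i++)
        cores[i] = (Core){ .id = i, .healthy = true, .job_id = -1 };
    cores[3].job_id  = 7;      /* job 7 runs on core 3 */
    cores[3].healthy = false;  /* core 3 is about to fail */
    agent_sweep(cores);        /* prints the relocation of job 7 */
    return 0;
}
```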

    Exploring parallel MPI fault tolerance mechanisms for phylogenetic inference with RAxML-NG

    Motivation: Phylogenetic trees are now routinely inferred on large-scale high-performance computing systems with thousands of cores, as the parallel scalability of phylogenetic inference tools has improved in recent years to cope with the molecular data avalanche. Thus, the parallel fault tolerance of phylogenetic inference tools has become a relevant challenge. To this end, we explore parallel fault tolerance mechanisms and algorithms, the software modifications required, and the performance penalties induced by enabling parallel fault tolerance, using the example of RAxML-NG, the successor of the widely used RAxML tool for maximum likelihood-based phylogenetic tree inference. Results: We find that the slowdown induced by the necessary additional recovery mechanisms in RAxML-NG is on average 1.00 ± 0.04. The overall slowdown from using these recovery mechanisms in conjunction with a fault-tolerant Message Passing Interface implementation amounts to 1.7 ± 0.6 on average for large empirical datasets. Via failure simulations, we show that RAxML-NG can successfully recover from multiple simultaneous failures, subsequent failures, failures during recovery, and failures during checkpointing. Recoveries are automatic and transparent to the user.
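    The fault-tolerant MPI mechanisms the abstract refers to follow the ULFM (User-Level Failure Mitigation) pattern. Below is a minimal sketch of that pattern, assuming an MPI implementation with the ULFM extensions (e.g. a recent Open MPI providing <mpi-ext.h>); this is not RAxML-NG's actual code, and restore_from_checkpoint() is a hypothetical placeholder.

```c
/*
 * Minimal ULFM-style recovery sketch: detect a rank failure from a
 * collective's error class, shrink the communicator, and resume from
 * the last checkpoint.
 */
#include <mpi.h>
#include <mpi-ext.h>  /* MPIX_Comm_revoke, MPIX_Comm_shrink, MPIX_ERR_* */

static void restore_from_checkpoint(MPI_Comm comm) {
    /* Hypothetical: reload the last checkpointed state (e.g. the
     * current tree and model parameters) on the surviving ranks. */
    (void)comm;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;

    /* Report failures as error codes instead of aborting the job. */
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    double local = 1.0, sum = 0.0;
    int rc = MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, comm);

    int eclass = MPI_SUCCESS;
    if (rc != MPI_SUCCESS)
        MPI_Error_class(rc, &eclass);

    if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED) {
        MPIX_Comm_revoke(comm);           /* propagate the failure to all ranks */
        MPI_Comm shrunk;
        MPIX_Comm_shrink(comm, &shrunk);  /* communicator without dead ranks */
        comm = shrunk;
        restore_from_checkpoint(comm);    /* resume from the last good state */
    }

    MPI_Finalize();
    return 0;
}
```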

    3rd EGEE User Forum

    We have organized this book as a sequence of chapters, each associated with an application or technical theme, introduced by an overview of the contents and a summary of the main conclusions coming from the Forum for the chapter topic. The first chapter gathers all the plenary session keynote addresses, and following this there is a sequence of chapters covering the application-flavoured sessions. These are followed by chapters with the flavour of Computer Science and Grid Technology. The final chapter covers the large number of practical demonstrations and posters exhibited at the Forum. Much of the work presented has a direct link to specific areas of science, and so we have created a Science Index, presented below. In addition, at the end of this book, we provide a complete list of the institutes and countries involved in the User Forum.

    Network on chip architecture for multi-agent systems in FPGA

    A system of interacting agents is, by definition, very demanding in terms of computational resources. Although multi-agent systems have been used to solve complex problems in many areas, it is usually very difficult to perform large-scale simulations on their targeted serial computing platforms. Reconfigurable hardware, in particular Field Programmable Gate Array (FPGA) devices, has been successfully used in High Performance Computing applications due to its inherent flexibility, data parallelism and algorithm acceleration capabilities. Indeed, reconfigurable hardware seems to be the next logical step in the agency paradigm, but only a few attempts have been successful in implementing multi-agent systems on these platforms. This paper discusses the problem of inter-agent communications in Field Programmable Gate Arrays. It proposes a Network-on-Chip in a hierarchical star topology to enable agents' transactions through message broadcasting, using the Open Core Protocol as an interface between hardware modules. A customizable router microarchitecture is described, and a multi-agent system is created to simulate and analyse message exchanges in a generic heavy-traffic-load agent-based application. Experiments have shown a throughput of 1.6 Gbps per port at 100 MHz without packet loss, and seamless scalability characteristics.
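    To make the broadcast behaviour concrete, here is a minimal software model of a star-topology router forwarding a packet to every port except its source. The packet layout, port count and broadcast address are invented for illustration and do not reflect the paper's OCP interface or router microarchitecture.

```c
/* Minimal software model of broadcast routing in a star-topology NoC;
 * all field widths and constants are illustrative placeholders. */
#include <stdint.h>
#include <stdio.h>

#define NUM_PORTS 8
#define BROADCAST 0xFF  /* illustrative broadcast address */

typedef struct {
    uint8_t  src;      /* source agent/port id */
    uint8_t  dst;      /* BROADCAST = send to all other ports */
    uint32_t payload;  /* message body (illustrative width) */
} Packet;

/* Central router: forward to the destination port, or to every port
 * except the source when the packet is a broadcast. */
void route(const Packet *p) {
    if (p->dst == BROADCAST) {
        for (int port = 0; port < NUM_PORTS; port++)
            if (port != p->src)
                printf("port %d <- payload %u (broadcast from %u)\n",
                       port, p->payload, p->src);
    } else {
        printf("port %u <- payload %u (unicast from %u)\n",
               p->dst, p->payload, p->src);
    }
}

int main(void) {
    Packet hello = { .src = 2, .dst = BROADCAST, .payload = 42 };
    route(&hello);  /* delivered to ports 0-7 except port 2 */
    return 0;
}
```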

    Artificial Intelligence for Small Satellites Mission Autonomy

    Space mission engineering has always been recognized as a very challenging and innovative branch of engineering: since the beginning of the space race, numerous milestones, key successes and failures, improvements, and connections with other engineering domains have been reached. Despite its relatively young age, the space engineering discipline has not gone through homogeneous times: alternation of leading nations, shifts in public and private interests, and allocation of resources to different domains and goals are all examples of the intrinsic dynamism that characterizes this discipline. The dynamism is even more striking in the last two decades, in which several factors contributed to the fervour of this period. Two of the most important were certainly the increased presence and push of the commercial and private sector, and the overall intent of reducing the size of spacecraft while maintaining a comparable level of performance. A key example of the second driver is the introduction, in 1999, of a new category of space systems called CubeSats. Envisioned and designed to ease access to space for universities, by standardizing spacecraft development and by ensuring high probabilities of acceptance as piggyback customers in launches, the standard was quickly adopted not only by universities, but also by agencies and private companies. CubeSats turned out to be a disruptive innovation, and the space mission ecosystem was deeply changed by it. New mission concepts and architectures are being developed: CubeSats are now considered as secondary payloads of bigger missions, and constellations are being deployed in Low Earth Orbit to perform observation missions at a performance level previously considered achievable only by traditional, full-sized spacecraft. CubeSats, and more generally small-satellite technology, had to overcome important challenges in the last few years that were constraining and reducing the diffusion and adoption potential of smaller spacecraft for scientific and technology demonstration missions. Among these challenges were: the miniaturization of propulsion technologies, to enable concepts such as rendezvous and docking, or interplanetary missions; the improvement of the telecommunication state of the art for small satellites, to enable the downlink to Earth of all the data acquired during the mission; and the miniaturization of scientific instruments, to be able to exploit CubeSats in more meaningful, scientific ways. With the size reduction and the consolidation of the technology, many aspects of a space mission shrink in consequence: among these, costs and development and launch times can be cited. An important aspect that has not been shown to scale accordingly is operations: even small satellite missions need human operators and capable ground control centres. In addition, with the possibility of having constellations or interplanetary distributed missions, a redesign of how operations are managed is required, to cope with the innovation in space mission architectures. The present work addresses the issue of operations for small satellite missions. The thesis presents research, carried out at several institutions (Politecnico di Torino, MIT, NASA JPL), aimed at improving the autonomy level of space missions, and in particular of small satellites.
The key technology exploited in the research is Artificial Intelligence, a computer science branch that has gained extreme interest in research disciplines such as medicine, security, image recognition and language processing, and is currently making its way into space engineering as well. The thesis focuses on three topics, and three related applications have been developed and are presented here: autonomous operations by means of event detection algorithms, intelligent failure detection on small satellite actuator systems, and decision-making support through intelligent tradespace exploration during the preliminary design of space missions. The Artificial Intelligence technologies explored are: Machine Learning, and in particular Neural Networks; Knowledge-based Systems, and in particular Fuzzy Logic; and Evolutionary Algorithms, and in particular Genetic Algorithms. The thesis covers the domain (small satellites), the technology (Artificial Intelligence) and the focus (mission autonomy), and presents three case studies that demonstrate the feasibility of employing Artificial Intelligence to enhance how missions are currently operated and designed.
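    As a concrete illustration of the third technique, the sketch below shows a bare-bones genetic algorithm over a small design vector, with tournament selection, uniform crossover and random mutation. The design variables and the cost/benefit objective are invented placeholders, not the thesis's actual tradespace models.

```c
/* Minimal genetic-algorithm sketch for tradespace exploration;
 * the objective function and variable ranges are illustrative only. */
#include <stdio.h>
#include <stdlib.h>

#define POP  32
#define GENS 100
#define VARS 4  /* e.g. mass, power, aperture, data rate (illustrative) */

typedef struct { double x[VARS]; double fitness; } Design;

/* Placeholder objective: reward "benefit" (sum of variables) and
 * penalise "cost" (sum of squares). */
static double evaluate(const Design *d) {
    double benefit = 0.0, cost = 0.0;
    for (int i = 0; i < VARS; i++) {
        benefit += d->x[i];
        cost    += d->x[i] * d->x[i];
    }
    return benefit - 0.1 * cost;
}

static double frand(void) { return (double)rand() / RAND_MAX; }

/* Binary tournament: return the fitter of two random individuals. */
static const Design *tournament(const Design pop[]) {
    int a = rand() % POP, b = rand() % POP;
    return pop[a].fitness > pop[b].fitness ? &pop[a] : &pop[b];
}

int main(void) {
    Design pop[POP], next[POP];
    for (int i = 0; i < POP; i++) {
        for (int v = 0; v < VARS; v++) pop[i].x[v] = 10.0 * frand();
        pop[i].fitness = evaluate(&pop[i]);
    }
    for (int g = 0; g < GENS; g++) {
        for (int i = 0; i < POP; i++) {
            const Design *p1 = tournament(pop), *p2 = tournament(pop);
            /* uniform crossover + occasional small random mutation */
            for (int v = 0; v < VARS; v++) {
                next[i].x[v] = (rand() & 1) ? p1->x[v] : p2->x[v];
                if (frand() < 0.1) next[i].x[v] += 0.5 * (frand() - 0.5);
            }
            next[i].fitness = evaluate(&next[i]);
        }
        for (int i = 0; i < POP; i++) pop[i] = next[i];
    }
    int best = 0;  /* report the best design found */
    for (int i = 1; i < POP; i++)
        if (pop[i].fitness > pop[best].fitness) best = i;
    printf("best fitness: %.3f\n", pop[best].fitness);
    return 0;
}
```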