
    Automating Fault Tolerance in High-Performance Computational Biological Jobs Using Multi-Agent Approaches

    Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more of the computing cores on which they execute fail. This imposes not only a cost for maintaining the job, but also a cost in the time taken to reinstate the job and the risk of losing data and computation accomplished by the job before it failed. Approaches which can proactively detect computing core failures and take action to relocate a failing core's job onto reliable cores can make a significant step towards automating fault tolerance. Method: This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. The approaches are investigated for single-core failure scenarios that can occur in the execution of parallel reduction algorithms on computer clusters. A third approach is proposed that incorporates multi-agent technology at both the job and core levels. Experiments are pursued in the context of genome searching, a popular computational biology application. Results: The key conclusion is that the proposed approaches are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical fault tolerance experiment, centralised and decentralised checkpointing approaches on average add 90% to the actual time for executing the job, whereas the multi-agent approaches add only 10% to the overall execution time. Comment: Computers in Biology and Medicine.
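    As a rough illustration of the core-level approach the abstract describes, the sketch below models an agent that sweeps the cores, treats an unhealthy core as about to fail, and relocates its job to a reliable idle core. All types and function names here are hypothetical, not taken from the paper.

```c
/*
 * Minimal sketch of a core-level monitoring agent, assuming a
 * hypothetical cluster API; illustrative only, not the paper's code.
 */
#include <stdbool.h>
#include <stdio.h>

#define NUM_CORES 64

typedef struct {
    int  id;
    bool healthy;  /* set by a heartbeat/sensor probe (stubbed here) */
    int  job_id;   /* -1 if the core is idle */
} Core;

/* Predict an imminent failure; a real agent would use heartbeats,
 * temperature sensors, etc. */
static bool core_predicted_to_fail(const Core *c) {
    return !c->healthy;
}

/* Relocate the job from a failing core to the first reliable idle core. */
static int relocate_job(Core cores[], int failing) {
    for (int i = 0; i < NUM_CORES; i++) {
        if (cores[i].healthy && cores[i].job_id < 0) {
            cores[i].job_id = cores[failing].job_id;
            cores[failing].job_id = -1;
            printf("agent: moved job %d from core %d to core %d\n",
                   cores[i].job_id, failing, i);
            return i;
        }
    }
    return -1;  /* no reliable core available */
}

/* One monitoring sweep over all cores. */
static void agent_sweep(Core cores[]) {
    for (int i = 0; i < NUM_CORES; i++)
        if (cores[i].job_id >= 0 && core_predicted_to_fail(&cores[i]))
            relocate_job(cores, i);
}

int main(void) {
    Core cores[NUM_CORES];
    for (int i = 0; i < NUM_CORES; i++)
        cores[i] = (Core){ .id = i, .healthy = true, .job_id = -1 };
    cores[3].job_id  = 7;      /* job 7 runs on core 3 */
    cores[3].healthy = false;  /* core 3 is about to fail */
    agent_sweep(cores);        /* prints the relocation of job 7 */
    return 0;
}
```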

    Exploring parallel MPI fault tolerance mechanisms for phylogenetic inference with RAxML-NG

    Motivation: Phylogenetic trees are now routinely inferred on large-scale high-performance computing systems with thousands of cores, as the parallel scalability of phylogenetic inference tools has improved in recent years to cope with the molecular data avalanche. Thus, the parallel fault tolerance of phylogenetic inference tools has become a relevant challenge. To this end, we explore parallel fault tolerance mechanisms and algorithms, the software modifications required, and the performance penalties induced by enabling parallel fault tolerance, using the example of RAxML-NG, the successor of the widely used RAxML tool for maximum likelihood-based phylogenetic tree inference. Results: We find that the slowdown induced by the necessary additional recovery mechanisms in RAxML-NG is on average 1.00 ± 0.04. The overall slowdown from using these recovery mechanisms in conjunction with a fault-tolerant Message Passing Interface implementation amounts to 1.7 ± 0.6 on average for large empirical datasets. Via failure simulations, we show that RAxML-NG can successfully recover from multiple simultaneous failures, subsequent failures, failures during recovery, and failures during checkpointing. Recoveries are automatic and transparent to the user.
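    The fault-tolerant MPI mechanisms the abstract refers to follow the ULFM (User-Level Failure Mitigation) pattern. Below is a minimal sketch of that pattern, assuming an MPI implementation with the ULFM extensions (e.g. a recent Open MPI providing <mpi-ext.h>); this is not RAxML-NG's actual code, and restore_from_checkpoint() is a hypothetical placeholder.

```c
/*
 * Minimal ULFM-style recovery sketch: detect a rank failure from a
 * collective's error class, shrink the communicator, and resume from
 * the last checkpoint.
 */
#include <mpi.h>
#include <mpi-ext.h>  /* MPIX_Comm_revoke, MPIX_Comm_shrink, MPIX_ERR_* */

static void restore_from_checkpoint(MPI_Comm comm) {
    /* Hypothetical: reload the last checkpointed state (e.g. the
     * current tree and model parameters) on the surviving ranks. */
    (void)comm;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;

    /* Report failures as error codes instead of aborting the job. */
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    double local = 1.0, sum = 0.0;
    int rc = MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, comm);

    int eclass = MPI_SUCCESS;
    if (rc != MPI_SUCCESS)
        MPI_Error_class(rc, &eclass);

    if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED) {
        MPIX_Comm_revoke(comm);           /* propagate the failure to all ranks */
        MPI_Comm shrunk;
        MPIX_Comm_shrink(comm, &shrunk);  /* communicator without dead ranks */
        comm = shrunk;
        restore_from_checkpoint(comm);    /* resume from the last good state */
    }

    MPI_Finalize();
    return 0;
}
```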

    3rd EGEE User Forum

    We have organized this book as a sequence of chapters, each associated with an application or technical theme, introduced by an overview of the contents and a summary of the main conclusions coming from the Forum for the chapter topic. The first chapter gathers all the plenary session keynote addresses, and following this there is a sequence of chapters covering the application-flavoured sessions. These are followed by chapters with the flavour of Computer Science and Grid Technology. The final chapter covers the large number of practical demonstrations and posters exhibited at the Forum. Much of the work presented has a direct link to specific areas of science, and so we have created a Science Index, presented below. In addition, at the end of this book, we provide a complete list of the institutes and countries involved in the User Forum.

    Network on chip architecture for multi-agent systems in FPGA

    A system of interacting agents is, by definition, very demanding in terms of computational resources. Although multi-agent systems have been used to solve complex problems in many areas, it is usually very difficult to perform large-scale simulations on their targeted serial computing platforms. Reconfigurable hardware, in particular Field Programmable Gate Array (FPGA) devices, has been successfully used in High Performance Computing applications due to its inherent flexibility, data parallelism and algorithm acceleration capabilities. Indeed, reconfigurable hardware seems to be the next logical step in the agency paradigm, but only a few attempts have been successful in implementing multi-agent systems on these platforms. This paper discusses the problem of inter-agent communications in Field Programmable Gate Arrays. It proposes a Network-on-Chip in a hierarchical star topology to enable agents' transactions through message broadcasting, using the Open Core Protocol as an interface between hardware modules. A customizable router microarchitecture is described, and a multi-agent system is created to simulate and analyse message exchanges in a generic heavy-traffic-load agent-based application. Experiments have shown a throughput of 1.6 Gbps per port at 100 MHz without packet loss, and seamless scalability characteristics.
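    To make the broadcast behaviour concrete, here is a minimal software model of a star-topology router forwarding a packet to every port except its source. The packet layout, port count and broadcast address are invented for illustration and do not reflect the paper's OCP interface or router microarchitecture.

```c
/* Minimal software model of broadcast routing in a star-topology NoC;
 * all field widths and constants are illustrative placeholders. */
#include <stdint.h>
#include <stdio.h>

#define NUM_PORTS 8
#define BROADCAST 0xFF  /* illustrative broadcast address */

typedef struct {
    uint8_t  src;      /* source agent/port id */
    uint8_t  dst;      /* BROADCAST = send to all other ports */
    uint32_t payload;  /* message body (illustrative width) */
} Packet;

/* Central router: forward to the destination port, or to every port
 * except the source when the packet is a broadcast. */
void route(const Packet *p) {
    if (p->dst == BROADCAST) {
        for (int port = 0; port < NUM_PORTS; port++)
            if (port != p->src)
                printf("port %d <- payload %u (broadcast from %u)\n",
                       port, p->payload, p->src);
    } else {
        printf("port %u <- payload %u (unicast from %u)\n",
               p->dst, p->payload, p->src);
    }
}

int main(void) {
    Packet hello = { .src = 2, .dst = BROADCAST, .payload = 42 };
    route(&hello);  /* delivered to ports 0-7 except port 2 */
    return 0;
}
```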

    Artificial Intelligence for Small Satellites Mission Autonomy

    Space mission engineering has always been recognized as a very challenging and innovative branch of engineering: since the beginning of the space race, numerous milestones, key successes and failures, improvements, and connections with other engineering domains have been reached. Despite its relatively young age, the space engineering discipline has not gone through homogeneous times: alternation of leading nations, shifts in public and private interests, and allocation of resources to different domains and goals are all examples of the intrinsic dynamism that characterizes this discipline. The dynamism is even more striking in the last two decades, in which several factors contributed to the fervour of this period. Two of the most important were certainly the increased presence and push of the commercial and private sector, and the overall intent of reducing the size of spacecraft while maintaining a comparable level of performance. A key example of the second driver is the introduction, in 1999, of a new category of space systems called CubeSats. Envisioned and designed to ease access to space for universities, by standardizing spacecraft development and by ensuring high probabilities of acceptance as piggyback customers in launches, the standard was quickly adopted not only by universities, but also by agencies and private companies. CubeSats turned out to be a disruptive innovation, and the space mission ecosystem was deeply changed by it. New mission concepts and architectures are being developed: CubeSats are now considered as secondary payloads of bigger missions, and constellations are being deployed in Low Earth Orbit to perform observation missions at a performance level previously considered achievable only by traditional, full-sized spacecraft. CubeSats, and more generally small-satellite technology, had to overcome important challenges in the last few years that were constraining and reducing the diffusion and adoption potential of smaller spacecraft for scientific and technology demonstration missions. Among these challenges were: the miniaturization of propulsion technologies, to enable concepts such as rendezvous and docking, or interplanetary missions; the improvement of the telecommunication state of the art for small satellites, to enable the downlink to Earth of all the data acquired during the mission; and the miniaturization of scientific instruments, to be able to exploit CubeSats in more meaningful, scientific ways. With the size reduction and the consolidation of the technology, many aspects of a space mission shrink in consequence: among these, costs and development and launch times can be cited. An important aspect that has not been shown to scale accordingly is operations: even small satellite missions need human operators and capable ground control centres. In addition, with the possibility of having constellations or interplanetary distributed missions, a redesign of how operations are managed is required, to cope with the innovation in space mission architectures. The present work addresses the issue of operations for small satellite missions. The thesis presents research, carried out at several institutions (Politecnico di Torino, MIT, NASA JPL), aimed at improving the autonomy level of space missions, and in particular of small satellites.
The key technology exploited in the research is Artificial Intelligence, a computer science branch that has gained extreme interest in research disciplines such as medicine, security, image recognition and language processing, and is currently making its way into space engineering as well. The thesis focuses on three topics, and three related applications have been developed and are presented here: autonomous operations by means of event detection algorithms, intelligent failure detection on small satellite actuator systems, and decision-making support through intelligent tradespace exploration during the preliminary design of space missions. The Artificial Intelligence technologies explored are: Machine Learning, and in particular Neural Networks; Knowledge-based Systems, and in particular Fuzzy Logic; and Evolutionary Algorithms, and in particular Genetic Algorithms. The thesis covers the domain (small satellites), the technology (Artificial Intelligence) and the focus (mission autonomy), and presents three case studies that demonstrate the feasibility of employing Artificial Intelligence to enhance how missions are currently operated and designed.
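    As a concrete illustration of the third technique, the sketch below shows a bare-bones genetic algorithm over a small design vector, with tournament selection, uniform crossover and random mutation. The design variables and the cost/benefit objective are invented placeholders, not the thesis's actual tradespace models.

```c
/* Minimal genetic-algorithm sketch for tradespace exploration;
 * the objective function and variable ranges are illustrative only. */
#include <stdio.h>
#include <stdlib.h>

#define POP  32
#define GENS 100
#define VARS 4  /* e.g. mass, power, aperture, data rate (illustrative) */

typedef struct { double x[VARS]; double fitness; } Design;

/* Placeholder objective: reward "benefit" (sum of variables) and
 * penalise "cost" (sum of squares). */
static double evaluate(const Design *d) {
    double benefit = 0.0, cost = 0.0;
    for (int i = 0; i < VARS; i++) {
        benefit += d->x[i];
        cost    += d->x[i] * d->x[i];
    }
    return benefit - 0.1 * cost;
}

static double frand(void) { return (double)rand() / RAND_MAX; }

/* Binary tournament: return the fitter of two random individuals. */
static const Design *tournament(const Design pop[]) {
    int a = rand() % POP, b = rand() % POP;
    return pop[a].fitness > pop[b].fitness ? &pop[a] : &pop[b];
}

int main(void) {
    Design pop[POP], next[POP];
    for (int i = 0; i < POP; i++) {
        for (int v = 0; v < VARS; v++) pop[i].x[v] = 10.0 * frand();
        pop[i].fitness = evaluate(&pop[i]);
    }
    for (int g = 0; g < GENS; g++) {
        for (int i = 0; i < POP; i++) {
            const Design *p1 = tournament(pop), *p2 = tournament(pop);
            /* uniform crossover + occasional small random mutation */
            for (int v = 0; v < VARS; v++) {
                next[i].x[v] = (rand() & 1) ? p1->x[v] : p2->x[v];
                if (frand() < 0.1) next[i].x[v] += 0.5 * (frand() - 0.5);
            }
            next[i].fitness = evaluate(&next[i]);
        }
        for (int i = 0; i < POP; i++) pop[i] = next[i];
    }
    int best = 0;  /* report the best design found */
    for (int i = 1; i < POP; i++)
        if (pop[i].fitness > pop[best].fitness) best = i;
    printf("best fitness: %.3f\n", pop[best].fitness);
    return 0;
}
```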