918 research outputs found

Doing-it-All with Bounded Work and Communication

Author: Alistarh
Alon
Birman
Bridgland
Cachin
Censor-Hillel
Chlebus
Chung
Clementi
Davidoff
Davtyan
De Prisco
Diks
Drucker
Dwork
Fernández
Galil
Georgiou
Goldberg
Kanellakis
Kentros
Kontogiannis
Kowalski
Lamport
Lubotzky
Margulis
Mitzenmacher
Pippenger
Saks
Tanner
Upfal
Publication venue: 'Elsevier BV'
Publication date: 01/06/2017

We consider the Do-All problem, where $p$ cooperating processors need to complete $t$ similar and independent tasks in an adversarial setting. Here we deal with a synchronous message passing system with processors that are subject to crash failures. Efficiency of algorithms in this setting is measured in terms of work complexity (also known as total available processor steps) and communication complexity (total number of point-to-point messages). When work and communication are considered to be comparable resources, then the overall efficiency is meaningfully expressed in terms of effort defined as work + communication. We develop and analyze a constructive algorithm that has work $O(t + p \log p\, (\sqrt{p\log p}+\sqrt{t\log t}\,))$ and a nonconstructive algorithm that has work $O(t + p \log^2 p)$. The latter result is close to the lower bound $\Omega(t + p \log p / \log \log p)$ on work. The effort of each of these algorithms is proportional to its work when the number of crashes is bounded above by $c\,p$, for some positive constant $c < 1$. We also present a nonconstructive algorithm that has effort $O(t + p^{1.77})$.
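To get a feel for how the bounds quoted in this abstract compare, the sketch below evaluates each asymptotic expression for concrete values of t and p. It is illustrative only and rests on assumptions that are not in the abstract: every hidden constant is set to 1, logarithms are taken base 2, and the function names are hypothetical. The numbers it produces show relative growth, not actual step counts.

```python
import math


def _check(t: int, p: int) -> None:
    # Assumption: p >= 3 keeps log2(log2(p)) positive; t >= 1 keeps log2(t) defined.
    if t < 1 or p < 3:
        raise ValueError("need t >= 1 and p >= 3")


def constructive_work(t: int, p: int) -> float:
    """t + p log p (sqrt(p log p) + sqrt(t log t)), constants set to 1."""
    _check(t, p)
    lp, lt = math.log2(p), math.log2(t)
    return t + p * lp * (math.sqrt(p * lp) + math.sqrt(t * lt))


def nonconstructive_work(t: int, p: int) -> float:
    """t + p log^2 p, constants set to 1."""
    _check(t, p)
    return t + p * math.log2(p) ** 2


def work_lower_bound(t: int, p: int) -> float:
    """t + p log p / log log p, constants set to 1."""
    _check(t, p)
    lp = math.log2(p)
    return t + p * lp / math.log2(lp)


def nonconstructive_effort(t: int, p: int) -> float:
    """t + p^1.77, constants set to 1."""
    _check(t, p)
    return t + p ** 1.77


if __name__ == "__main__":
    # Print the four expressions side by side for a few equal values of t and p.
    for n in (64, 1024, 16384):
        print(n,
              round(work_lower_bound(n, n)),
              round(nonconstructive_work(n, n)),
              round(constructive_work(n, n)),
              round(nonconstructive_effort(n, n)))
```

With constants ignored, the nonconstructive work expression exceeds the lower bound by a factor of about log p · log log p in the term that depends on p, which is the sense in which the abstract calls the two close.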

arXiv.org e-Print Archive

University of Liverpool Repository

Crossref

Automating Fault Tolerance in High-Performance Computational Biological Jobs Using Multi-Agent Approaches

Author: Alexandrov Vassil
McKee Gerard
Varghese Blesson
Publication venue: 'Elsevier BV'
Publication date: 03/03/2014

Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more computing cores on which they execute fail. This places not only a cost on the maintenance of the job, but also a cost on the time taken for reinstating the job and the risk of losing data and execution accomplished by the job before it failed. Approaches which can proactively detect computing core failures and take action to relocate the computing core's job onto reliable cores can make a significant step towards automating fault tolerance. Method: This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. The approaches are investigated for single core failure scenarios that can occur in the execution of parallel reduction algorithms on computer clusters. A third approach is proposed that incorporates multi-agent technology both at the job and core level. Experiments are pursued in the context of genome searching, a popular computational biology application. Result: The key conclusion is that the approaches proposed are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical experiment in which the fault tolerance is studied, centralised and decentralised checkpointing approaches on average add 90% to the actual time for executing the job. On the other hand, in the same experiment the multi-agent approaches add only 10% to the overall execution time. Comment: Computers in Biology and Medicin

arXiv.org e-Print Archive

Queen's University Belfast Research Portal

University of St. Andrews - Pure

St Andrews Research Repository

CHECKPOINTING AND RECOVERY IN DISTRIBUTED AND DATABASE SYSTEMS

Author: Wu Jiang
Publication venue: UKnowledge
Publication date: 01/01/2011

A transaction-consistent global checkpoint of a database records a state of the database which reflects the effect of only completed transactions and not the results of any partially executed transactions. This thesis establishes the necessary and sufficient conditions for a checkpoint of a data item (or the checkpoints of a set of data items) to be part of a transaction-consistent global checkpoint of the database. This result would be useful for constructing transaction-consistent global checkpoints incrementally from the checkpoints of each individual data item of a database. By applying this condition, we can start from any useful checkpoint of any data item and then incrementally add checkpoints of other data items until we get a transaction-consistent global checkpoint of the database. This result can also help in designing non-intrusive checkpointing protocols for database systems. Based on the intuition gained from the development of the necessary and sufficient conditions, we also developed a non-intrusive low-overhead checkpointing protocol for distributed database systems. Checkpointing and rollback recovery are also established techniques for achieving fault-tolerance in distributed systems. Communication-induced checkpointing algorithms allow processes involved in a distributed computation to take checkpoints independently while at the same time forcing processes to take additional checkpoints so that each checkpoint is part of a consistent global checkpoint. This thesis develops a low-overhead communication-induced checkpointing protocol and presents a performance evaluation of the protocol.
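The definition in the first sentence of this abstract can be made concrete with a small check. The sketch below is a simplified illustration, not the necessary and sufficient conditions the thesis establishes: it tests only the all-or-nothing property (a transaction's writes are reflected in every item checkpoint it touched, or in none), and it ignores dependencies between transactions. The data model and the names `write_sets`, `checkpoints`, and `is_transaction_consistent` are assumptions made for this example.

```python
from typing import Dict, Set


def is_transaction_consistent(
    write_sets: Dict[str, Set[str]],
    checkpoints: Dict[str, Set[str]],
) -> bool:
    """Check the all-or-nothing property over a set of item checkpoints.

    write_sets:  transaction id -> data items that transaction writes.
    checkpoints: data item -> ids of transactions whose writes are
                 reflected in that item's checkpoint.
    """
    for txn, items in write_sets.items():
        reflected = [txn in checkpoints.get(item, set()) for item in items]
        # A partially reflected transaction makes the global checkpoint inconsistent.
        if any(reflected) and not all(reflected):
            return False
    return True


if __name__ == "__main__":
    # Hypothetical workload: T1 writes x and y, T2 writes only y.
    writes = {"T1": {"x", "y"}, "T2": {"y"}}
    print(is_transaction_consistent(writes, {"x": {"T1"}, "y": {"T1", "T2"}}))
    print(is_transaction_consistent(writes, {"x": {"T1"}, "y": set()}))
```

In the second call the checkpoint of x includes T1 while the checkpoint of y does not, so the pair records a partially executed transaction and the check fails. That is the situation an incremental construction would have to avoid when choosing which checkpoint of y to add.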

University of Kentucky

CIC : an integrated approach to checkpointing in mobile agent systems

Author: Cao J
Wu W
Yang J
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 11/12/2014

Internet and Mobile Computing Lab (in Department of Computing). Refereed conference paper. 2006-2007 > Academic research: refereed > Refereed conference paper. Version of Record. Publishe

PolyU Institutional Repository