167,177 research outputs found

Distributed Recovery in Applicative Systems

Author: Keller Robert M.
Lin Frank C. H.
Publication venue: Scholarship @ Claremont
Publication date: 01/08/1986

Applicative systems are promising candidates for achieving high-performance computing through aggregation of processors. This paper studies the fault recovery problems in a class of applicative systems. The concept of functional checkpointing is proposed as the nucleus of a distributed recovery mechanism. This entails incrementally building a resilient structure as the evaluation of an applicative program proceeds. A simple rollback algorithm is suggested to regenerate the corrupted structure by redoing the most effective functional checkpoints. Another algorithm, which attempts to recover intermediate results, is also presented. The parent of a faulty task reproduces a functional twin of the failed task. The regenerated task inherits all offspring of the faulty task so that partial results can be salvaged.
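The rollback idea in this abstract can be illustrated with a minimal sketch. It assumes that a functional checkpoint is simply the pair (function, arguments) retained by the parent, and that tasks are side-effect free, so re-applying the checkpoint yields the same value. The names `spawn`, `TaskFailed`, and the `faults` injection table are hypothetical and are not taken from the paper. The second algorithm, in which the regenerated twin inherits the offspring of the failed task, is not modeled here.

```python
class TaskFailed(Exception):
    """Signals that the (simulated) processor running a task has failed."""


def spawn(func, args, faults, max_retries=3):
    """Run func(*args) as a child task, regenerating it if it fails.

    `faults` maps a function name to the number of failures to inject
    (a hypothetical stand-in for real processor faults).
    """
    checkpoint = (func, args)  # functional checkpoint held by the parent
    for _ in range(max_retries + 1):
        f, a = checkpoint
        try:
            if faults.get(f.__name__, 0) > 0:
                faults[f.__name__] -= 1
                raise TaskFailed(f.__name__)
            return f(*a)
        except TaskFailed:
            continue  # the parent re-creates the child from its checkpoint
    raise TaskFailed(f"{func.__name__} failed {max_retries + 1} times")


def square(x):
    return x * x


def sum_of_squares(xs, faults):
    # Parent task: spawns one child per element and combines the results.
    return sum(spawn(square, (x,), faults) for x in xs)


if __name__ == "__main__":
    # Two injected failures are absorbed by regeneration.
    print(sum_of_squares([1, 2, 3], {"square": 2}))
```

Because the child is a pure function of its arguments, the parent needs no snapshot of the child's state; keeping the application itself is enough to redo the work.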

Scholarship@Claremont

Wait-Free Global Virtual Time Computation in Shared Memory Time-Warp Systems

Author: Pellegrini Alessandro
Quaglia Francesco
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2014

Global Virtual Time (GVT) is a powerful abstraction used to discriminate what events belong (and what do not belong) to the past history of a parallel/distributed computation. For high-performance simulation systems based on the Time Warp synchronization protocol, where concurrent simulation objects are allowed to process their events speculatively and causal consistency is achieved via rollback/recovery techniques, GVT is used to determine which portion of the simulation can be considered as committed. Hence it is the base for actuating memory recovery (e.g. of obsolete logs that were taken in order to support state recoverability) and nonrevocable operations (e.g. I/O). For shared memory implementations of simulation platforms based on the Time Warp protocol, the reference GVT algorithm is the one presented by Fujimoto and Hybinette [1]. However, this algorithm relies on critical sections that make it non-wait-free, and which can hamper scalability. In this article we present a wait-free shared memory GVT algorithm that requires no critical section. Rather, correct coordination across the processes while computing the GVT value is achieved via atomic memory operations, namely compare-and-swap. The price paid by our proposal is an increase in the number of GVT computation phases, as opposed to the single phase required by the proposal in [1]. However, as we show via the results of an experimental study, the wait-free nature of the phases carried out in our GVT algorithm pays off in reducing the actual cost incurred by the proposal in [1].
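As a rough illustration of what a GVT value is and how it drives memory recovery, the sketch below computes GVT as the minimum over the local virtual times of all processes and the timestamps of in-transit messages, then discards obsolete checkpoints. It is a sequential sketch only and does not reproduce the wait-free, compare-and-swap based coordination described in the abstract. The function names and the retention convention (keep the newest checkpoint strictly older than GVT as a rollback base) are assumptions made for this example.

```python
from bisect import bisect_left


def compute_gvt(local_virtual_times, in_transit_timestamps=()):
    """GVT is a lower bound on any timestamp a future rollback can reach:
    the minimum over all local clocks and all messages still in flight."""
    return min([*local_virtual_times, *in_transit_timestamps])


def fossil_collect(checkpoint_times, gvt):
    """Return the checkpoint timestamps that must be retained.

    `checkpoint_times` must be sorted in ascending order. Everything at or
    after GVT is kept, plus the newest checkpoint strictly before GVT, which
    serves as the base state for a rollback to GVT itself (assumed convention).
    """
    first_at_or_after = bisect_left(checkpoint_times, gvt)
    keep_from = max(first_at_or_after - 1, 0)
    return checkpoint_times[keep_from:]


if __name__ == "__main__":
    gvt = compute_gvt([12.0, 7.5, 9.0], in_transit_timestamps=[6.0])
    print(gvt)
    print(fossil_collect([1, 3, 5, 8, 10], gvt))
```

Note that the in-transit message at time 6.0 holds GVT below every local clock: ignoring messages in flight would let a process commit state that a late message could still invalidate.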

Crossref

ART

Archivio della ricerca- Università di Roma La Sapienza

A Decentralized Bayesian Algorithm for Distributed Compressive Sensing in Networked Sensing Systems

Author: Chen W
Wassell IJ
Publication venue: IEEE Transactions on Wireless Communications
Publication date: 01/01/2016

Compressive sensing (CS), as a new sensing/sampling paradigm, facilitates signal acquisition by reducing the number of samples required for reconstruction of the original signal, and thus appears to be a promising technique for applications where the sampling cost is high, e.g., the Nyquist rate exceeds the current capabilities of analog-to-digital converters (ADCs). Conventional CS, although effective for dealing with one signal, only leverages the intra-signal correlation for reconstruction. This paper develops a decentralized Bayesian reconstruction algorithm for networked sensing systems to jointly reconstruct multiple signals based on the distributed compressive sensing (DCS) model that exploits both intra- and inter-signal correlations. The proposed approach is able to address networked sensing system applications with privacy concerns and/or for a fusion-centre-free scenario, where centralized approaches fail. Simulation results demonstrate that the proposed decentralized approaches have good recovery performance and converge reasonably quickly. This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/TWC.2015.248798

Apollo (Cambridge)

Improving Performance of Iterative Methods by Lossy Checkpointing

Author: Acosta J. Mora
Agullo E.
Balay S.
Barrett R.
Barrett R.
Bode B.
Calhoun J.
Heath M. T.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 28/05/2018

Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks in parallel, they have to checkpoint the dynamic variables periodically in case of unavoidable fail-stop errors, requiring fast I/O systems and large storage space. To this end, significantly reducing the checkpointing overhead is critical to improving the overall performance of iterative methods. Our contribution is fourfold. (1) We propose a novel lossy checkpointing scheme that can significantly improve the checkpointing performance of iterative methods by leveraging lossy compressors. (2) We formulate a lossy checkpointing performance model and derive theoretically an upper bound for the extra number of iterations caused by the distortion of data in lossy checkpoints, in order to guarantee the performance improvement under the lossy checkpointing scheme. (3) We analyze the impact of lossy checkpointing (i.e., extra number of iterations caused by lossy checkpointing files) for multiple types of iterative methods. (4) We evaluate the lossy checkpointing scheme with optimal checkpointing intervals on a high-performance computing environment with 2,048 cores, using a well-known scientific computation package PETSc and a state-of-the-art checkpoint/restart toolkit. Experiments show that our optimized lossy checkpointing scheme can significantly reduce the fault tolerance overhead for iterative methods by 23%~70% compared with traditional checkpointing and 20%~58% compared with lossless-compressed checkpointing, in the presence of system failures. Comment: 14 pages, 10 figures, HPDC'1
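The trade-off described in this abstract (smaller checkpoints in exchange for a few extra iterations after a restart) can be seen on a toy problem. The sketch below is a pure-Python illustration, not the authors' scheme: it uses Jacobi iteration on a 2x2 diagonally dominant system, and rounding to two decimal places stands in for a real lossy compressor. The function names, the system, and the checkpoint interval are all assumptions chosen for the example.

```python
def jacobi_step(A, b, x):
    """One Jacobi sweep: solve each equation for its diagonal unknown."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def residual(A, b, x):
    """Infinity norm of b - A x."""
    n = len(b)
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n))


def solve(A, b, x, tol=1e-10, max_iters=10_000):
    """Iterate from x until the residual drops below tol; return (x, iterations)."""
    iters = 0
    while residual(A, b, x) > tol and iters < max_iters:
        x = jacobi_step(A, b, x)
        iters += 1
    return x, iters


def lossy_checkpoint(x, digits=2):
    """Stand-in for a lossy compressor: keep only `digits` decimal places."""
    return [round(v, digits) for v in x]


def restart_costs(A, b, iters_before_failure=10):
    """Run a few sweeps, then compare restarting from exact and lossy state."""
    x = [0.0] * len(b)
    for _ in range(iters_before_failure):
        x = jacobi_step(A, b, x)
    x_exact, exact_iters = solve(A, b, list(x))
    x_lossy, lossy_iters = solve(A, b, lossy_checkpoint(x))
    return x_exact, exact_iters, x_lossy, lossy_iters


if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    x_exact, exact_iters, x_lossy, lossy_iters = restart_costs(A, b)
    print("restart from exact checkpoint:", exact_iters, "iterations")
    print("restart from lossy checkpoint:", lossy_iters, "iterations")
```

Both restarts converge to the same solution because a convergent iteration tolerates a perturbed starting point; the distortion in the checkpoint only costs additional sweeps, which is the quantity the abstract says the paper bounds.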

arXiv.org e-Print Archive

Crossref

Missouri University of Science and Technology (Missouri S&T): Scholars' Mine