Search CORE

27,702 research outputs found

Distributed operating systems

Author: Mullender Sape J.
Publication venue: North-Holland
Publication date: 01/01/1986
Field of study

In the past five years, distributed operating systems research has gone through a consolidation phase. On a large number of design issues there is now considerable consensus between different research groups.\ud \ud In this paper, an overview of recent research in distributed systems is given. In turn, the paper discusses overall system structure, protection issues, file system designs, problems and solutions for fault tolerance and a mechanism that is rapidly becoming very important for efficient distributed systems design: hints.\ud \ud An attempt was made to provide sufficient references to interesting research projects for the reader to find material for more detailed study

University of Twente Research Information

Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication

Author: D'Angelo Gabriele
Ferretti Stefano
Marzolla Moreno
Publication venue: 'Elsevier BV'
Publication date: 01/01/2019
Field of study

This paper presents FT-GAIA, a software-based fault-tolerant parallel and distributed simulation middleware. FT-GAIA has being designed to reliably handle Parallel And Distributed Simulation (PADS) models, which are needed to properly simulate and analyze complex systems arising in any kind of scientific or engineering field. PADS takes advantage of multiple execution units run in multicore processors, cluster of workstations or HPC systems. However, large computing systems, such as HPC systems that include hundreds of thousands of computing nodes, have to handle frequent failures of some components. To cope with this issue, FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some protection against Byzantine failures, since interaction messages among the simulated entities are replicated as well, so that the receiving entity can identify and discard corrupted messages. Results from an analytical model and from an experimental evaluation show that FT-GAIA provides a high degree of fault tolerance, at the cost of a moderate increase in the computational load of the execution units.Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Algorithmic Based Fault Tolerance Applied to High Performance Computing

Author: Bosilca George
Delmas Remi
Dongarra Jack
Langou Julien
Publication venue
Publication date: 01/01/2008
Field of study

We present a new approach to fault tolerance for High Performance Computing system. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flip) on the fly of a computation. To assess the viability of our approach, we have developed a fault tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result while one process failure has happened. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly

arXiv.org e-Print Archive

CiteSeerX

A distributed file service based on optimistic concurrency control

Author: Mullender Sape J.
Tanenbaum Andrew S.
Publication venue: ACM Press
Publication date: 01/01/1985
Field of study

The design of a layered file service for the Amoeba Distributed System is discussed, on top of which various applications can easily be intplemented. The bottom layer is formed by the Amoeba Block Services, responsible for implementing stable storage and repficated, highly available disk blocks. The next layer is formed by the Amoeba File Service which provides version management and concurrency control for tree-structured files. On top of this layer, the appficafions, ranging from databases to source code control systems, determine the structure of the file trees and provide an interface to the users

University of Twente Research Information