27,702 research outputs found
Distributed operating systems
In the past five years, distributed operating systems research has gone through a consolidation phase. On a large number of design issues there is now considerable consensus between different research groups.\ud
\ud
In this paper, an overview of recent research in distributed systems is given. In turn, the paper discusses overall system structure, protection issues, file system designs, problems and solutions for fault tolerance and a mechanism that is rapidly becoming very important for efficient distributed systems design: hints.\ud
\ud
An attempt was made to provide sufficient references to interesting research projects for the reader to find material for more detailed study
Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication
This paper presents FT-GAIA, a software-based fault-tolerant parallel and
distributed simulation middleware. FT-GAIA has being designed to reliably
handle Parallel And Distributed Simulation (PADS) models, which are needed to
properly simulate and analyze complex systems arising in any kind of scientific
or engineering field. PADS takes advantage of multiple execution units run in
multicore processors, cluster of workstations or HPC systems. However, large
computing systems, such as HPC systems that include hundreds of thousands of
computing nodes, have to handle frequent failures of some components. To cope
with this issue, FT-GAIA transparently replicates simulation entities and
distributes them on multiple execution nodes. This allows the simulation to
tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some
protection against Byzantine failures, since interaction messages among the
simulated entities are replicated as well, so that the receiving entity can
identify and discard corrupted messages. Results from an analytical model and
from an experimental evaluation show that FT-GAIA provides a high degree of
fault tolerance, at the cost of a moderate increase in the computational load
of the execution units.Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731
Algorithmic Based Fault Tolerance Applied to High Performance Computing
We present a new approach to fault tolerance for High Performance Computing
system. Our approach is based on a careful adaptation of the Algorithmic Based
Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel
distributed computation. We obtain a strongly scalable mechanism for fault
tolerance. We can also detect and correct errors (bit-flip) on the fly of a
computation. To assess the viability of our approach, we have developed a fault
tolerant matrix-matrix multiplication subroutine and we propose some models to
predict its running time. Our parallel fault-tolerant matrix-matrix
multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov)
and returns a correct result while one process failure has happened. This
represents 65% of the machine peak efficiency and less than 12% overhead with
respect to the fastest failure-free implementation. We predict (and have
observed) that, as we increase the processor count, the overhead of the fault
tolerance drops significantly
A distributed file service based on optimistic concurrency control
The design of a layered file service for the Amoeba Distributed System is discussed, on top of which various applications can easily be intplemented. The bottom layer is formed by the Amoeba Block Services, responsible for implementing stable storage and repficated, highly available disk blocks. The next layer is formed by the Amoeba File Service which provides version management and concurrency control for tree-structured files. On top of this layer, the appficafions, ranging from databases to source code control systems, determine the structure of the file trees and provide an interface to the users
- …