136 research outputs found

    Designing SSI clusters with hierarchical checkpointing and single I/O space

    Adopting a new hierarchical checkpointing architecture, the authors develop a single I/O address space for building highly available clusters of computers. They propose a systematic approach to achieving a single system image by integrating existing middleware support with the newly developed features.
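
    The paper's hierarchical scheme is not reproduced here, but the general idea of multi-level checkpointing can be sketched: cheap, frequent checkpoints to node-local storage plus occasional checkpoints to cluster-wide storage that survive the loss of a node. The directories, intervals, and state layout below are illustrative assumptions, not taken from the paper.

```python
import os
import pickle

LOCAL_DIR = "/tmp/ckpt-local"    # hypothetical node-local checkpoint area
GLOBAL_DIR = "/tmp/ckpt-global"  # hypothetical cluster-wide (shared) area
LOCAL_EVERY = 5                  # iterations between cheap local checkpoints
GLOBAL_EVERY = 20                # iterations between expensive global checkpoints


def save_state(state, directory, step):
    """Write a pickled snapshot of the application state."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"ckpt-{step:06d}.pkl")
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path


def run(total_steps=100):
    state = {"step": 0, "partial_sum": 0}
    for step in range(1, total_steps + 1):
        state["step"] = step
        state["partial_sum"] += step          # stand-in for real computation
        if step % LOCAL_EVERY == 0:           # frequent: survives a process crash
            save_state(state, LOCAL_DIR, step)
        if step % GLOBAL_EVERY == 0:          # infrequent: survives losing the node
            save_state(state, GLOBAL_DIR, step)
    return state


if __name__ == "__main__":
    print(run())
```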

    Network Multicomputing Using Recoverable Distributed Shared Memory

    A network multicomputer is a multiprocessor in which the processors are connected by general-purpose networking technology, in contrast to current distributed memory multiprocessors where a dedicated special-purpose interconnect is used. The advent of high-speed general-purpose networks provides the impetus for a new look at the network multiprocessor model, by removing the bottleneck of current slow networks. However, major software issues remain unsolved. It is pointed out that a convenient machine abstraction must be developed that hides from the application programmer low-level details such as message passing or machine failures. Use is made of distributed shared memory as a programming abstraction, and rollback recovery through consistent checkpointing to provide fault tolerance. Measurements of the authors' implementations of distributed shared memory and consistent checkpointing show that these abstractions can be implemented efficiently.
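
    The authors' recoverable DSM implementation is not shown in this listing; as a rough, single-process illustration of rollback recovery through checkpointing, the sketch below periodically saves state and, on an injected failure, rolls back to the last checkpoint and redoes the lost work. The file path, failure probability, and workload are assumptions; a real system would additionally coordinate checkpoints across processes so the saved global state is consistent.

```python
import pickle
import random

CKPT_FILE = "/tmp/dsm-demo.ckpt"   # hypothetical checkpoint location


def checkpoint(state):
    """Save a snapshot of the (local view of) shared state."""
    with open(CKPT_FILE, "wb") as f:
        pickle.dump(state, f)


def restore():
    """Roll back to the most recent checkpoint."""
    with open(CKPT_FILE, "rb") as f:
        return pickle.load(f)


def run(total_steps=50, ckpt_every=10, fail_prob=0.05):
    state = {"step": 0, "total": 0}
    checkpoint(state)                       # initial checkpoint
    step = 0
    while step < total_steps:
        step += 1
        state["step"] = step
        state["total"] += step              # stand-in for shared-memory work
        if random.random() < fail_prob:     # injected "machine failure"
            state = restore()               # roll back to the last checkpoint
            step = state["step"]            # resume from the checkpointed step
            continue
        if step % ckpt_every == 0:
            checkpoint(state)
    return state


if __name__ == "__main__":
    print(run())
```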

    Analysis of checkpointing schemes for multiprocessor systems

    Parallel computing systems provide hardware redundancy that helps to achieve low-cost fault tolerance by duplicating a task onto more than one processor and comparing the states of the processors at checkpoints. This paper suggests a novel technique, based on a Markov Reward Model (MRM), for analyzing the performance of checkpointing schemes with task duplication. We show how this technique can be used to derive the average execution time of a task and other important parameters related to the performance of checkpointing schemes. Our analytical results match well with the values we obtained using a simulation program. We compare the average task execution time and total work of four checkpointing schemes, and show that increasing the number of processors generally reduces the average execution time but increases the total work done by the processors. However, in cases where there is a large difference between the times required to perform different operations, these results can change.
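
    The paper's Markov Reward Model is not reproduced here. As a simpler worked illustration of the quantity being analyzed (average execution time under checkpointing), the sketch below uses the classic single-processor renewal model with Poisson failures and compares a brute-force search over checkpoint intervals against Young's approximation tau = sqrt(2C/lambda); all parameter values are made up and are not from the paper.

```python
import math

# Hypothetical parameters (not taken from the paper):
LAMBDA = 1e-4   # failure rate per second (MTBF = 10,000 s)
C = 30.0        # time to write one checkpoint, seconds
R = 60.0        # time to restart after a failure, seconds
W = 100_000.0   # total useful work, seconds


def expected_runtime(tau):
    """Expected completion time when checkpointing every `tau` seconds of work.

    Classic renewal argument: a segment of length tau + C must run
    failure-free, and the expected time to achieve that under Poisson
    failures is (1/lambda + R) * (exp(lambda * (tau + C)) - 1).
    """
    segments = math.ceil(W / tau)
    per_segment = (1.0 / LAMBDA + R) * (math.exp(LAMBDA * (tau + C)) - 1.0)
    return segments * per_segment


if __name__ == "__main__":
    # Young's approximation for a near-optimal checkpoint interval.
    tau_young = math.sqrt(2.0 * C / LAMBDA)
    # Brute-force search over candidate intervals for comparison.
    best_tau = min((tau for tau in range(60, 10_000, 60)), key=expected_runtime)
    print(f"Young's interval: {tau_young:.0f} s, "
          f"E[T] = {expected_runtime(tau_young) / 3600:.1f} h")
    print(f"Searched optimum: {best_tau} s, "
          f"E[T] = {expected_runtime(best_tau) / 3600:.1f} h")
```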

    Middleware Fault Tolerance Support for the BOSS Embedded Operating System

    Critical embedded systems need a dependable operating system and application. Despite all efforts to prevent and remove faults during system development, residual software faults usually persist. Therefore, critical systems need some form of fault tolerance to deal with these faults, as well as with hardware faults, at operation time. This work proposes fault-tolerance support mechanisms for the BOSS embedded operating system, based on the application of proven fault tolerance strategies by middleware control software that transparently delivers the added functionality to the application software. Special attention is paid to complexity control and resource constraints, targeting the needs of the embedded market.

    Common spaceborne multicomputer operating system and development environment

    A preliminary technical specification for a multicomputer operating system is developed. The operating system is targeted for spaceborne flight missions and provides a broad range of real-time functionality, dynamic remote code-patching capability, and system fault tolerance and long-term survivability features. Dataflow concepts are used for representing application algorithms. Functional features are included to ensure real-time predictability for a class of algorithms which require data-driven execution on an iterative steady-state basis. The development environment supports the development of algorithm code, design of control parameters, performance analysis, simulation of real-time dataflow applications, and compiling and downloading of the resulting application.
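
    As a minimal illustration of data-driven execution, the sketch below fires a node as soon as all of its input tokens are available; the graph, node names, and firing loop are made up for illustration and are not the specification's execution model.

```python
class Node:
    """A dataflow node that fires once all of its inputs are available."""

    def __init__(self, name, func, inputs):
        self.name = name
        self.func = func
        self.inputs = inputs          # names of upstream nodes


def run(graph, sources):
    """One data-driven pass: fire every node whose inputs have all arrived."""
    values = dict(sources)            # node name -> produced token
    progress = True
    while progress:
        progress = False
        for node in graph:
            if node.name in values:
                continue              # node has already fired
            if all(i in values for i in node.inputs):
                values[node.name] = node.func(*[values[i] for i in node.inputs])
                progress = True
    return values


if __name__ == "__main__":
    graph = [
        Node("sum", lambda a, b: a + b, ["sensor_a", "sensor_b"]),
        Node("scaled", lambda s: 0.5 * s, ["sum"]),
    ]
    print(run(graph, {"sensor_a": 3.0, "sensor_b": 5.0}))
```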

    Reparallelization and Migration of OpenMP Programs

    Typical computational grid users target only a single cluster and have to estimate the runtime of their jobs. Job schedulers prefer short-running jobs to maintain high system utilization. If the user underestimates the runtime, premature termination causes computation loss; overestimation is penalized by long queue times. As a solution, we present an automatic reparallelization and migration of OpenMP applications. A reparallelization is dynamically computed for an OpenMP work distribution when the number of CPUs changes. The application can be migrated between clusters when an allocated time slice is exceeded. Migration is based on a coordinated, heterogeneous checkpointing algorithm. Both reparallelization and migration enable the user to freely use computing time at more than a single point of the grid. Our demo applications successfully adapt to the changed CPU setting and smoothly migrate between, for example, clusters in Erlangen, Germany, and Amsterdam, the Netherlands, that use different processors. Benchmarks show that reparallelization and migration impose average overheads of about 4% and 2%.
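
    The paper's migration protocol and checkpoint format are not shown in the abstract; the sketch below only illustrates the arithmetic of recomputing a contiguous block distribution (similar to OpenMP's schedule(static)) when the worker count changes after a checkpoint. The iteration counts and checkpoint position are made up.

```python
def block_distribution(total_iters, num_workers):
    """Split `total_iters` loop iterations into contiguous, near-equal blocks,
    one per worker (a static block schedule)."""
    base, extra = divmod(total_iters, num_workers)
    bounds = []
    start = 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)
        bounds.append((start, start + size))   # [start, end) for worker w
        start += size
    return bounds


if __name__ == "__main__":
    # Hypothetical loop of 1000 iterations, checkpointed after iteration 400.
    completed = 400
    remaining = 1000 - completed
    print("8 CPUs before migration:", block_distribution(remaining, 8))
    # After migrating to a cluster with 6 CPUs, the remaining iterations are
    # simply redistributed over the new worker count.
    print("6 CPUs after migration: ", block_distribution(remaining, 6))
```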

    Snowflake: Spanning administrative domains

    Many distributed systems provide a "single-system image" to their users, so the user has the illusion of using a single system when in fact they are using many distributed resources. It is a powerful abstraction that helps users manage the complexity of using distributed resources. The goal of the Snowflake project is to discover how single-system images can be made to span administrative domains. Our current prototype organizes resources in namespaces and distributes them using Java Remote Method Invocation. Challenging issues include how much flexibility should be built into the namespace interface, and how transparent the network and persistent storage should be. We outline future work on making Snowflake administrator-friendly.
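
    Snowflake itself distributes its namespaces with Java Remote Method Invocation; the sketch below is only a local, toy illustration of hierarchical name resolution that delegates lookups across namespace (domain) boundaries. The domain and resource names are made up.

```python
class Namespace:
    """A toy hierarchical namespace: names map to resources or to other
    namespaces, possibly owned by a different administrative domain."""

    def __init__(self, domain):
        self.domain = domain
        self.entries = {}

    def bind(self, name, obj):
        self.entries[name] = obj

    def lookup(self, path):
        """Resolve a /-separated path, delegating across namespace boundaries."""
        head, _, rest = path.strip("/").partition("/")
        target = self.entries[head]
        if rest:
            return target.lookup(rest)    # delegate to the mounted namespace
        return target


if __name__ == "__main__":
    home = Namespace("alpha.example.org")      # hypothetical domains
    work = Namespace("beta.example.org")
    work.bind("printer", "color printer on beta.example.org")
    home.bind("work", work)                    # mount the other domain's namespace
    print(home.lookup("/work/printer"))
```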