CIC : an integrated approach to checkpointing in mobile agent systems
Internet and Mobile Computing Lab (Department of Computing). Refereed conference paper, 2006-2007. Version of Record, Published.
An Optimizing Java Translation Framework for Automated Checkpointing and Strong Mobility
Long-running programs, e.g., in high-performance computing, need to
write periodic checkpoints of their execution state to disk to allow
them to recover from node failure. Manually adding checkpointing code
to an application, however, is very tedious. The mechanisms needed
for writing the execution state of a program to disk and restoring it
are similar to those needed for migrating a running thread or a mobile
object. We have extended a source-to-source translation scheme that
allows the migration of mobile Java objects with running threads to
make it more general and allow it to be used for automated
checkpointing. Our translation scheme allows serializable threads to
be written to disk or migrated with a mobile agent to a remote
machine. The translator generates code that maintains a serializable
run-time stack for each thread as a Java data structure. While this
results in significant run-time overhead, it allows the checkpointing
code to be generated automatically. We improved the locking mechanism
that is needed to protect the run-time stack as well as the translation
scheme. Our experimental results demonstrate a speedup of the
generated code over the original translator and show that the approach
is feasible in practice.
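The idea of keeping the run-time stack as an ordinary, serializable data structure can be illustrated with a minimal sketch. It is written in Python rather than Java and is not the translator's generated code; the frame layout (`n`, `pc`, `a`), the function names, and the use of `pickle` are all assumptions made for illustration. A recursive Fibonacci computation is rewritten so that each activation record is a plain dictionary on an explicit stack, which lets the computation be paused, serialized, restored, and resumed.

```python
import pickle


def new_frame(n):
    # Hypothetical activation record: argument, resume point, saved partial result.
    return {"n": n, "pc": 0, "a": 0}


def run(state, max_steps=None):
    """Advance the computation; return True once the stack is empty (finished)."""
    stack = state["stack"]
    steps = 0
    while stack:
        if max_steps is not None and steps >= max_steps:
            return False  # paused mid-computation; state is still consistent
        frame = stack[-1]
        if frame["pc"] == 0:
            if frame["n"] < 2:
                state["ret"] = frame["n"]  # base case: return n
                stack.pop()
            else:
                frame["pc"] = 1  # resume here after fib(n - 1) returns
                stack.append(new_frame(frame["n"] - 1))
        elif frame["pc"] == 1:
            frame["a"] = state["ret"]  # save fib(n - 1)
            frame["pc"] = 2  # resume here after fib(n - 2) returns
            stack.append(new_frame(frame["n"] - 2))
        else:
            state["ret"] = frame["a"] + state["ret"]
            stack.pop()
        steps += 1
    return True


def checkpoint(state):
    # The bytes could equally be written to disk or sent to another machine.
    return pickle.dumps(state)


def restore(blob):
    return pickle.loads(blob)


def fib_with_checkpoint(n, steps_before_checkpoint):
    state = {"stack": [new_frame(n)], "ret": None}
    if not run(state, steps_before_checkpoint):
        state = restore(checkpoint(state))  # simulate failure and recovery
        run(state)
    return state["ret"]
```

Because every frame is explicit data, the extra bookkeeping on each call and return is visible in the code, which is consistent with the run-time overhead the abstract mentions.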
Checkpoint placement algorithms for mobile agent system
Refereed conference paper, 2006-2007. Version of Record, Published.
Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication
This paper presents FT-GAIA, a software-based fault-tolerant parallel and
distributed simulation middleware. FT-GAIA has been designed to reliably
handle Parallel And Distributed Simulation (PADS) models, which are needed to
properly simulate and analyze complex systems arising in any kind of scientific
or engineering field. PADS takes advantage of multiple execution units run in
multicore processors, clusters of workstations or HPC systems. However, large
computing systems, such as HPC systems that include hundreds of thousands of
computing nodes, have to handle frequent failures of some components. To cope
with this issue, FT-GAIA transparently replicates simulation entities and
distributes them on multiple execution nodes. This allows the simulation to
tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some
protection against Byzantine failures, since interaction messages among the
simulated entities are replicated as well, so that the receiving entity can
identify and discard corrupted messages. Results from an analytical model and
from an experimental evaluation show that FT-GAIA provides a high degree of
fault tolerance, at the cost of a moderate increase in the computational load
of the execution units.
Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731
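The abstract does not say how a receiving entity identifies corrupted messages among the replicated copies. One common technique is majority voting, sketched below purely as an assumption; the function name `vote` and the string payloads are hypothetical and are not taken from FT-GAIA.

```python
from collections import Counter


def vote(copies):
    """Return the payload sent by a strict majority of replicas, or None.

    `copies` holds the payloads received from the replicas of one sender.
    A crashed replica simply contributes no copy; a corrupted copy is
    outvoted as long as correct copies form a strict majority.
    """
    if not copies:
        return None
    payload, count = Counter(copies).most_common(1)[0]
    return payload if count > len(copies) // 2 else None
```

With three replicas, `vote(["move:3", "move:3", "move:9"])` accepts `"move:3"` and discards the deviating copy, while a two-way split such as `vote(["a", "b"])` yields no decision.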
An Analysis of Failure Handling in Chameleon, A Framework for Supporting Cost-Effective Fault Tolerant Services
The desire for low-cost reliable computing is increasing. Most current fault-tolerant computing solutions are not very flexible, i.e., they cannot adapt to the reliability requirements of newly emerging applications in business, commerce, and manufacturing. It is important that users have a flexible, reliable platform to support both critical and noncritical applications. Chameleon, under development at the Center for Reliable and High-Performance Computing at the University of Illinois, is a software framework for supporting cost-effective, adaptable, networked fault-tolerant services. This thesis details a simulation of fault injection, detection, and recovery in Chameleon. The simulation was written in C++ using the DEPEND simulation library. The results obtained from the simulation included the amount of overhead incurred by the fault detection and recovery mechanisms supported by Chameleon. In addition, the simulation yielded information about fault scenarios from which Chameleon cannot recover. The results showed that both critical and noncritical applications can be executed in the Chameleon environment with a fairly small amount of overhead. No single point of failure from which Chameleon could not recover was found. Chameleon was also found to be capable of recovering from several multiple-failure scenarios.
Proxy Module for System on Mobile Devices (SyD) Middleware
The number of mobile device users is growing. Users expect to communicate continuously on their mobile devices while on the move. Mobile device platforms therefore need to tolerate disconnections during transactions and to manage the subsequent synchronization. System on Mobile Devices (SyD) is taken as an example of such a platform. The thesis studies the existing SyD architecture, from its framework down to its kernel, and introduces a proxy module enhancement in SyD to handle disconnection tolerance, including synchronization. The SyD kernel has been extended to enable the proxy module. SyDSync has been constructed for synchronization with the proxy. The timeout has been studied for seamless proxy invocation. A Camera application that tries to catch a stolen vehicle has been simulated to demonstrate practical use of the proxy module extension.