12,430 research outputs found
Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication
This paper presents FT-GAIA, a software-based fault-tolerant parallel and
distributed simulation middleware. FT-GAIA has being designed to reliably
handle Parallel And Distributed Simulation (PADS) models, which are needed to
properly simulate and analyze complex systems arising in any kind of scientific
or engineering field. PADS takes advantage of multiple execution units run in
multicore processors, cluster of workstations or HPC systems. However, large
computing systems, such as HPC systems that include hundreds of thousands of
computing nodes, have to handle frequent failures of some components. To cope
with this issue, FT-GAIA transparently replicates simulation entities and
distributes them on multiple execution nodes. This allows the simulation to
tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some
protection against Byzantine failures, since interaction messages among the
simulated entities are replicated as well, so that the receiving entity can
identify and discard corrupted messages. Results from an analytical model and
from an experimental evaluation show that FT-GAIA provides a high degree of
fault tolerance, at the cost of a moderate increase in the computational load
of the execution units.Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731
RELEASE: A High-level Paradigm for Reliable Large-scale Server Software
Erlang is a functional language with a much-emulated model for building reliable distributed systems. This paper outlines the RELEASE project, and describes the progress in the first six months. The project aim is to scale the Erlang’s radical concurrency-oriented programming paradigm to build reliable general-purpose software, such as server-based systems, on massively parallel machines. Currently Erlang has inherently scalable computation and reliability models, but in practice scalability is constrained by aspects of the language and virtual machine. We are working at three levels to address these challenges: evolving the Erlang virtual machine so that it can work effectively on large scale multicore systems; evolving the language to Scalable Distributed (SD) Erlang; developing a scalable Erlang infrastructure to integrate multiple, heterogeneous clusters. We are also developing state of the art tools that allow programmers to understand the behaviour of massively parallel SD Erlang programs. We will demonstrate the effectiveness of the RELEASE approach using demonstrators and two large case studies on a Blue Gene
Continuous maintenance and the future – Foundations and technological challenges
High value and long life products require continuous maintenance throughout their life cycle to achieve required performance with optimum through-life cost. This paper presents foundations and technologies required to offer the maintenance service. Component and system level degradation science, assessment and modelling along with life cycle ‘big data’ analytics are the two most important knowledge and skill base required for the continuous maintenance. Advanced computing and visualisation technologies will improve efficiency of the maintenance and reduce through-life cost of the product. Future of continuous maintenance within the Industry 4.0 context also identifies the role of IoT, standards and cyber security
Fail Over Strategy for Fault Tolerance in Cloud Computing Environment
YesCloud fault tolerance is an important issue in cloud computing platforms and applications. In the event of an unexpected
system failure or malfunction, a robust fault-tolerant design may allow the cloud to continue functioning correctly
possibly at a reduced level instead of failing completely. To ensure high availability of critical cloud services, the
application execution and hardware performance, various fault tolerant techniques exist for building self-autonomous
cloud systems. In comparison to current approaches, this paper proposes a more robust and reliable architecture using
optimal checkpointing strategy to ensure high system availability and reduced system task service finish time. Using
pass rates and virtualised mechanisms, the proposed Smart Failover Strategy (SFS) scheme uses components such as
Cloud fault manager, Cloud controller, Cloud load balancer and a selection mechanism, providing fault tolerance via
redundancy, optimized selection and checkpointing. In our approach, the Cloud fault manager repairs faults generated
before the task time deadline is reached, blocking unrecoverable faulty nodes as well as their virtual nodes. This scheme
is also able to remove temporary software faults from recoverable faulty nodes, thereby making them available for future
request. We argue that the proposed SFS algorithm makes the system highly fault tolerant by considering forward and
backward recovery using diverse software tools. Compared to existing approaches, preliminary experiment of the SFS
algorithm indicate an increase in pass rates and a consequent decrease in failure rates, showing an overall good
performance in task allocations. We present these results using experimental validation tools with comparison to other
techniques, laying a foundation for a fully fault tolerant IaaS Cloud environment
Performance modeling of fault-tolerant circuit-switched communication networks
Circuit switching (CS) has been suggested as an efficient switching method for supporting simultaneous communications (such as data, voice, and images) across parallel systems due to its ability to preserve both communication performance and fault-tolerant demands in such systems. In this paper we present an efficient scheme to capture the mean message latency in 2D torus with CS in the presence of faulty components. We have also conducted extensive simulation experiments, the results of which are used to validate the analytical mode
Recommended from our members
Optimising Fault Tolerance in Real-time Cloud Computing IaaS Environment
YesFault tolerance is the ability of a system to respond
swiftly to an unexpected failure. Failures in a cloud computing
environment are normal rather than exceptional, but fault
detection and system recovery in a real time cloud system is a
crucial issue. To deal with this problem and to minimize the risk
of failure, an optimal fault tolerance mechanism was introduced
where fault tolerance was achieved using the combination of the
Cloud Master, Compute nodes, Cloud load balancer, Selection
mechanism and Cloud Fault handler. In this paper, we proposed
an optimized fault tolerance approach where a model is designed
to tolerate faults based on the reliability of each compute node
(virtual machine) and can be replaced if the performance is not
optimal. Preliminary test of our algorithm indicates that the rate
of increase in pass rate exceeds the decrease in failure rate and it
also considers forward and backward recovery using diverse
software tools. Our results obtained are demonstrated through
experimental validation thereby laying a foundation for a fully
fault tolerant IaaS Cloud environment, which suggests a good
performance of our model compared to current existing
approaches.Petroleum Technology Development Fund (PTDF
- …