19,219 research outputs found

    Measuring the BDARX architecture by agent oriented system a case study

    Get PDF
    Distributed systems are progressively designed as multi-agent systems that are helpful in designing high strength complex industrial software. Recently, distributed systems cooperative applications are openly access, dynamic and large scales. Nowadays, it hardly seems necessary to emphasis on the potential of decentralized software solutions. This is because the main benefit lies in the distributed nature of information, resources and action. On the other hand, the progression in multi agent systems creates new challenges to the traditional methodologies of fault-tolerance that typically relies on centralized and offline solution. Research on multi-agent systems had gained attention for designing software that operates in distributed and open environments, such as the Internet. DARX (Dynamic Agent Replication eXtension) is one of the architecture which aimed at building reliable software that would prove to be both flexible and scalable and also aimed to provide adaptive fault tolerance by using dynamic replication methodologies. Therefore, the enhancement of DARX known as BDARX can provide dynamic solution of byzantine faults for the agent based systems that embedded DARX. The BDARX architecture improves the fault tolerance ability of multi-agent systems in long run and strengthens the software to be more robust against such arbitrary faults. The BDARX provide the solution for the Byzantine fault tolerance in DARX by making replicas on the both sides of communication agents by using BFT protocol for agent systems instead of making replicas only on server end and assuming client as failure free. This paper shows that the dynamic behaviour of agents avoid us from making discrimination between server and client replicas

    Dynamic fault tolerant grid workflow in the water threat management project

    Get PDF
    Achieving fault tolerance is an inevitable problem in distributed systems, with it becoming more challenging in decentralized, heterogeneous, and dynamic-environment systems such as a Grid. When deploying applications requires time-criticality, how to allocate resources for jobs in a fault-tolerant manner is an important issue for the delivery of the services. The Water Threat Management project is a research to find solutions for the contamination incidents problems in urban water distribution systems, and it involves the development of the cyberinfrastructure in a Grid environment. To handle such urgent events properly, the deployment of the system demands real-time processing without the failure. Our approach of integrating a fault-tolerant framework into a Water Threat Management system provides fault tolerance at the queuing stage rather than the job-execution stage by scheduling jobs in fault-tolerant ways. This includes the development of the batch queuing system in the Cyberaide Shell project. In addition, we present a dynamic workflow in the Water Threat Management system that can reduce the queue wait time in the changing environment

    Self-Healing Dilemmas in Distributed Systems: Fault Correction vs. Fault Tolerance

    Get PDF
    Large-scale decentralized systems of autonomous agents interacting via asynchronous communication often experience the following self-healing dilemma: fault detection inherits network uncertainties making a remote faulty process indistinguishable from a slow process. In the case of a slow process without fault, fault correction is undesirable as it can trigger new faults that could be prevented with fault tolerance that is a more proactive system maintenance. But in the case of an actual faulty process, fault tolerance alone without eventually correcting persistent faults can make systems underperforming. Measuring, understanding and resolving such self-healing dilemmas is a timely challenge and critical requirement given the rise of distributed ledgers, edge computing, the Internet of Things in several energy, transport and health applications. This paper contributes a novel and general-purpose modeling of fault scenarios during system runtime. They are used to accurately measure and predict inconsistencies generated by the undesirable outcomes of fault correction and fault tolerance as the means to improve self-healing of large-scale decentralized systems at the design phase. A rigorous experimental methodology is designed that evaluates 696 experimental settings of different fault scales, fault profiles and fault detection thresholds in a prototyped decentralized network of 3000 nodes. Almost 9 million measurements of inconsistencies were collected in a network, where each node monitors the health status of another node, while both can defect. The prediction performance of the modeled fault scenarios is validated in a challenging application scenario of decentralized and dynamic in-network data aggregation using real-world data from a Smart Grid pilot project. Findings confirm the origin of inconsistencies at design phase and provide new insights how to tune self-healing at an early stage. Strikingly, the aggregation accuracy is well predicted as shown by high correlations and low root mean square errors

    Fault tolerance at system level based on RADIC architecture

    Get PDF
    The increasing failure rate in High Performance Computing encourages the investigation of fault tolerance mechanisms to guarantee the execution of an application in spite of node faults. This paper presents an automatic and scalable fault tolerant model designed to be transparent for applications and for message passing libraries. The model consists of detecting failures in the communication socket caused by a faulty node. In those cases, the affected processes are recovered in a healthy node and the connections are reestablished without losing data. The Redundant Array of Distributed Independent Controllers architecture proposes a decentralized model for all the tasks required in a fault tolerance system: protection, detection, recovery and masking. Decentralized algorithms allow the application to scale, which is a key property for current HPC system. Three different rollback recovery protocols are defined and discussed with the aim of offering alternatives to reduce overhead when multicore systems are used. A prototype has been implemented to carry out an exhaustive experimental evaluation through Master/Worker and Single Program Multiple Data execution models. Multiple workloads and an increasing number of processes have been taken into account to compare the above mentioned protocols. The executions take place in two multicore Linux clusters with different socket communications libraries

    A Social Network-Based Peer-To-Peer Model For Resource Discovery

    Get PDF
    Peer-to-Peer (P2P) systems are distributed systems consisting of interconnected nodes which provide scalability, fault tolerance, decentralized coordination, self-organization, anonymity, distributed resources and services sharing, lower cost of ownership and better support for creating ad hoc networks. Data sharing, a subset of resource sharing, is one of the attractive topic in P2P systems. Because of autonomy of the nodes, decentralized coordination and volatility of network caused by the autonomy, data sharing is not an easy task in P2P system. Furthermore, there is no guarantee that a node stays in the network for a specific period of time. Hence, the answers to a particular query may be retrieved from different nodes every time. Moreover, the lack of centralized coordinators makes this process harder. These problems in P2P systems lead to a well known problem which is called resource discovery

    Model Prediction-Based Approach to Fault Tolerant Control with Applications

    Get PDF
    Abstract— Fault-tolerant control (FTC) is an integral component in industrial processes as it enables the system to continue robust operation under some conditions. In this paper, an FTC scheme is proposed for interconnected systems within an integrated design framework to yield a timely monitoring and detection of fault and reconfiguring the controller according to those faults. The unscented Kalman filter (UKF)-based fault detection and diagnosis system is initially run on the main plant and parameter estimation is being done for the local faults. This critical information\ud is shared through information fusion to the main system where the whole system is being decentralized using the overlapping decomposition technique. Using this parameter estimates of decentralized subsystems, a model predictive control (MPC) adjusts its parameters according to the\ud fault scenarios thereby striving to maintain the stability of the system. Experimental results on interconnected continuous time stirred tank reactors (CSTR) with recycle and quadruple tank system indicate that the proposed method is capable to correctly identify various faults, and then controlling the system under some conditions

    A Case Study on Formal Verification of Self-Adaptive Behaviors in a Decentralized System

    Full text link
    Self-adaptation is a promising approach to manage the complexity of modern software systems. A self-adaptive system is able to adapt autonomously to internal dynamics and changing conditions in the environment to achieve particular quality goals. Our particular interest is in decentralized self-adaptive systems, in which central control of adaptation is not an option. One important challenge in self-adaptive systems, in particular those with decentralized control of adaptation, is to provide guarantees about the intended runtime qualities. In this paper, we present a case study in which we use model checking to verify behavioral properties of a decentralized self-adaptive system. Concretely, we contribute with a formalized architecture model of a decentralized traffic monitoring system and prove a number of self-adaptation properties for flexibility and robustness. To model the main processes in the system we use timed automata, and for the specification of the required properties we use timed computation tree logic. We use the Uppaal tool to specify the system and verify the flexibility and robustness properties.Comment: In Proceedings FOCLASA 2012, arXiv:1208.432
    corecore