Search CORE

17,005 research outputs found

A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

Author: Treaster Michael
Publication date: 31/12/2004

Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques. Comment: 11 pages

arXiv.org e-Print Archive

CiteSeerX

Software engineering and middleware: a roadmap (Invited talk)

Author: Emmerich W.
Publication venue: ACM Press
Publication date: 01/01/2000

The construction of a large class of distributed systems can be simplified by leveraging middleware, which is layered between network operating systems and application components. Middleware resolves heterogeneity and facilitates communication and coordination of distributed components. Existing middleware products enable software engineers to build systems that are distributed across a local-area network. State-of-the-art middleware research aims to push this boundary towards Internet-scale distribution, adaptive and reconfigurable middleware and middleware for dependable and wireless systems. The challenge for software engineering research is to devise notations, techniques, methods and tools for distributed system construction that systematically build and exploit the capabilities that middleware deliver.

UCL Discovery

A hierarchical distributed control model for coordinating intelligent systems

Author: Adler Richard M.

A hierarchical distributed control (HDC) model for coordinating cooperative problem-solving among intelligent systems is described. The model was implemented using SOCIAL, an innovative object-oriented tool for integrating heterogeneous, distributed software systems. SOCIAL embeds applications in 'wrapper' objects called Agents, which supply predefined capabilities for distributed communication, control, data specification, and translation. The HDC model is realized in SOCIAL as a 'Manager' Agent that coordinates interactions among application Agents. The HDC Manager: indexes the capabilities of application Agents; routes request messages to suitable server Agents; and stores results in a commonly accessible 'Bulletin-Board'. This centralized control model is illustrated in a fault diagnosis application for launch operations support of the Space Shuttle fleet at NASA, Kennedy Space Center.

NASA Technical Reports Server

Coordinating complex problem-solving among distributed intelligent agents

Author: Adler Richard M.

A process-oriented control model is described for distributed problem solving. The model coordinates the transfer and manipulation of information across independent networked applications, both intelligent and conventional. The model was implemented using SOCIAL, a set of object-oriented tools for distributing computing. Complex sequences of distributed tasks are specified in terms of high level scripts. Scripts are executed by SOCIAL objects called Manager Agents, which realize an intelligent coordination model that routes individual tasks to suitable server applications across the network. These tools are illustrated in a prototype distributed system for decision support of ground operations for NASA's Space Shuttle fleet.

NASA Technical Reports Server

Correct and Control Complex IoT Systems: Evaluation of a Classification for System Anomalies

Author: Heisse Stefan, Niedermaier Sina, Wagner Stefan
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 10/05/2020

In practice there are deficiencies in precise interteam communications about system anomalies to perform troubleshooting and postmortem analysis along different teams operating complex IoT systems. We evaluate the quality in use of an adaptation of IEEE Std. 1044-2009 with the objective to differentiate the handling of fault detection and fault reaction from handling of defect and its options for defect correction. We extended the scope of IEEE Std. 1044-2009 from anomalies related to software only to anomalies related to complex IoT systems. To evaluate the quality in use of our classification a study was conducted at Robert Bosch GmbH. We applied our adaptation to a postmortem analysis of an IoT solution and evaluated the quality in use by conducting interviews with three stakeholders. Our adaptation was effectively applied and interteam communications as well as iterative and inductive learning for product improvement were enhanced. Further training and practice are required. Comment: Submitted to QRS 2020 (IEEE Conference on Software Quality, Reliability and Security)

arXiv.org e-Print Archive

Crossref