3 research outputs found

    Constraint Based System-Level Diagnosis of Multiprocessors

    Massively parallel multiprocessors induce new requirements for system-level fault diagnosis, such as handling a huge number of processing elements in an inhomogeneous system. Traditional diagnostic models (PMC, BGM, etc.) are insufficient to fulfill all of these requirements. This paper presents a novel modelling technique based on a special area of artificial intelligence (AI) methods: constraint satisfaction (CS). The constraint-based approach is able to handle functional faults in a similar way to the Russell-Kime model. Moreover, it can use multiple-valued logic to deal with system components having multiple fault modes. The resolution of the produced models can be adjusted to fit the actual diagnostic goal. Consequently, constraint-based methods are applicable to a much wider range of multiprocessor architectures than earlier models. The basic problem of system-level diagnosis, syndrome decoding, can easily be transformed into a constraint satisfaction problem (CSP). Thus, the diagnosis algorithm can be derived from the related constraint-solving algorithm. Different abstraction levels can be used for the various diagnosis resolutions, employing the same methodology. As examples, two algorithms are described in the paper; both are intended for the Parsytec GCel massively parallel system. The centralized method uses a more elaborate system model and provides detailed diagnostic information suitable for off-line evaluation. The distributed method makes fast decisions for reconfiguration control, using a simplified model.
    Keywords: system-level self-diagnosis, massively parallel computing systems, constraint satisfaction, diagnostic models, centralized and distributed diagnostic algorithms
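
To make the abstract's central idea concrete, here is a minimal sketch (our illustration, not the paper's implementation) of syndrome decoding posed as a CSP under the classic PMC assumptions: each unit is a variable over {fault-free, faulty}, and every test outcome contributes one constraint. The topology, syndrome values, and the `consistent` helper below are all hypothetical.

```python
# Minimal sketch: PMC-style syndrome decoding as constraint satisfaction.
# Assumption: a fault-free tester reports its neighbour's true state,
# while a faulty tester may report anything.
from itertools import product

# Hypothetical 4-unit system; edges (tester, testee) with observed
# outcome 0 = "pass", 1 = "fail".
syndrome = {(0, 1): 0, (1, 2): 1, (2, 3): 0, (3, 0): 0, (2, 0): 1}
units = range(4)

def consistent(states):
    """PMC constraint: if the tester is fault-free (0), the reported
    outcome must equal the testee's state; a faulty tester (1) is
    unconstrained."""
    return all(states[u] == 1 or outcome == states[v]
               for (u, v), outcome in syndrome.items())

# Enumerate all fault patterns and keep the consistent ones; a real
# constraint solver would prune this search via propagation.
diagnoses = [s for s in product((0, 1), repeat=len(units))
             if consistent(s)]
# Under a t-diagnosability bound, pick the pattern with fewest faults.
print(min(diagnoses, key=sum))    # -> (0, 0, 1, 0): unit 2 is faulty
```

The brute-force enumeration is only for clarity; the paper's point is that the same encoding generalizes to multi-valued fault modes and inhomogeneous components, where a constraint solver does the pruning.
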

    Evaluating the performance of distributed agreement algorithms: tools, methodology and case studies

    Nowadays, networked computers are present in most aspects of everyday life. Moreover, essential parts of society have come to depend on distributed systems formed of networked computers, so making such systems secure and fault tolerant is a top priority. If the particular fault tolerance requirement is high availability, replication of components is a natural choice. Replication is a difficult problem, as the state of the replicas must be kept consistent even if some replicas fail, and because in distributed systems, relying on centralized control or on a certain timing behavior is often not feasible. Replication in distributed systems is often implemented using group communication. Group communication is concerned with providing high-level multipoint communication primitives and the associated tools. Most often, an emphasis is put on tolerating crash failures of processes. At the heart of most communication primitives lies an agreement problem: the members of a group must agree on things like the set of messages to be delivered to the application, the delivery order of messages, or the set of processes that crashed. Many algorithms to solve agreement problems have been proposed and their correctness proven. However, performance aspects of agreement algorithms have been somewhat neglected, for a variety of reasons: the lack of theoretical and practical tools to support performance evaluation, and the lack of well-defined benchmarks for agreement algorithms. Also, most performance studies focus on analyzing failure-free runs only. In our view, the limited understanding of performance aspects, in both failure-free scenarios and scenarios with failure handling, is an obstacle to adopting agreement protocols in practice, and partly explains why such protocols are not in widespread use in industry today. The main goal of this thesis is to advance the state of the art in this field. The thesis makes major contributions in three domains: new tools, methodology and performance studies. As for new tools, a simulation and prototyping framework offers a practical tool, and new complexity metrics a theoretical tool, for the performance evaluation of agreement algorithms. As for methodology, the thesis proposes a set of well-defined benchmarks for atomic broadcast algorithms (such algorithms are important as they provide the basis for a number of replication techniques). Finally, three studies are presented that investigate important performance issues with agreement algorithms. The prototyping and simulation framework simplifies the tedious task of developing algorithms based on message passing, the communication model that most agreement algorithms are written for. In this framework, the same implementation can be reused for simulations and for performance measurements on a real network. This characteristic greatly eases the task of validating simulation results with measurements (or vice versa). As for theoretical tools, we introduce two complexity metrics that predict performance with more accuracy than the traditional time and message complexity metrics. The key point is that our metrics account for resource contention, both on the network and on the hosts; resource contention is widely recognized as having a major impact on the performance of distributed algorithms. Extensive validation studies have been conducted.
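
The flavour of such a contention-aware metric can be sketched as follows. This is a toy model of our own, not the thesis's actual definitions: we assume each send costs `lam` time units on the sender's CPU, one time unit on a single shared network, and `lam` on the receiver's CPU, with CPUs and the network used exclusively.

```python
# Toy contention-aware cost model (illustrative assumption, not the
# thesis's exact metric): CPUs and the network are exclusive resources.

def broadcast_time(n, lam):
    """Completion time of one process sending to n-1 receivers when
    each send occupies the sender CPU for `lam`, the shared network
    for 1.0, and the receiver CPU for `lam` time units."""
    cpu_free = net_free = 0.0
    done = 0.0
    for _ in range(n - 1):
        cpu_free += lam                  # sender CPU busy per message
        net_start = max(cpu_free, net_free)
        net_free = net_start + 1.0       # network carries one msg at a time
        done = net_free + lam            # receiver CPU cost
    return done

# Network-bound (lam < 1) vs host-bound (lam > 1) environments give
# very different predictions for the same message count of n-1.
print(broadcast_time(8, 0.1))   # -> 7.2
print(broadcast_time(8, 2.0))   # -> 17.0
```

Plain message complexity counts n-1 messages in both runs, yet the predicted completion times differ sharply, which is exactly the effect that contention-aware metrics are designed to capture.
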
Currently, no widely accepted benchmarks exist for agreement algorithms or group communication toolkits, which makes comparing performance results from different sources difficult. In an attempt to consolidate the situation, we define a number of benchmarks for atomic broadcast. Our benchmarks include well-defined metrics, workloads and failure scenarios (faultloads). The use of the benchmarks is illustrated in two detailed case studies. Two widespread mechanisms for handling failures are unreliable failure detectors, which provide possibly inconsistent information about failures, and a group membership service, which provides consistent information about failures. We analyze the performance tradeoffs of these two techniques by comparing the performance of two atomic broadcast algorithms designed for an asynchronous system. Based on our results, we advocate a combined use of the two approaches to failure handling. In another case study, we compare two consensus algorithms designed for an asynchronous system. The two algorithms differ in how they coordinate the decision process: one uses a centralized and the other a decentralized communication scheme. Our results show that the performance tradeoffs are strongly affected by a number of characteristics of the environment, such as the availability of multicast and the amount of contention on the hosts versus the amount of contention on the network. Well-known theoretical results state that many important agreement problems are not solvable in the asynchronous system model. In our third case study, we investigate how relevant these results are to implementations of a replicated service, by conducting an experiment in a local area network. We exposed a replicated server to extremely high loads and required that the underlying failure detection service detect crashes very fast; the latter is important as the theoretical results are based on the impossibility of reliable failure detection. We found that our replicated server continued working even with the most extreme settings. We discuss the reasons for the robustness of our replicated server.
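
As an illustration of what such a benchmark metric looks like, the sketch below computes one commonly used latency figure for atomic broadcast, early latency (the time from A-broadcast of a message until its first A-delivery at any process), from hypothetical timestamp records. The function name and the data are our assumptions, not the thesis's tooling.

```python
# Minimal sketch of a benchmark metric for atomic broadcast
# (illustrative; names and data are hypothetical).
from statistics import mean

def early_latency(records):
    """Early latency of a broadcast: time from A-broadcast(m) until the
    *first* process A-delivers m, averaged over all messages."""
    return mean(min(delivers) - send for send, delivers in records)

# Per message: the send time and the delivery times at each process,
# e.g. as collected under a steady-state workload of fixed rate.
records = [(0.000, [0.004, 0.005, 0.007]),
           (0.010, [0.013, 0.015, 0.014])]
print(f"early latency: {early_latency(records):.4f} s")
```

A full benchmark in the abstract's sense would pair such a metric with a specified workload (e.g. a fixed broadcast rate) and a faultload (e.g. one replica crashing mid-run).
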

    Constraint Based System-Level Diagnosis of Multiprocessors

    The paper presents a novel modelling technique for system-level fault diagnosis in massively parallel multiprocessors, based on a re-formulation of the syndrome decoding problem as a constraint satisfaction problem (CSP). The CSP-based approach is able to handle detailed and inhomogeneous functional fault models to a similar depth as the Russell-Kime model [18]. Multiple-valued logic is used to deal with system components having multiple fault modes. The granularity of the models can be adjusted to the target diagnostic resolution without altering the methodology. Two algorithms for the Parsytec GCel massively parallel system are used as illustrations in the paper: the centralized method uses a detailed system model and provides a fine-grained diagnostic image for off-line evaluation. The distributed method makes fast decisions for reconfiguration control, using a simplified model.
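
The multiple-valued-logic aspect can be illustrated by widening the variable domains of a CSP encoding like the one sketched earlier: instead of a binary fault-free/faulty state, each unit ranges over several fault modes, and the test constraint becomes a relation over (tester mode, testee mode, outcome) triples. Everything below (the modes, the outcome rule, the topology) is hypothetical.

```python
# Illustrative only: units take one of three fault modes, and a test
# outcome is constrained by the (tester mode, testee mode) pair.
from itertools import product

OK, DEAD, ERRATIC = range(3)    # hypothetical fault modes
MODES = (OK, DEAD, ERRATIC)

def allowed(tester, testee, outcome):
    """Relation replacing the binary PMC rule: an OK tester reports
    pass (0) only for an OK testee; a DEAD tester produces no result,
    encoded as outcome 2 ('missing'); an ERRATIC tester may report
    anything."""
    if tester == OK:
        return outcome == (0 if testee == OK else 1)
    if tester == DEAD:
        return outcome == 2
    return True                  # ERRATIC: unconstrained

# Hypothetical 3-unit syndrome: (tester, testee) -> outcome.
syndrome = {(0, 1): 1, (1, 0): 2, (1, 2): 2, (2, 1): 0}
units = range(3)

diagnoses = [s for s in product(MODES, repeat=len(units))
             if all(allowed(s[u], s[v], o)
                    for (u, v), o in syndrome.items())]
print(diagnoses)   # every multi-valued fault pattern consistent with it
```

The same enumeration-plus-constraints structure carries over unchanged; only the domains and the constraint relation grow, which is what lets the granularity of the model track the target diagnostic resolution.
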