    Efficient diagnosis of multiprocessor systems under probabilistic models

    The problem of fault diagnosis in multiprocessor systems is considered under a probabilistic fault model. The focus is on minimizing the number of tests that must be conducted to correctly diagnose the state of every processor in the system with high probability. A diagnosis algorithm is presented that correctly diagnoses the state of every processor with probability approaching one in a class of systems that perform slightly more than a linear number of tests. A nearly matching lower bound on the number of tests required to achieve correct diagnosis in arbitrary systems is also proven. Lower and upper bounds on the number of tests required for regular systems are also presented. A class of regular systems that includes hypercubes is shown to be correctly diagnosable with high probability. In all cases, the number of tests required under this probabilistic model is shown to be significantly smaller than under a bounded-size fault set model. Because the number of tests that must be conducted is a measure of the diagnosis overhead, these results represent a dramatic improvement in the performance of system-level diagnosis techniques.

    Attack monitoring and localization in an all-optical network

    An All-Optical Network (AON) is a network in which data does not undergo optical-to-electrical (O-E) or electrical-to-optical (E-O) conversion within the network. Although AONs are a viable technology for future telecommunication and data networks, little attention has been devoted to the intrinsic differences between AONs and existing electro-optic/electronic networks in issues of security management. Without O-E-O conversion, many security vulnerabilities that do not exist in traditional networks are created. Transparency and non-regeneration features make attack detection and localization difficult. However, it is important to detect and localize an attack connection quickly in a transparent AON. Among all attack methods, the crosstalk attack has the highest damage capability. Therefore, we focus specifically on crosstalk attacks in this dissertation. We show that it is possible to effectively reduce the number of monitors while still retaining all diagnostic capabilities. We make the following contributions: (1) We provide a crosstalk attack model and a monitoring model. (2) Based on these models, we prove necessary and sufficient conditions for a network to be diagnosable under both a single crosstalk attack and multiple (i.e., k) crosstalk attacks. The key idea in our solution is to employ the status of connections as diagnostic data. (3) We develop efficient monitor placement policies, test connection setup policies, and routing policies for such a network. These conditions lead to efficient k-attack detection and diagnosis algorithms. (4) Finally, we analyze the performance of these algorithms. With these conditions and policies, we prove that a sparse monitor system for monitoring and localizing crosstalk attacks in an AON is not only possible but also feasible.

    GA-Based fault diagnosis algorithms for distributed systems

    Distributed systems are becoming increasingly popular because of their applications in fields such as automotive electronics, remote environment control (for example, underwater sensor networks), and K-connected networks. Faults may affect the nodes of the system at any time, so diagnosing the faulty nodes is essential to making the system more reliable and efficient. This thesis describes the different types of faults, and the system and fault models, that already appear in the literature. On the premise that evolutionary approaches give better outcomes than probabilistic approaches, we have developed genetic-algorithm-based fault diagnosis algorithms that provide better results than other fault diagnosis algorithms. The GA-based algorithms handle different types of faults, both permanent and intermittent, in a K-connected system. Simulation results demonstrate that the proposed Genetic Algorithm Based Permanent Fault Diagnosis Algorithm (GAPFDA) and Genetic Algorithm Based Intermittent Fault Diagnosis Algorithm (GAIFDA) decrease the number of messages transferred and the time needed to diagnose the faulty nodes in a K-connected distributed system. The decrease in CPU time and in the number of steps is due to the application of supervised mutation in the fault diagnosis algorithms. The time complexity and message complexity of GAPFDA are O(n*P*K*ng) and O(n*K) respectively. The time complexity and message complexity of GAIFDA are O(r*n*P*K*ng) and O(r*n*K) respectively, where 'n' is the number of nodes, 'P' is the population size, 'K' is the connectivity of the network, 'ng' is the number of generations (steps), and 'r' is the number of rounds.
Along with an O(r*k) fault diagnosis algorithm for diagnosing transient-leading-to-permanent faults in the actuators of a k-fault-tolerant fly-by-wire (FBW) system, where 'r' denotes the number of rounds, an efficient scheduling algorithm has been developed to schedule the different tasks of an FBW system. The proposed algorithm for scheduling the task graphs of a multi-rate FBW system demonstrates that maximizing the microcontroller's execution period reduces the number of microcontrollers needed to perform diagnosis.

    Multilevel distributed diagnosis and the design of a distributed network fault detection system based on the SNMP protocol.

    In this thesis, we propose a new distributed diagnosis algorithm using the multilevel paradigm. This algorithm is a generalization of both the ADSD and Hi-ADSD algorithms. We present all details of the design and implementation of this multilevel adaptive distributed diagnosis algorithm, called the ML-ADSD algorithm. We also present extensive simulation results comparing the performance of these three algorithms. In 1967, Preparata, Metze and Chien proposed a model and a framework for diagnosing faulty processors in a multiprocessor system. To exploit the inherent parallelism available in a multiprocessor system and thereby improve fault tolerance, Kuhl and Reddy, in 1980, pioneered a new area of research known as distributed system-level diagnosis. Following this pioneering work, in 1991, Bianchini and Buskens proposed an adaptive distributed algorithm to diagnose fully connected networks. This algorithm, called the ADSD algorithm, has a diagnosis latency of O(N) testing rounds for a network with N nodes. With a view to improving the diagnosis latency of the ADSD algorithm, in 1998 Duarte and Nanya proposed a hierarchical distributed diagnosis algorithm for fully connected networks. This algorithm, called the Hi-ADSD algorithm, has a diagnosis latency of O(log^2 N) testing rounds. The Hi-ADSD algorithm can be viewed as a generalization of the ADSD algorithm. In all cases, the time required by the ML-ADSD algorithm is better than or the same as that required by the Hi-ADSD algorithm. The performance of the ML-ADSD algorithm can be improved by an appropriate choice of the number of clusters and the number of levels. Also, the ML-ADSD algorithm is scalable in the sense that only minor modifications are required to adapt the algorithm to networks of varying sizes. This property is not shared by the Hi-ADSD algorithm.
The primary application of our research is to develop and implement a prototype network fault detection/monitoring system by integrating the ML-ADSD algorithm into an SNMP-based (Simple Network Management Protocol) fault management system. We report the details of the design and implementation of such a distributed network fault detection system.
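As a rough illustration of the adaptive testing idea behind the ADSD algorithm mentioned above, the following sketch computes which node each fault-free node ends up testing. It assumes the commonly described rule in which node i tests i+1, i+2, ... (mod N) until it finds a fault-free node. The function name and the global status list are hypothetical simplifications and are not taken from the thesis; a real implementation is distributed, and each node learns the status of others only through its own tests.

```python
def adsd_test_assignment(fault_free):
    """Return {tester: tested} for every fault-free node (illustrative only).

    fault_free[i] is True when node i is fault-free. Assumed rule: each
    fault-free node tests i+1, i+2, ... (mod N) and stops at the first
    fault-free node it finds.
    """
    n = len(fault_free)
    assignment = {}
    for i in range(n):
        if not fault_free[i]:
            continue  # tests performed by faulty nodes are not relied on
        for step in range(1, n):
            j = (i + step) % n
            if fault_free[j]:
                assignment[i] = j  # first fault-free successor found
                break
    return assignment


# Hypothetical 6-node network in which nodes 1, 2 and 5 are faulty.
status = [True, False, False, True, True, False]
ring = adsd_test_assignment(status)
print(ring)
```

Under this rule the fault-free nodes form a ring of tests, and diagnostic information moves one hop around that ring per testing round, which is consistent with the O(N) latency quoted for ADSD.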

    Test assignments for a system-level diagnosis algorithm for wireless sensor networks

    This work compares three approaches to constructing test assignments for a system-level diagnosis algorithm. The approaches address the problem of detecting false alarms (false positives) in a wireless sensor network in which sensors monitor the environment in order to raise alarms about the occurrence of certain events. Consider a sensor network in which a set of t geographically close sensors send alarm signals to a central unit of the network with greater processing capacity, called the sink, reporting the detection of a certain phenomenon. To ensure that the generated alarms are not false, the sink requests the execution of mutual tests among the sensors in the region containing the nodes that reported the alarms. The test results are sent to the sink, which then uses a system-level diagnosis algorithm to identify the faulty sensors. The diagnosis algorithm succeeds in this task if the tests executed by the sensors are sufficient to reach a given diagnosability of the system, which depends on topological properties of the sensor network and on certain conditions from the literature for forming t-diagnosable test assignments. This work presents three testing strategies that ensure the desired diagnosability is reached with minimized energy consumption. Experimental results evaluate the behavior of the strategies and compare their energy consumption in networks with different topologies and densities, with different values of t, and with varying distances between the sensors that raise alarms.
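The abstract does not say which conditions from the literature it uses. As a minimal sketch, assuming the classical necessary conditions for t-diagnosability from the Preparata, Metze and Chien (PMC) model, namely n >= 2t + 1 and every unit tested by at least t other units, a candidate test assignment could be screened as follows. The function name and the edge-list representation are hypothetical.

```python
def meets_pmc_necessary_conditions(n, tests, t):
    """Check two classical necessary conditions for t-diagnosability.

    n     -- number of units, labelled 0..n-1
    tests -- iterable of (tester, tested) pairs
    t     -- maximum number of faulty units to be diagnosed
    Passing this check does not by itself prove t-diagnosability.
    """
    if n < 2 * t + 1:
        return False
    testers = {v: set() for v in range(n)}
    for u, v in tests:
        if u != v:
            testers[v].add(u)  # record the distinct units that test v
    return all(len(s) >= t for s in testers.values())


# Hypothetical assignment: in a 5-node ring, node i tests i+1 and i+2.
n = 5
tests = [(i, (i + d) % n) for i in range(n) for d in (1, 2)]
print(meets_pmc_necessary_conditions(n, tests, 2))
print(meets_pmc_necessary_conditions(n, tests, 3))
```

In a sensor network the set of feasible tests is limited by radio range, so a check like this would run only over tests between neighbouring sensors.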

    Scalable fault management architecture for dynamic optical networks : an information-theoretic approach

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (leaves 255-262). All-optical switching, in place of electronic switching, of high data-rate lightpaths at intermediate nodes is one of the key enabling technologies for economically scalable future data networks. This replacement of electronic switching with optical switching at intermediate nodes, however, presents new challenges for fault detection and localization in reconfigurable all-optical networks. Presently, fault detection and localization techniques, as implemented in SONET/G.709 networks, rely on electronic processing of parity checks at intermediate nodes. If similar techniques are adapted to all-optical reconfigurable networks, optical signals need to be tapped out at intermediate nodes for parity checks. This additional electronic processing would break the all-optical transparency paradigm and thus significantly diminish the cost advantages of all-optical networks. In this thesis, we propose new fault-diagnosis approaches specifically tailored to all-optical networks, with the objective of keeping the diagnostic capital expenditure and the diagnostic operation effort low. Instead of the aforementioned passive monitoring paradigm based on parity checks, we propose a proactive lightpath probing paradigm: optical probing signals are sent along a set of lightpaths in the network, and the network state (i.e., the failure pattern) is then inferred from the results of this set of end-to-end lightpath measurements. Moreover, we assume that a subset of network nodes (up to all the nodes) is equipped with diagnostic agents, comprising both transmitters/receivers for probe transmission/detection and software processes for probe management, to perform fault detection and localization.
The design objectives of this proposed proactive probing paradigm are twofold: i) to minimize the number of lightpath probes to keep the diagnostic operational effort low, and ii) to minimize the amount of diagnostic hardware to keep the diagnostic capital expenditure low. The network fault-diagnosis problem can be mathematically modeled with a group-testing-over-graphs framework. In particular, the network is abstracted as a graph in which the failure status of each node/link is modeled with a random variable (e.g., one with a Bernoulli distribution). A probe over any path in the graph results in a value, defined as the probe syndrome, which is a function of all the random variables associated with that path. A network failure pattern is inferred from a set of probe syndromes resulting from a set of optimally chosen probes. This framework enriches the traditional group-testing problem by introducing a topological structure, and can be extended to model many other network-monitoring problems (e.g., packet delay, packet drop ratio, noise, etc.) by choosing appropriate state variables. Under the group-testing-over-graphs framework with a probabilistic failure model, we initiate an information-theoretic approach to minimizing the average number of lightpath probes needed to identify all possible network failure patterns. Specifically, we have established an isomorphic mapping between the fault-diagnosis problem in network management and the source-coding problem in information theory. This mapping suggests that the minimum average number of lightpath probes required is lower bounded by the information entropy of the network state, and that efficient source-coding algorithms (e.g., the run-length code) can be translated into scalable fault-diagnosis schemes under some additional probe feasibility constraints.
Our analytical and numerical investigations yield a guideline for designing scalable fault-diagnosis algorithms: each probe should provide approximately 1 bit of state information, and thus the total number of probes required is approximately equal to the entropy of the network state. To address the hardware cost of diagnosis, we also developed a probabilistic analysis framework to characterize the trade-off between hardware cost (i.e., the number of nodes equipped with Tx/Rx pairs) and diagnosis capability (i.e., the probability of successful failure detection and localization). Our results suggest that, for practical situations, the hardware cost can be reduced significantly by accepting a small amount of uncertainty about the failure status. By Yonggang Wen, Ph.D.
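To make the entropy guideline above concrete, the following sketch computes the entropy of a network state. It assumes that each link fails independently with a known Bernoulli probability; the independence assumption, the function names and the example numbers are illustrative and are not taken from the thesis.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) random variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def network_state_entropy(failure_probs):
    """Entropy in bits of the joint state, assuming independent failures."""
    return sum(binary_entropy(p) for p in failure_probs)


# Hypothetical example: 100 links, each failing with probability 0.01.
probs = [0.01] * 100
h = network_state_entropy(probs)
print(f"network state entropy: {h:.2f} bits")
print(f"lower bound on the average number of probes: about {h:.2f}")
```

Under these assumptions the entropy is far smaller than the number of links, which is the intuition behind probing paths rather than testing every link individually when failures are rare.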