45,037 research outputs found
On fault tolerance and scalability of swarm robotic systems
This paper challenges the common assumption that swarm robotic systems are robust and scalable by default. We present an analysis based on both reliability modelling and experimental trials of a case study swarm performing team work, in which failures are deliberately induced. Our case study has been carefully chosen to represent a swarm task in which the overall desired system behaviour is an emergent property of the interactions between robots, in order that we can assess the fault tolerance of a self-organising system. Our findings show that in the presence of worst-case partially failed robots the overall system reliability quickly falls with increasing swarm size. We conclude that future large scale swarm systems will need a new approach to achieving high levels of fault tolerance. © 2013 Springer-Verlag
Resilient random modulo cache memories for probabilistically-analyzable real-time systems
Fault tolerance has often been assessed separately in safety-related real-time systems, which may lead to inefficient solutions. Recently, Measurement-Based Probabilistic Timing Analysis (MBPTA) has been proposed to estimate Worst-Case Execution Time (WCET) on high performance hardware. The intrinsic probabilistic nature of MBPTA-commpliant hardware matches perfectly with the random nature of hardware faults.
Joint WCET analysis and reliability assessment has been done so far for some MBPTA-compliant designs, but not for the most promising cache design: random modulo. In this paper we perform, for the first time, an assessment of the aging-robustness of random modulo and propose new implementations preserving the key properties of random modulo, a.k.a. low critical path impact, low miss rates and MBPTA compliance, while enhancing reliability in front of aging by achieving a better – yet random – activity distribution across cache sets.Peer ReviewedPostprint (author's final draft
Providing Integrity in Real-Time Networks-on-Chip
Mixed-critical real-time systems must meet strict integrity, resilience and timing constraints, as specified by safety standards. Due to the increasing threat of random hardware faults, efficiently achieving high reliability and dependability calls for cross-layer fault-tolerance solutions. This work introduces the Advanced Integrity Q-service (AIQ), a mechanism to ensure the integrity and predictability of on-Chip communication under random hardware faults. Devised for cross-layer and hierarchical fault-tolerance solutions, AIQ realizes low-overhead error detection in hardware and delegates error handling to arbitrary strategies in software. Experimental evaluation featuring benchmark applications and an industrial avionics use case shows that AIQ operates with high reliability and availability and low hardware and performance overheads. In a many-core mixed-critical platform under expected real-time scenarios, AIQ performs with execution time overhead between 1.4% and 7.1%
Recommended from our members
Fault-tolerant hardware designs and their reliability analysis
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.Fault-tolerance, which is a complement to fault prevention, is an effective method of achieving ultra-high reliability. By taking this approach fault free computation can be achieved despite the presence of fault in the system. In this thesis three new fault tolerant techniques are presented and their advantages over well known fault-tolerant strategies are shown. One of these new techniques achieves higher reliability than any other similar techniques presented in the literature. Generally fault-tolerant structures consist of four major blocks: the replicated modules, the disagreement and detection circuit, the switching circuit, and the voting mechanism. The most critical component in a fault-tolerant system is the voter because the final output of the system is computed by this component. This dissertation presents a new implementation for voters which reduces both the complexity and the occupied area on the chip. The structures of the three techniques developed in this work are such that the complexity of their switching mechanisms grows only linearly with the number of modules but the voting mechanism complexity increases significantly. This is a better approach than those schemes in which the switching complexity increases significantly and the voter's complexity remains constant or grows linearly with the number of modules because it is easier to implement a complex voter than a complex switch (voters have more regular structures). Extensive comparisons are made between different fault-tolerant techniques. A new reliability model is also developed for system reliability evaluation of the new designs. The results of these analyses are plotted, and the advantages of the new techniques are demonstrated. In the final part of the work an expert system is described which uses the knowledge acquired by these comparisons. This expert system is meant as a prototype of a component of a CAD tool which will act as an advisor on fault-tolerant techniques
Recommended from our members
Modeling software design diversity
Design diversity has been used for many years now as a means of achieving a degree of fault tolerance in software-based systems. Whilst there is clear evidence that the approach can be expected to deliver some increase in reliability compared with a single version, there is not agreement about the extent of this. More importantly, it remains difficult to evaluate exactly how reliable a particular diverse fault-tolerant system is. This difficulty arises because assumptions of independence of failures between different versions have been shown not to be tenable: assessment of the actual level of dependence present is therefore needed, and this is hard. In this tutorial we survey the modelling issues here, with an emphasis upon the impact these have upon the problem of assessing the reliability of fault tolerant systems. The intended audience is one of designers, assessors and project managers with only a basic knowledge of probabilities, as well as reliability experts without detailed knowledge of software, who seek an introduction to the probabilistic issues in decisions about design diversity
Testability and redundancy techniques for improved yield and reliability of CMOS VLSI circuits
The research presented in this thesis is concerned with the design of fault-tolerant integrated circuits as a contribution to the design of fault-tolerant systems. The economical manufacture of very large area ICs will necessitate the incorporation of fault-tolerance features which are routinely employed in current high density dynamic random access memories. Furthermore, the growing use of ICs in safety-critical applications and/or hostile environments in addition to the prospect of single-chip systems will mandate the use of fault-tolerance for improved reliability. A fault-tolerant IC must be able to detect and correct all possible faults that may affect its operation. The ability of a chip to detect its own faults is not only necessary for fault-tolerance, but it is also regarded as the ultimate solution to the problem of testing. Off-line periodic testing is selected for this research because it achieves better coverage of physical faults and it requires less extra hardware than on-line error detection techniques. Tests for CMOS stuck-open faults are shown to detect all other faults. Simple test sequence generation procedures for the detection of all faults are derived. The test sequences generated by these procedures produce a trivial output, thereby, greatly simplifying the task of test response analysis. A further advantage of the proposed test generation procedures is that they do not require the enumeration of faults. The implementation of built-in self-test is considered and it is shown that the hardware overhead is comparable to that associated with pseudo-random and pseudo-exhaustive techniques while achieving a much higher fault coverage through-the use of the proposed test generation procedures. The consideration of the problem of testing the test circuitry led to the conclusion that complete test coverage may be achieved if separate chips cooperate in testing each other's untested parts. An alternative approach towards complete test coverage would be to design the test circuitry so that it is as distributed as possible and so that it is tested as it performs its function. Fault correction relies on the provision of spare units and a means of reconfiguring the circuit so that the faulty units are discarded. This raises the question of what is the optimum size of a unit? A mathematical model, linking yield and reliability is therefore developed to answer such a question and also to study the effects of such parameters as the amount of redundancy, the size of the additional circuitry required for testing and reconfiguration, and the effect of periodic testing on reliability. The stringent requirement on the size of the reconfiguration logic is illustrated by the application of the model to a typical example. Another important result concerns the effect of periodic testing on reliability. It is shown that periodic off-line testing can achieve approximately the same level of reliability as on-line testing, even when the time between tests is many hundreds of hours
- …