Abstract. The major challenges that will be faced by designers of embedded systems based on future technologies are discussed. While providing many benefits, those technologies bring along several problems, such as higher defect rates, higher sensitivity to radiation-induced transient faults, and the possibility of multiple simultaneous faults and long duration transients. The main characteristics of future technologies are presented and the new challenges imposed on designers are highlighted. Classic and recently proposed mitigation techniques are reviewed, and the weaknesses that will impair their application to those technologies are discussed. Recent research works aiming to cope with this new scenario are presented and analyzed, taking into account their impact on area, performance and power consumption. Strategies to cope with those challenges at different design levels are discussed, and research paths that may lead to the solution of the problems are proposed.
Introduction
As technology evolves, faster and smaller devices become available for manufacturing circuits that, while more efficient, are more sensitive to the effects of radiation. The high transistor density, by reducing the distance between neighboring devices, makes it possible for a single particle hit to cause multiple upsets. The achievable high speed, by shortening the clock cycles of circuits, leads to transient pulses lasting longer than one cycle. All those facts preclude the use of several existing soft error mitigation techniques based on temporal redundancy, and require the development of innovative fault tolerant techniques to cope with this challenging new scenario.
In this tutorial the works that point to this new scenario are briefly presented, and existing mitigation techniques are analyzed, showing their weaknesses in coping with multiple simultaneous faults and long duration transients (LDTs). The need for innovative solutions to face these challenges is highlighted, and recently proposed candidate techniques to deal with faults at different abstraction levels are presented and discussed.
Summary
The collision of an energetic particle with the silicon substrate of CMOS circuits causes the deposition of a charge that may affect the state of the hit device. If the collected charge is larger than the critical charge of the device, it may change the transistor's state, thereby inducing what is called a transient fault. After a given time, this charge dissipates and the device resumes its previous state. The time lapse since the charge is collected until it is dissipated is called the transient width. If the transient width is long enough, the wrong state may be latched by a storage element in the circuit, causing what is called a soft error, since the circuit is still able to store new correct values in the future.
When the predicted transient widths are contrasted with the evolution of typical circuit propagation delays across technologies, one can see that they do not scale at the same pace. For low energy particles, there is almost no variation in the transient widths, while for higher energy ones the variation is still far below that of the propagation delays. In parallel, the decreasing dimensions of devices make the occurrence of multiple simultaneous faults a growing concern for designers, since most current mitigation techniques rely on the single fault model and cannot cope with simultaneous faults.
These facts lead to the prediction of a new scenario, in which many currently used soft error mitigation techniques will no longer succeed.
Techniques based on temporal redundancy, which sample the outputs of a circuit twice, and then compare the obtained values in order to detect transient errors, use a fixed time interval between the two samplings. This interval must be larger than the maximum expected transient width, otherwise two equally erroneous output values might be considered correct. In order to cope with long duration transients, the time between samplings should be larger, but since the duration of the transients may become equal to or larger than the cycle time of the circuits, this alternative is precluded by the unbearable performance penalty it imposes.
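The limitation described above can be illustrated with a minimal sketch. It is a toy model, not a technique taken from any of the reviewed works: the transient is assumed to invert a one-bit output for its whole duration, times are integers in arbitrary units, and all function and parameter names are hypothetical.

```python
def transient_active(t, start, width):
    """True while the (assumed) transient pulse corrupts the output."""
    return start <= t < start + width


def sample(t, correct_value, start, width):
    """Value observed at time t: inverted while the transient is active."""
    if transient_active(t, start, width):
        return not correct_value
    return correct_value


def temporal_check(t1, delta, correct_value, start, width):
    """Sample twice, delta time units apart.

    Returns (first_sample, error_detected). A mismatch between the two
    samples is the only way this scheme can detect an error.
    """
    first = sample(t1, correct_value, start, width)
    second = sample(t1 + delta, correct_value, start, width)
    return first, first != second


# Transient narrower than the sampling interval: the samples disagree,
# so the error is detected.
short = temporal_check(t1=100, delta=50, correct_value=True, start=90, width=30)

# Transient wider than the sampling interval (an LDT): both samples see
# the same wrong value, so the error goes undetected.
long_ = temporal_check(t1=100, delta=50, correct_value=True, start=90, width=100)
```

In this model, `short` reports a detected error, while `long_` returns a wrong first sample with no error flagged. Widening `delta` beyond the transient width restores detection, at the cost of the performance penalty discussed above.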
In turn, classic techniques based on space redundancy, such as duplication with comparison (DWC) and triple modular redundancy (TMR), despite being able to cope with LDTs, still impose very high area and power consumption overheads, and both area and power are very scarce resources in the embedded systems arena. Other techniques, working at component or circuit level, will also become .
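The two space redundancy schemes named above can be sketched as follows. This is a simplified illustration that assumes each replica produces an integer word and that the comparator and voter themselves are fault free; the function names and the example bit patterns are hypothetical.

```python
def dwc(copy_a, copy_b):
    """Duplication with comparison: flags a mismatch but cannot correct it."""
    return copy_a, copy_a != copy_b


def tmr(a, b, c):
    """Triple modular redundancy: bitwise majority vote over three replicas."""
    return (a & b) | (a & c) | (b & c)


golden = 0b1011            # fault-free output (illustrative value)
faulty = golden ^ 0b0100   # same word with one bit flipped by a fault

# Single fault: DWC detects it, TMR masks it.
dwc_value, dwc_error = dwc(golden, faulty)
tmr_single = tmr(golden, faulty, golden)

# Two replicas hit identically at the same time: the voter follows the
# faulty majority, which is why the single fault model matters.
tmr_double = tmr(faulty, faulty, golden)
```

Because the result does not depend on when the replicas are sampled, the duration of the transient is irrelevant here, but the sketch also makes the cost visible: two or three full copies of the protected logic plus the comparator or voter.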
Given this scenario, the development of innovative mitigation techniques to deal with LDTs and multiple simultaneous faults becomes mandatory.
In this tutorial, recently proposed techniques are analyzed in order to determine their adequacy to cope with those problems. Furthermore, it is shown that the mitigation of faults at the lower abstraction levels of digital systems usually implies higher overheads than those imposed by higher level mitigation techniques.
The use of algorithm or system level mitigation techniques is then evaluated as a possible path to the hardening of complete systems against radiation-induced faults, and some proposed works that adopted this strategy are presented and discussed.
Finally, it is shown that a combination of different techniques, working at distinct abstraction levels, may be the best approach to adopt, since errors at the lower levels of a system may even impair the ability of the system to deal with faults at the higher levels.
