Low Overhead Soft Error Mitigation Methodologies by Prasanth, V
 
Abstract 
CMOS technology scaling is bringing new challenges to designers in the form of 
new failure modes. These challenges include long-term reliability failures and 
particle-strike-induced random failures. Studies have shown that soft errors will 
increasingly be the largest contributor to device reliability failures. Because of 
these reliability concerns, soft error mitigation techniques are being adopted more 
widely, and the area and performance overhead incurred in implementing them is 
becoming correspondingly important. This thesis addresses the problem of providing 
low-cost soft error mitigation.  
The main contributions of this thesis are: (i) a new delayed capture 
methodology for low-overhead soft error detection; (ii) the adoption of Error Control 
Coding (ECC) in the delayed capture methodology to correct single event upsets; 
(iii) an analysis of the impact of different derating factors, used to reduce the 
hardware overhead incurred by the above implementations; and (iv) a proposal for 
hardware-software co-design for reliability, based on identifying the critical 
components as determined by the application executing on the hardware (as opposed 
to standalone hardware analysis). 
This thesis first surveys existing soft error mitigation techniques and their associated 
limitations. It then proposes a new delayed capture methodology as a low-overhead soft 
error detection technique. The delayed capture methodology is an enhancement of the 
Razor flip-flop methodology. In the delayed capture methodology, the parity of a set of 
flip-flops is calculated at both their inputs and their outputs. The input parity is latched 
on a second clock, which is delayed with respect to the functional clock by more than the 
soft error pulse width. This requires only one extra flip-flop for each set of flip-flops, 
whereas the Razor flip-flop methodology requires an additional flip-flop for every 
functional flip-flop. Because the skew between the clocks exceeds the pulse width, a 
transient is captured by either the parity flip-flop or the functional flip-flop, and hence 
an error can be detected by comparing the output parity with the latched input parity. 
Fault injection experiments are performed to evaluate the benefits and limitations of the 
proposed approach.  
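The detection principle described above can be illustrated with a minimal behavioral sketch. This is not the thesis's implementation: the function name, the timing values, and the modeling of a transient as a temporary bit flip on one flip-flop input are all assumptions made for illustration.

```python
from functools import reduce
from operator import xor


def parity(bits):
    """XOR of all bits in the group."""
    return reduce(xor, bits, 0)


def delayed_capture_check(d_inputs, glitch_bit=None, glitch_start=0.0,
                          glitch_width=0.0, t_func=1.0, clock_delay=0.3):
    """Hypothetical model of one flip-flop group with a delayed parity flip-flop.

    d_inputs     : fault-free values at the flip-flop inputs (list of 0/1)
    glitch_*     : an optional transient that inverts one input for a time window
    t_func       : time of the functional clock edge (assumed value)
    clock_delay  : delay of the second clock relative to the functional clock
    Returns (captured_outputs, error_flag).
    """
    def sample(t):
        # Value seen at the flip-flop inputs at time t, including the transient.
        bits = list(d_inputs)
        if glitch_bit is not None and glitch_start <= t < glitch_start + glitch_width:
            bits[glitch_bit] ^= 1
        return bits

    q = sample(t_func)                                   # functional flip-flops
    latched_input_parity = parity(sample(t_func + clock_delay))  # parity flip-flop
    output_parity = parity(q)
    return q, output_parity != latched_input_parity


data = [1, 0, 1, 1]
# No transient: parities agree.
print(delayed_capture_check(data))
# Transient narrower than the clock delay, hitting the functional edge: detected.
print(delayed_capture_check(data, glitch_bit=2, glitch_start=0.9, glitch_width=0.2))
# Transient wider than the clock delay corrupts both captures and escapes.
print(delayed_capture_check(data, glitch_bit=2, glitch_start=0.9, glitch_width=0.5))
```

The last call shows why, in this simplified model, the second clock must be delayed by more than the transient pulse width: otherwise both captures see the same corrupted value and the parities still match.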
The limitations include soft error detection escapes and lack of error correction 
capability. Different cases of soft error detection escapes are analyzed. They are 
attributed mainly to a Single Event Upset (SEU) causing multiple flip-flops within a 
group to be in error. The error space due to SEUs is analyzed and an intelligent flip-flop 
grouping method using graph theoretic formulations is proposed such that no SEU can 
cause multiple flip-flops within a group to be in error. Once the error occurs, leaving the 
correction aspects to the application may not be desirable. The proposed delayed capture 
methodology is extended to replace parity codes with codes having higher redundancy to 
enable correction. The hardware overhead of the proposed methodology is analyzed, 
and an area saving of about 15% is obtained compared with an existing soft error 
mitigation methodology with equivalent coverage. 
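One way to picture the grouping constraint is as a graph coloring problem. The abstract does not specify the graph theoretic formulation used, so the sketch below is only an assumed illustration: flip-flops that a single strike site can corrupt together are connected in a conflict graph, and a greedy coloring assigns connected flip-flops to different parity groups. The names `strike_cones`, `build_conflict_graph`, and `greedy_grouping` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_conflict_graph(strike_cones):
    """strike_cones maps a strike site to the set of flip-flops it can corrupt.

    Two flip-flops conflict if some single strike site can corrupt both.
    """
    graph = defaultdict(set)
    for flops in strike_cones.values():
        for ff in flops:
            graph[ff]  # make sure isolated flip-flops appear as nodes
        for a, b in combinations(sorted(flops), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_grouping(graph):
    """Greedy coloring (highest degree first); each color is one parity group."""
    group_of = {}
    for ff in sorted(graph, key=lambda f: (-len(graph[f]), f)):
        used = {group_of[n] for n in graph[ff] if n in group_of}
        g = 0
        while g in used:
            g += 1
        group_of[ff] = g
    groups = defaultdict(list)
    for ff, g in group_of.items():
        groups[g].append(ff)
    return [sorted(members) for _, members in sorted(groups.items())]


# Assumed example: three strike sites and the flip-flops each can reach.
strike_cones = {"g1": {"ff0", "ff1"}, "g2": {"ff1", "ff2"}, "g3": {"ff3"}}
graph = build_conflict_graph(strike_cones)
print(greedy_grouping(graph))
```

Greedy coloring is a heuristic and does not guarantee the minimum number of groups; it is used here only because it is short and guarantees that no two conflicting flip-flops share a group.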
The impact of different derating factors on the hardware overhead of the soft error 
mitigation methodology is then analyzed. Electrical derating and timing derating 
information are considered for this evaluation. The area overhead of the circuit with 
the delayed capture methodology implemented is analyzed with these derating factors 
applied individually and in combination. Results indicate that, depending on the 
circuit, the best result is obtained either by combining the derating factors or by 
applying one of them alone. This is because the solution depends on the heuristic 
nature of the algorithms used. Area savings of about 23% are obtained by employing 
these derating factors to achieve a better grouping of flip-flops. 
Finally, a new paradigm of hardware-software co-design for reliability is proposed. 
This is based on application derating in which the application / firmware code is profiled 
to identify the critical components which must be guarded from soft errors. This 
identification is based on the ability of the application software to tolerate certain errors 
in hardware. An algorithm to identify critical components in the control logic based on 
fault injection is developed. Experimental results indicate that, for a safety-critical 
automotive application, only 12% of the sequential logic elements are critical. This 
approach provides a framework for investigating how software methods can complement 
hardware methods to provide a reduced-hardware solution for soft error mitigation. 
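The idea of fault-injection-based critical component identification can be sketched on a toy model. The abstract does not give the thesis's algorithm, so everything here is assumed for illustration: the register names, the toy application `run_app`, and the rule that an element is critical if any single injected bit flip changes the application-visible result.

```python
def run_app(fault=None, cycles=8):
    """Toy application model. fault is (register_name, cycle) or None.

    A fault flips the least significant bit of the named register at the
    start of the given cycle. The return value is the application output.
    """
    regs = {"acc": 0, "dbg_count": 0, "scratch": 0}
    for cycle in range(cycles):
        if fault is not None and fault[1] == cycle:
            regs[fault[0]] ^= 1
        regs["scratch"] = cycle * 3          # always rewritten before it is read
        regs["acc"] = (regs["acc"] + regs["scratch"]) & 0xFF
        regs["dbg_count"] += 1               # never influences the output
    return regs["acc"]


def find_critical(elements, cycles=8):
    """Return the elements for which some injected fault changes the output."""
    golden = run_app(cycles=cycles)          # fault-free reference run
    critical = set()
    for elem in elements:
        for cycle in range(cycles):
            if run_app(fault=(elem, cycle), cycles=cycles) != golden:
                critical.add(elem)
                break
    return critical


print(find_critical(["acc", "dbg_count", "scratch"]))
```

In this toy model only `acc` is reported as critical: upsets in `scratch` are overwritten before use and upsets in `dbg_count` never reach the output, so the application tolerates them and those elements would not need hardware protection.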
