Continued reduction in transistor size has degraded the reliability of the processors
built from these transistors. This is primarily due to manufacturing inaccuracies and
non-ideal operating conditions, which cause transistors to slow down steadily, eventually
leading to permanent breakdown and erroneous operation of the processor. Permanent
transistor breakdowns, or faults, can occur at any point in the processor’s lifetime.
Errors are the discrepancies in the output of faulty
circuits. This dissertation shows that components containing faults can continue
operating if the errors they cause stay within certain bounds. Further, the lifetime
of a processor can be increased by adding supportive structures that start working
once the processor develops hard faults.
This dissertation makes three major contributions: REPAIR, FaultSim and
PreFix. REPAIR is a fault-tolerant system that requires minimal changes to the processor
design. It uses an external Instruction Re-execution Unit (IRU) to perform operations
that the faulty processor might have executed erroneously. Instructions that are
found to use faulty hardware are re-executed on the IRU. REPAIR shows that
the performance overhead of such targeted re-execution is low for a limited number of
faults.
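The idea of targeted re-execution can be illustrated with a minimal sketch. It is not the REPAIR design: the unit names, the stuck-at-0 fault on bit 0, and the functions `golden`, `faulty_core` and `run` are all hypothetical, and the set of faulty units is assumed to be known in advance.

```python
from dataclasses import dataclass

# Hypothetical set of functional units known to be faulty. How faulty
# hardware is identified is outside the scope of this sketch.
FAULTY_UNITS = {"alu1"}

MASK32 = 0xFFFFFFFF


@dataclass
class Instr:
    op: str    # "add" or "mul"
    a: int
    b: int
    unit: str  # functional unit the main core used for this instruction


def golden(instr):
    """Trusted execution, standing in for the re-execution unit."""
    if instr.op == "add":
        return (instr.a + instr.b) & MASK32
    if instr.op == "mul":
        return (instr.a * instr.b) & MASK32
    raise ValueError("unsupported op: " + instr.op)


def faulty_core(instr):
    """Main core: a faulty unit is modelled as having result bit 0 stuck at 0."""
    result = golden(instr)
    if instr.unit in FAULTY_UNITS:
        result &= MASK32 & ~1
    return result


def run(program):
    """Execute on the main core, re-executing only instructions that used faulty units."""
    results = []
    reexecuted = 0
    for instr in program:
        value = faulty_core(instr)
        if instr.unit in FAULTY_UNITS:
            value = golden(instr)  # targeted re-execution
            reexecuted += 1
        results.append(value)
    return results, reexecuted
```

In this toy model only the instructions mapped to a faulty unit pay the re-execution cost, which is why the overhead grows with the number of faults.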
FaultSim is a fast fault simulator capable of simulating large circuits at the transistor
level. It was developed in this dissertation to understand the effect of faults on different
circuits. It performs digital-logic-based simulation, trading analogue accuracy for
speed while still supporting most fault models. Simulating a 32-bit addition, which
involves more than 1500 transistors, takes under 15 microseconds. FaultSim can also be
integrated into an architectural simulator, where it adds a performance overhead of
10 to 26 percent to a simulation. The results show that a single fault causes an
error in an adder for fewer than 10 percent of inputs.
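The metric used here, the fraction of inputs for which a single fault produces an error, can be sketched as follows. This is not FaultSim: it assumes a gate-level ripple-carry adder and a stuck-at fault on a sum or carry node, which is a coarser model than transistor-level simulation, so its numbers are not comparable to the results reported above. The names `add_with_fault` and `error_rate` are hypothetical.

```python
import random

WIDTH = 32


def add_with_fault(a, b, fault=None):
    """Ripple-carry addition of two WIDTH-bit values (carry-out is dropped).

    fault is None, or a tuple (bit, node, stuck_value) where node is
    "sum" or "carry" and stuck_value is 0 or 1. The named node of the
    full adder at position `bit` is forced to stuck_value.
    """
    carry = 0
    result = 0
    for i in range(WIDTH):
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry
        c = (x & y) | (carry & (x ^ y))
        if fault is not None and fault[0] == i:
            if fault[1] == "sum":
                s = fault[2]
            else:
                c = fault[2]
        result |= s << i
        carry = c
    return result


def error_rate(fault, trials=2000, seed=0):
    """Fraction of random input pairs whose faulty sum differs from the fault-free sum."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        a = rng.getrandbits(WIDTH)
        b = rng.getrandbits(WIDTH)
        if add_with_fault(a, b, fault) != add_with_fault(a, b):
            errors += 1
    return errors / trials
```

Calling `error_rate((5, "carry", 0))`, for example, estimates how often a carry node stuck at 0 in bit position 5 corrupts the sum; inputs that never exercise the faulty node are unaffected, which is the masking effect the metric captures.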
PreFix brings together the fault models created using FaultSim and the design
directions found using REPAIR. PreFix re-executes instructions on a
remote core, which picks up the instructions to execute via a global instruction buffer.
Error prediction and detection are used to reduce the number of re-executed instructions.
PreFix has an area overhead of 3.5 percent in the setup used, and its performance
stays within 5 percent of the fault-free case. This dissertation shows that faults
in processors can be tolerated without explicitly switching off any component, and
that minimal redundancy is sufficient to achieve this.