Control Caching : a fault-tolerant architecture for SEU mitigation in microprocessor control logic by Subramanian, Ganesh Tiruvaiyaru
Retrospective Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 
1-1-2006 
Control Caching : a fault-tolerant architecture for SEU mitigation in 
microprocessor control logic 
Ganesh Tiruvaiyaru Subramanian 
Iowa State University 
Recommended Citation 
Subramanian, Ganesh Tiruvaiyaru, "Control Caching : a fault-tolerant architecture for SEU mitigation in 
microprocessor control logic" (2006). Retrospective Theses and Dissertations. 19055. 
https://lib.dr.iastate.edu/rtd/19055 
Control Caching: A fault-tolerant architecture for SEU mitigation in
microprocessor control logic 
by 
Ganesh Tiruvaiyaru Subramanian 
A thesis submitted to the graduate faculty 
in partial fulfillment of the requirements for the degree of 
MASTER OF SCIENCE 
Major: Computer Engineering
Program of Study Committee: 
Arun K. Somani, Major Professor 
Akhilesh Tyagi 
Gurupur Prabhu 
Iowa State University 
Ames, Iowa 
2006 
Copyright © Ganesh Tiruvaiyaru Subramanian, 2006. All rights reserved.
Graduate College
Iowa State University
This is to certify that the master's thesis of 
Ganesh Tiruvaiyaru Subramanian 
has met the thesis requirements of Iowa State University 
Signatures have been redacted
for privacy
TABLE OF CONTENTS 
LIST OF TABLES   vi 
LIST OF FIGURES   vii
ABSTRACT  ix 
CHAPTER 1. Introduction   1 
1.1 Microprocessor Architectures   1 
1.2 Fault Tolerance in Microprocessors   2 
1.2.1 Soft Errors   3 
1.2.2 Soft Error Masking   4 
1.2.3 Soft error rate mitigation techniques for microprocessors   6 
1.3 Contributions of this thesis   8 
1.4 Organization of this thesis   10 
CHAPTER 2. Control Caching: An Overview   11 
2.1 Microprocessor Control Logic   11 
2.2 Component Schemes Of the SEU Mitigation Technique   12 
2.3 Opcode Dependent Control Caching   16 
2.4 Instruction Dependent Control Caching   16 
2.4.1 Workload Profiling for Control Caching   17 
2.4.2 Cache Architecture Design   18 
2.5 Dynamic Control Signal Protection Scheme   19
2.6 Architecture Evaluation by Fault Injection   20 
2.7 Summary   21 
CHAPTER 3. Control Cache Size Determination using Workload Profiling   22
3.1 Simplescalar Modifications   22 
3.2 Workload Profiling for SPEC2000 Benchmarks   23 
3.3 Matrix Multiplication in OpenRISC 1200 - A Case Study   30 
3.4 Summary   32 
CHAPTER 4. Control Caching Implementation in the OpenRISC 1200 33 
4.1 Motivation for RTL Implementation   33 
4.2 OpenRISC 1200: An Architecture Overview   34 
4.3 Control Signals in the OpenRISC 1200   35 
4.4 Architectural Modifications   36 
4.4.1 Control Caching Address and History Generation   38 
4.4.2 Control Cache Structure   41 
4.5 Summary   42 
CHAPTER 5. Fault Injection Strategies for Control Caching Evaluation   44
5.1 Fault Injection Model   44 
5.2 Simulation Strategy for RTL Models   45 
5.3 Fault Injection in the OpenRISC 1200   46 
5.3.1 Fault Injection on RTL Model   46 
5.3.2 Fault Injection on History Generation RTL Model   50 
5.4 Summary   50 
CHAPTER 6. Architecture and Performance Evaluation   51 
6.1 Fault Tolerance Metrics   51 
6.1.1 Fault Coverage   51 
6.1.2 Miscellaneous Metrics   56 
6.2 VLSI Metrics   57
6.2.1 Area Overhead   57
6.2.2 Operating Frequency Penalty   58
6.2.3 FPGA Synthesis Results   59 
6.3 Summary   60 
CHAPTER 7. Summary and Future Work   61 
7.1 Related Schemes and Comparisons   61 
7.2 Further Work   63 
7.3 Conclusions   63
BIBLIOGRAPHY   65 
ACKNOWLEDGEMENTS  69 
LIST OF TABLES 
Table 3.1 Hit Rates for Direct Mapped and Fully Associative (LRU and FIFO 
Replacement Policies) for Various Cache Sizes - bzip   24 
Table 3.2 Hit Rates for Direct Mapped and Fully Associative (LRU and FIFO
Replacement Policies) for Various Cache Sizes - gzip   25
Table 3.3 Hit Rates for Direct Mapped and Fully Associative (LRU and FIFO 
Replacement Policies) for Various Cache Sizes - vpr   26 
Table 3.4 Hit Rates for Direct Mapped and Fully Associative (LRU and FIFO 
Replacement Policies) for Various Cache Sizes - twolf   27 
Table 3.5 Hit Rates for Direct Mapped and Fully Associative (LRU and FIFO 
Replacement Policies) for Various Cache Sizes - mcf  28 
Table 3.6 Hit Rates for Direct Mapped and Fully Associative (LRU and FIFO 
Replacement Policies) for Various Cache Sizes - gcc   29 
Table 3.7 Hit Rates for Direct Mapped and Fully Associative (LRU and FIFO 
Replacement Policies) for Various Cache Sizes - 64x64 matrix multiplication
program in the OpenRISC 1200 ISA   31
Table 4.1 Control signals in the OpenRISC 1200 datapath   37 
Table 6.1 FPGA Synthesis Results for OpenRISC 1200   60 
LIST OF FIGURES
Figure 1.1 Logical masking of SEU   5
Figure 1.2 Latching window masking and vulnerability window for SETs   6
Figure 2.1 Typical 5-stage microprocessor pipeline with the addressed vulnerabilities
shown thatched   12
Figure 2.2 High level view of the component schemes of the SEU mitigation technique 14
Figure 2.3 High level view of static instruction dependent control caching concept 19 
Figure 2.4 Fault injection methodology for RTL models   21 
Figure 3.1 Variation in hit rates with cache structure and number of cache entries
- bzip   24
Figure 3.2 Variation in hit rates with cache structure and number of cache entries
- gzip   25
Figure 3.3 Variation in hit rates with cache structure and number of cache entries
- vpr   26
Figure 3.4 Variation in hit rates with cache structure and number of cache entries
- twolf   27
Figure 3.5 Variation in hit rates with cache structure and number of cache entries
- mcf   28
Figure 3.6 Variation in hit rates with cache structure and number of cache entries
- gcc   29
Figure 3.7 Variation in hit rates with cache structure and number of cache entries
- 64x64 matrix multiplication in the OpenRISC ISA  31
Figure 4.1 High level view of the OR1200 CPU   35 
Figure 4.2 A 512 entry CAM based FIFO replaced control cache address generation 
module   39 
Figure 4.3 A 512 entry direct mapped control cache address generation module 40 
Figure 4.4 The FSM drawn up for tracking the history of each cache entry 41
Figure 4.5 A 512 entry control caching structure for a 4-bit control signal   42
Figure 4.6 Majority function implementation using universal gates   43 
Figure 5.1 Fault injection causing program error despite Control Caching 47 
Figure 5.2 Fault injection protected by Control Caching scheme   49 
Figure 6.1 Fault coverage in a 512 entry CAM based FIFO replaced control cache 52
Figure 6.2 Fault coverage in a 512 entry direct mapped control cache   53 
Figure 6.3 Fault coverage in a 1024 entry CAM based FIFO replaced control cache 54 
Figure 6.4 Fault coverage in a 1024 entry direct mapped control cache   55 
ABSTRACT 
Fault tolerance at the processor architecture level has become increasingly important
due to rapid advancements in the design and usage of high performance
devices and embedded processors. System level solutions to the challenge of fault tolerance
flag errors and utilize penalty cycles to recover through the re-execution of instructions. This 
motivates the need for a hybrid technique providing fault detection as well as fault masking, 
with minimal penalty cycles for recovery from detected errors. 
In this research, we propose `Control Caching', an architectural technique comprising
three schemes to protect the control logic of microprocessors against Single Event Upsets
(SEUs). High fault coverage with relatively low hardware overhead is obtained by using both
fault detection with recovery and fault masking. Control signals are classified as either static
or dynamic, and static signals are further classified as opcode dependent and instruction
dependent. The strategy for protecting static instruction dependent control signals utilizes a 
distributed cache of the history of the control bits along with the Triple Modular Redundancy 
(TMR) concept, while the opcode dependent control signals are protected by a distributed 
cache which can be used to flag errors. Dynamic signals are protected by selective duplication 
of datapath components. The techniques are implemented on the OpenRISC 1200 processor. 
Our simulation results show that fault detection with single cycle recovery is provided for 92%
of all instruction executions. FPGA synthesis is performed to analyze the associated cycle 
time and area overheads. 
CHAPTER 1. Introduction
Microprocessors have come a long way from the 4-bit, 2300-transistor Intel 4004 to the 64-bit,
billion-transistor Pentium versions of 2005. These advancements have been fueled by rapid
developments in VLSI technology and numerous architectural innovations. Microprocessors 
are now used in every walk of life, from embedded systems such as automobiles to mission 
critical applications like nuclear technology and space missions. 
Increasingly miniaturised systems and higher frequencies of operations, as well as utilization 
in hazardous environments, make microprocessors susceptible to faults. Various techniques at 
many levels of the design hierarchy have been developed to protect microprocessors against 
failure due to faults. Fault tolerance can be achieved by usage of redundancy in the spatial, 
temporal or information domain. Spatial redundancy could be in the form of duplicated or 
redundant functional units which perform the same processing as the original functional unit, 
but the results of which are used for confirming error-free execution or for recovery from errors 
caused due to the faults. Temporal redundancy is duplication in the time domain.
In this research, we propose a methodology to utilize spatial redundancy in a temporal 
manner to protect the control logic of processors against faults due to Single Event Upsets 
(SEUs). There are two related concepts, namely, utilization of Triple Modular Redundancy
(TMR) for the control bits in each pipeline stage, and distributed generation of control signals. 
Some parts of the control logic are also protected by selective component duplication. 
1.1 Microprocessor Architectures
Microprocessors are broadly classified as either Complex Instruction Set Computers (CISC)
or Reduced Instruction Set Computers (RISC) depending on their Instruction Set Architecture
(ISA) and internal hardware design.
In the early days of microprocessor development, the developed architectures were mainly
CISC in nature. In fact, the most popular family of microprocessors from Intel was developed
on the basis of CISC principles. CISC architectures tend to have a large number of
instructions in the ISA. There are a minimal number of registers in the architecture, and the
instructions are mostly memory-communication intensive. The datapath is complicated due
to the necessity to support a large number of instructions and also because the instructions
could be of varying sizes. The control signals are mainly generated from microcode units or
microprogram memory. As technology improved, the memory speed could not match up to
the processor's speed. A new architectural style had to emerge which could avoid frequent
memory operations. This led to the development of the RISC architecture.
RISC machines tend to have a limited number of instructions in the ISA. Memory traffic is
greatly reduced by having a load-store architecture in which the instructions tend to operate
more on register data, and communicate with the memory mainly for load and store operations.
Pipelining the datapath is not complicated. All instructions are of a fixed size and the
generation of control signals is usually from the instruction bits.
Flynn's taxonomy classifies microprocessor architectures as Single Instruction Single Data
(SISD), Multiple Instruction Single Data (MISD), Single Instruction Multiple Data (SIMD)
and Multiple Instruction Multiple Data (MIMD) depending on the number of concurrent
instruction and data streams available to the architectures. In early days, SISD architectures
used to dominate. Later on, array processors were developed, which belonged to the SIMD
category. Modern day distributed systems are generally thought to belong to the MIMD type
architecture. MISD architectures are unusual, but they are very important in the field of fault 
tolerance because they can be used to implement redundant parallelism. 
1.2 Fault Tolerance in Microprocessors 
Performance and cost of microprocessors tended to be the most important factors in their 
design till the early 90s. However, reliability is becoming the top concern of today's architects.
High profile bugs such as the 1994 Pentium FDIV bug in floating point operations and the 
public acknowledgment of Sun Microsystems that transient errors caused server failures for 
various e-business sites in 2000 [1] have made fault tolerant and reliable architectures a much 
more important aspect of computer design. In this thesis, we are concerned with fault tolerance 
techniques related to mitigation of soft errors. 
1.2.1 Soft Errors 
One of the most important factors threatening the reliability of future computer systems 
will be soft errors. An error can simply be defined as an unwanted change in the logic value of a 
signal. Soft errors, also referred to as transient errors or single event upsets (SEUs), are caused 
by radiation striking the surface of the silicon that makes up the processor. There are two 
common types for this radiation: alpha particles and high-energy neutrons. Alpha particles 
are generated by decaying impurities from the packaging of the processor. These particles also 
present a problem for merely shielding the chip from radiation, as the shield itself must be free 
from all impurities. The second type of radiation, high-energy neutrons, is naturally present 
in the environment. The number of neutrons present increases as one moves away from the
earth's surface, and causes a large problem for microcontrollers operating in space. Whatever
the source, these particles of radiation can generate a short (100-200 picosecond) pulse of
current. This spike can switch the output of a signal, and lead to a fault. These soft errors are
arbitrary and transient in nature, and are caused by unstable environmental conditions such 
as presence of radiation. 
While modern VLSI technology boosts the performance of microprocessors, it also
increases their susceptibility to soft errors. This is due to the decrease in the amount of charge
necessary to carry and store information in the circuits. This increases the probability that
alpha particles and neutrons hitting the system could introduce errors in the behavior. As
scaling in VLSI technology continues into the nanometer regime, both memory elements and
combinational gates become susceptible to soft errors. This is due to the fact that the transient
pulses induced by soft errors have durations often higher than the gate propagation delays [2].
A soft error can propagate without masking and hence affect the system behavior. Further,
higher clock speeds decrease the cycle time, increasing the probability that a soft error is
latched. These trends imply that future digital systems need to be protected against both
Single Event Transients (SETs) and SEUs [3].
Memories have always been more susceptible to errors than the processor pipeline datapath
itself, and hence, some recent research has focused on estimating the soft error rate in the
context of memory technology [4, 5]. Traditionally, error-correcting codes (ECCs) have been
used to ensure memory reliability [6]. ECCs can drastically reduce the chance that an error will
cause system failure. Usage of information redundancy as described above in the form of either
parity or Hamming codes is usually reserved for caches and memories, and their application to
internal registers involves exorbitant cost. Currently, a given single-bit is expected to flip in a
RAM only once every many billions of hours of operation. However, with the growing sizes of
RAMs and other hardware components, one can expect the error frequency to become much
more noticeable in the near future. Also, processors deployed in hazardous environments will
have a higher probability of getting affected by bit-flips.
1.2.2 Soft Error Masking 
It is hard to measure the frequency of occurrence of soft errors because not all errors 
necessarily translate into system failures. The final effect of the soft error is determined by 
the location and timing. A particular soft error may be inconsequential or may propagate for 
a certain time without affecting a computational result or the flow of instructions through the 
pipeline. In the absence of fault-tolerance features in the architecture, there are three main 
natural masking effects which could prevent the fault caused by the change in the signal value 
from manifesting itself as an error. They are: 
• Logical Masking 
• Electrical Masking 
• Latching Window Masking 
Logical masking involves the combinational logic segment of the circuit. If the node of the 
circuit affected by the SEU is not currently affecting the output of the circuit, then the SEU 
will not cause an error. Figure 1.1 shows a scenario in which the output of the top OR gate is 
affected by a SEU and shifts its logic value from 1 to 0. However, the bottom OR gate outputs 
a 0 for the present cycle, and hence the AND gate will yield the correct output of 0 to be 
latched in the register on the clock edge. Thus, a soft error due to the SEU is not manifested 
due to logical masking by the AND gate. 
Figure 1.1 Logical masking of SEU
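The logical masking scenario of Figure 1.1 can be reproduced with a small gate-level model. The following is a minimal sketch, assuming two 2-input OR gates feeding one AND gate; the function name `circuit` and the input names are hypothetical and chosen only for illustration.

```python
def circuit(a, b, c, d, flip_top_or=False):
    """Two OR gates feeding an AND gate (assumed structure of Figure 1.1)."""
    top = a | b
    if flip_top_or:
        top ^= 1  # model an SEU inverting the output of the top OR gate
    bottom = c | d
    return top & bottom


# Bottom OR gate outputs 0: the AND gate output is 0 with or without the upset.
masked = circuit(1, 0, 0, 0) == circuit(1, 0, 0, 0, flip_top_or=True)

# Bottom OR gate outputs 1: the upset reaches the AND gate output.
propagated = circuit(1, 0, 1, 0) != circuit(1, 0, 1, 0, flip_top_or=True)
```

In the first case the upset is logically masked; in the second case the same upset would be latched as an error if it persisted until the clock edge.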
The physical properties of the gates will cause attenuation of the transient pulse signal as
it passes through. Sometimes this effect is enough to cancel the spike caused by the radiation
and thus no fault will occur. This is termed electrical masking.
The last effect that can cancel the SEU's ability to transition into a soft error is that
of latching window masking. Since the combinational logic in a circuit is assumed to vary
between clock cycles, the only value that matters is what is present right before the rising edge
that latches the value. Thus, transient errors which occur during the downtime between clock
cycles and whose effects do not get latched cannot result in a failure. Figure 1.2 shows the earliest
and latest points of time in a clock cycle that an SET may occur and get latched [7].
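The vulnerability window can be illustrated with a simplified timing model. The sketch below assumes that a transient is latched whenever it overlaps a window defined by hypothetical setup and hold times around the latching clock edge; the parameter names and the numbers used are assumptions for illustration and are not taken from the cited work.

```python
def set_is_latched(start_ps, width_ps, edge_ps, setup_ps, hold_ps):
    """True if a transient pulse overlaps the latching window (simplified model)."""
    window_open = edge_ps - setup_ps
    window_close = edge_ps + hold_ps
    pulse_end = start_ps + width_ps
    return start_ps < window_close and pulse_end > window_open


def vulnerability_window(width_ps, edge_ps, setup_ps, hold_ps):
    """Earliest and latest pulse start times that can still be latched."""
    return (edge_ps - setup_ps - width_ps, edge_ps + hold_ps)


def latch_probability(width_ps, setup_ps, hold_ps, period_ps):
    """Fraction of the clock period in which a pulse start leads to latching,
    assuming the strike time is uniformly distributed over the cycle."""
    return min(1.0, (width_ps + setup_ps + hold_ps) / period_ps)


# A 150 ps pulse against a clock edge at 1000 ps with 30 ps setup, 20 ps hold.
early_pulse = set_is_latched(500, 150, 1000, 30, 20)   # ends before the window
late_pulse = set_is_latched(900, 150, 1000, 30, 20)    # overlaps the window
window = vulnerability_window(150, 1000, 30, 20)
```

Under this model, halving the clock period doubles the value returned by `latch_probability`, which is consistent with the observation that higher clock speeds make latching of a transient more likely.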
Although these 'built-in' factors work to mitigate the effects of SEUs and SETs, fault
tolerance of combinational logic is still needed. This has become more of an issue with recent
advances in technology. As processors become faster, with more pipeline stages, the ability of
latching window masking to hide soft errors is greatly decreasing. Also, as devices get smaller
and smaller, the effect of a single radiation strike becomes more damaging. For these reasons,
there has been much research into providing fault tolerance for microprocessors.
Figure 1.2 Latching window masking and vulnerability window for SETs (diagram labels: clock, latching clock edge, SET active duration, non-latching SETs, earliest latching SET, latest latching SET)
1.2.3 Soft error rate mitigation techniques for microprocessors 
Reliable systems usually employ hardware techniques to address soft errors. Lower-cost and
more flexible alternatives in the form of some software techniques have been developed. Software
Implemented Fault Tolerant (SIFT) techniques allow the implementation of dependable
systems without incurring the high cost of custom hardware design and hardware redundancy.
SIFT techniques addressing SEUs were developed in the early 2000s [8, 9, 10] and complemented
older techniques such as ABFT [11] and Control Flow Check [12]. A recently proposed
software-only, transient fault detection technique is SWIFT (Software Implemented Fault
Tolerance). It is a compiler-based transformation which duplicates instructions in a program and
inserts comparison instructions at suitable points during code generation. Before transient
faults can adversely affect program output, the values which are computed twice (redundancy
in the time domain) are compared for equivalence [13].
Hardware techniques for fault-tolerance are broadly classified as belonging to the informa-
tion domain, space domain or time domain. Redundancy in the information domain has been 
discussed earlier. Space redundancy is achieved by carrying out the same computation on 
multiple independent functional units at the same time. Errors are brought out on comparison 
of the redundant results. A majority election scheme can be used to obtain a correct answer 
for systems implementing TMR or higher redundancy under certain conditions of failure. In 
case there are only two redundant units available, the computation must be restarted if the 
two results do not match. Redundancy in the time domain avoids the large hardware overhead 
of space redundancy. It works by repeating the computation on the same hardware multiple 
times. Time redundant fault-tolerance techniques have the shortcoming that persistent hard-
ware faults may introduce identical errors to all redundant results, making errors indiscernible. 
There exists a complementary shortcoming in space redundant techniques because a transient 
failure mechanism may affect the space redundant hardware identically. 
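The difference between a TMR majority election and a duplex comparison can be sketched in a few lines. This is an illustrative model only; the function names are hypothetical, and results are represented as integers whose bits stand for the outputs of the redundant units.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three redundant results."""
    return (a & b) | (b & c) | (a & c)


def duplex_check(a, b):
    """Return (result, ok). A mismatch can only be detected, not corrected,
    so the computation must be restarted when ok is False."""
    return a, a == b


correct = 0b1011
faulty = correct ^ 0b0100  # a single bit flipped in one unit

# TMR masks a fault confined to one unit.
single_fault = tmr_vote(correct, faulty, correct)

# The same fault in two units defeats the vote: the wrong value wins.
common_fault = tmr_vote(faulty, faulty, correct)

# Duplex detects the mismatch but cannot tell which copy is right.
_, duplex_ok = duplex_check(correct, faulty)
```

The `common_fault` case corresponds to the shortcoming noted above, where a failure mechanism affects the redundant hardware identically.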
The IBM z900 [14] and Compaq Nonstop Himalaya [15] are examples of commercial fault-tolerant
computer systems which employ a combination of redundancy techniques described
above. IBM z900 (previously S/390) employs extensive fault-tolerant mechanisms for approximately
20% to 30% of all logic inside the system. In particular, microprocessors for IBM
mainframes employ two fully-duplicated lock-step pipelines. If the two pipelines disagree in
the result of an instruction, the processor carries out extensive hardware checks, and, on transient
errors, has the ability to restore the program state from a special hardware checkpoint
module. The whole process can take up to several thousand processor cycles. Compaq Nonstop
Himalaya has two stock Alpha processors running the same program in complete lock
step. The outputs of the two processors are compared on the external pins in every clock
cycle. Operations are immediately suspended if there are disagreements, in order to prevent
the corruption of the memory and storage subsystems. There is, however, no hardware support
provided for seamless recovery from transient failures.
There are a multitude of SEU tolerant architectures which have been proposed from the 
research community. AR-SMT [16] proposes using the multithreading capability of modern
processors to execute the program and a duplicate of the program in parallel as two threads.
DIVA [17] uses spatial redundancy by providing a separate, slower pipeline processor alongside
the fast processor. The same instructions are executed on both processors. With both the spatial
and temporal redundancy schemes, the results of the instructions will be checked to ensure 
proper execution before being committed. Some special techniques have been proposed for 
superscalar processor cores. The Selective Series Duplex architecture consists of an integrity 
checking architecture for superscalar processors that can achieve fault tolerance capability 
of a duplex system at much less cost than the traditional duplication approach. It involves 
the combination of the CPU core pipeline in series with another pipeline which performs 
the re-execution of the instructions processed in the main pipeline. In case there are any 
mismatches in the operations of the two pipelines, a recovery process is triggered [18] . The 
REESE architecture takes advantage of spare elements in a superscalar processor to perform 
redundant execution. It minimizes the time overhead and also decreases it by strategic addition 
of a small number of functional units in the pipeline [19]. Joydeep Ray et al. propose a fault-tolerant
extension for out-of-order superscalar processors in which a single processor selectively
delivers fault-tolerance operating below potential, while reverting to full performance otherwise.
This is achieved by providing extensions to take care of instruction replication in the dispatch 
stage, capability of fault detection and recovery [20] . It should be noted that most schemes can 
provide only fault detection, not correction, and must rely on flushing the pipeline to perform 
error recovery. 
1.3 Contributions of this thesis 
The previous section outlined a number of architectural techniques to protect the core 
datapath of the processor from transient faults. However, the aspect of the control logic 
is largely ignored. This is due to the fact that control signals are not easily amenable to
protection using the usual fault tolerance techniques. Control signals are either static or
dynamic in nature, depending on their values for the same instruction at different points of 
execution time. It has been proposed to use the concept of caching signatures to determine 
faults in static control signals. Redundancy at component level handles dynamic signals [21] . 
The main shortcoming of all the above techniques is that the aspects of fault tolerance for 
various segments of the architecture are handled separately. Further, the techniques are too 
dependent on the nature of the processor. Some of them are applicable only to micro-controllers 
[22], some are restricted to superscalar architectures [18, 19, 20], and so on. Thus, there are 
many techniques which compare results of the execution of instructions in the writeback stage 
to ensure that the execution was error free. In case of a mismatch, the instruction is re-
executed. This results in penalty cycles. If the faults are masked to ensure that there is not 
much need for instruction re-execution, the penalty in the temporal domain could be avoided. 
This research presents a methodology to obtain high coverage with relatively low hardware 
overhead. 
We classify control signals as static and dynamic depending on variations in their values 
with the execution of the instruction at different points of time. Static control signals are 
further classified as instruction dependent and opcode dependent. We propose a `Control 
Caching' technique of three integrated schemes to protect different types of control signals, 
and to overcome the shortcomings of the present control logic protection schemes. It operates 
by providing for both masking as well as detection and correction of faults. 
The technique to protect instruction dependent static control signals exploits program lo-
cality of processor workloads. A novel methodology to take advantage of spatial redundancy 
in a temporal manner is presented. The spatial redundancy is achieved by usage of the TMR 
concept for instruction and control signal bits in each pipeline stage. The temporal redun-
dancy comes in the form of the workload program characteristics, in which software loops 
cause instructions to execute more than once. Faults caused in most iterations, which could 
have generated penalty cycles in other techniques, are completely masked. The proposed techniques
and design methodology are generic enough to be adapted during both the design and
implementation phase of any microprocessor or microcontroller. The technique to protect opcode
dependent static control signals relies on a distributed copy of control signals indexed by
the instruction opcode. It is not reliant upon programs operating in loops. Dynamic control 
signals are handled by selective duplication of datapath components. 
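One possible reading of the instruction dependent scheme is sketched below: control bits observed in earlier executions of an instruction are cached by instruction address and voted against the freshly decoded bits. This is only an illustrative model under stated assumptions (a direct mapped cache of 512 entries, 4-byte instructions, two cached copies per entry, no history tracking); the class and method names are hypothetical, and the actual hardware design presented in the later chapters may differ.

```python
class ControlCacheSketch:
    """Hypothetical model: vote decoded control bits against two cached copies."""

    def __init__(self, entries=512):
        self.entries = entries
        self.lines = {}  # index -> (tag, list of cached control-bit copies)

    def protect(self, pc, decoded_bits):
        """Return (control_bits_to_use, masked_by_vote)."""
        word = pc >> 2  # assumes fixed 4-byte instructions
        index, tag = word % self.entries, word // self.entries
        line = self.lines.get(index)
        if line is None or line[0] != tag:
            # Miss: start a new entry; this execution is not protected.
            self.lines[index] = (tag, [decoded_bits])
            return decoded_bits, False
        copies = line[1]
        if len(copies) < 2:
            # Still filling the entry; this execution is not protected.
            copies.append(decoded_bits)
            return decoded_bits, False
        a, b = copies
        voted = (a & b) | (b & decoded_bits) | (a & decoded_bits)
        return voted, True


cache = ControlCacheSketch()
first = cache.protect(0x100, 0b1010)   # first loop iteration, fills the entry
second = cache.protect(0x100, 0b1010)  # second iteration, fills the entry
third = cache.protect(0x100, 0b1110)   # third iteration with one upset bit
```

In this sketch the first two executions of an instruction are unprotected, so a fault during the fill phase would be cached. Handling that case is one reason a real design would need to track the history of each entry.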
A software profiling tool based on SimpleScalar is developed to analyze the workloads of the 
system and decide upon the optimum amount of spatial redundancy to use. This is particularly 
relevant in the case of embedded systems, where the type of workload is usually restricted. The 
application of this tool for a single-issue in-order pipelined processor has been demonstrated
using SPEC2000 benchmarks. 
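The kind of measurement such a profiling tool produces can be illustrated on a toy instruction address trace. The sketch below is not the SimpleScalar based tool itself; it is a minimal stand-alone model, assuming 4-byte instructions and a trace given as a list of program counter values, that computes hit rates for a direct mapped cache and for a fully associative cache with LRU replacement.

```python
from collections import OrderedDict


def direct_mapped_hit_rate(pcs, entries):
    """Hit rate of a direct mapped cache indexed by instruction address."""
    tags = [None] * entries
    hits = 0
    for pc in pcs:
        word = pc >> 2  # assumes fixed 4-byte instructions
        index, tag = word % entries, word // entries
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag  # miss: replace the resident entry
    return hits / len(pcs)


def fully_associative_lru_hit_rate(pcs, entries):
    """Hit rate of a fully associative cache with LRU replacement."""
    cache = OrderedDict()
    hits = 0
    for pc in pcs:
        if pc in cache:
            hits += 1
            cache.move_to_end(pc)  # mark as most recently used
        else:
            if len(cache) >= entries:
                cache.popitem(last=False)  # evict the least recently used
            cache[pc] = True
    return hits / len(pcs)


# Toy workload: a 4-instruction loop body executed 10 times.
trace = [0x100, 0x104, 0x108, 0x10C] * 10
large_dm = direct_mapped_hit_rate(trace, 8)
small_dm = direct_mapped_hit_rate(trace, 2)
large_lru = fully_associative_lru_hit_rate(trace, 8)
small_lru = fully_associative_lru_hit_rate(trace, 2)
```

When the cache holds the whole loop body, only the first iteration misses. When it is smaller than the loop body, this cyclic access pattern misses on every access in both organizations, which is why the cache size has to be chosen from the workload profile.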
The results from the profiling are used in the implementation of the proposed architectural 
modifications on the RTL model of a simple RISC processor used in embedded applications. 
The adaptability of the concept to different architectures is also brought out. 
1.4 Organization of this thesis 
A brief introduction to the background and work in the area of the problem being tackled 
has been provided in this chapter. Chapter 2 gives an overview of the control caching schemes 
and also brings out the need for an initial profiling of the workloads in the design phase for 
the static instruction dependent technique. Chapter 3 brings out the details of the developed 
workload profiling tool and presents details of the profiling of SPEC2000 benchmarks for imple-
menting the technique on a simple microprocessor based on the Alpha ISA. Chapter 4 presents 
details of the architectural modifications on the RTL description of a simple microprocessor 
geared towards embedded applications. Chapter 5 discusses the fault injection techniques used 
to determine the effectiveness of the proposed architectural modifications. Chapter 6 details 
the evaluation of the scheme from the fault tolerance as well as VLSI perspective. Chapter 7 
concludes the thesis by comparing the proposed scheme with similar related ideas and giving 
directions for future work. 
CHAPTER 2. Control Caching: An Overview 
The task of fault tolerance for control signals in a microprocessor datapath is tackled by 
classifying them at the design stage on the basis of their attributes and providing techniques for 
their protection. This chapter deals with microprocessor control logic and its classification
for the purpose of the proposed technique. This is followed by a brief sketch of the component 
schemes. 
2.1 Microprocessor Control Logic 
All modern day microprocessors are based on a pipeline structure comprised of fetch, de-
code, execute, memory access and write-back stages. The control signals of a fetched instruction 
are generated in the decode stage and traverse through the pipeline along with the instruction 
itself. The purpose of the proposed technique is to tackle SEUs / SETs in the control logic 
segment of the datapath, as shown in Figure 2.1, without making any assumptions about the 
number of pipeline stages. 
The control signals of an instruction can be broadly classified as being either static or 
dynamic. Static control signals are those which remain invariant irrespective of the state 
of the processor when the instruction executes. For example, the register write signal for an 
instruction is not dependent upon the processor state, but only on the instruction itself. Hence, 
it is a static control signal. Dynamic control signals, on the other hand, are dependent upon 
the state of the processor when the instruction executes. For example, the signal indicating a 
taken branch could be dependent on whether the present contents of two registers are equal. 
Due to the fact that it changes depending on its execution point in the time domain, it is 
classified as a dynamic control signal. 
Figure 2.1 Typical 5-stage microprocessor pipeline (instruction fetch, instruction decode, execute, memory access, writeback) with the addressed vulnerabilities shown thatched
Static control signals are further classified as being dependent either on the complete
instruction or on the instruction opcode alone. The select control signal of the multiplexer at the
input of the ALU is an example of the former. The select line could choose the contents from
the register file, from the immediate field or from the forwarding logic. In any case, it is
dependent completely on the instruction and related dependencies, and can't be decided based
upon the opcode alone. The write control signal for the register file, in contrast, is dependent
upon the opcode only. Hence, it is an example of a static control signal dependent upon
instruction type.
It would be difficult to design a single scheme to protect all the above types of control
signals. Hence, a technique consisting of three integrated schemes is proposed to provide fault
tolerance for the control logic, one for each type of control signal. The next section provides a
brief sketch of the components of the proposed technique.
2.2 Component Schemes of the SEU Mitigation Technique 
A common quantitative principle used in computer architecture is the concept of locality
of reference. It refers to the observation that a program typically spends about 90% of its
execution time in 10% of its code [23]. This is due to the fact that most common workloads are
heavily oriented towards looping structures. An implication is that there is a high probability
that an instruction at a particular address will be processed multiple times in the course of a
program's execution. The scheme for protecting instruction dependent static control signals
works on the assumption that an SEU / SET will not affect a particular control signal (say, the
write enable signal for the register file) in two out of three consecutive iterations of the same
instruction. If each control bit of a particular instruction from the previous two iterations is
stored, we would be able to maintain a history of these signals. In other words, the control
signal bit is cached in the pipeline stage in which it is utilized. At any given point of time from
the third iteration of an instruction's execution, we would have three values of the signal to
choose from: the current value and the two stored ones. A majority function can be implemented
on the three values to arrive at a TMR based decision. The technique can provide complete fault
masking from the third iteration of a loop, and detection from the second iteration.
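The voting step can be illustrated with a minimal Python sketch. It assumes that the control signals of one pipeline stage are packed into an integer control word; the function name majority3 and the example bit patterns are illustrative and do not come from the thesis.

```python
def majority3(current: int, previous: int, before_previous: int) -> int:
    """Bitwise 2-of-3 majority over three copies of a control word.

    Each result bit takes the value held by at least two of the three
    inputs, so an upset confined to one copy is masked.
    """
    return ((current & previous)
            | (previous & before_previous)
            | (current & before_previous))


# Hypothetical 4-bit control word; bit 1 is upset in the current iteration.
history = (0b1001, 0b1001)   # values cached in the two previous iterations
upset_word = 0b1011          # value arriving in the third iteration
voted = majority3(upset_word, *history)
```

Here voted equals the fault-free word 0b1001, because the flipped bit is outvoted by the two cached copies.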
Static opcode dependent signals are easier to protect since a static cache indexed by the 
opcode can store the different control signals in each pipeline stage. A comparison of the 
indexed entry with the present value of the control signal can help in detecting faults. 
Dynamic signals are the most difficult to protect of the three different control signal types.
Duplicating the component that produces a dynamic control signal and comparing the two
results detects faults.
A high level view of the working of the component schemes is shown in Figure 2.2. It traces
the execution of three sample instructions in a particular program. The instruction at PC value
100H, (add $r3, $r0, $r0), is executed only once, while the instruction at PC value 160H, (addi
$r4, $r4, -1), is executed once every time the program goes past the `LOOP' label. The third
instruction under consideration (bf LOOP) is at PC value 180H, and modifies the PC value
to enable program execution from the `LOOP' label if the flag bit in the status register is set.
As the program executes for the first time, the instruction at PC value 100H is processed. The
static opcode dependent control signals are protected by fault detection and correction since
the datapath already has a distributed cache of control signal values corresponding to different
opcodes. To illustrate the working of this component scheme, let us consider the register
write signal in the writeback stage. In this particular case, the opcode is that of an R-type
instruction. As soon as the instruction reaches the writeback stage, the opcode value indexes
into the writeback stage control cache and fetches all the relevant control signals. The fetched
control signals are compared against the values propagating along with the instruction through
the pipeline. Any mismatch triggers a recovery mechanism, details of which are presented in a
later section. It must be noted that control signals which vary from instruction to instruction
can't be protected by this component scheme.

Figure 2.2 High level view of the component schemes of the SEU mitigation technique

As the program continues executing, the instruction at PC value 160H is processed for
the first time. The static instruction dependent control signals protection scheme begins
functioning, and recognizes the fact that the instruction is being executed for the first time. To
illustrate the functioning of this scheme, let us consider the destination register number in the
register writeback stage. The destination register number is utilized for the writeback, but is
also written into the instruction dependent control cache at the address corresponding to the
instruction's PC value.
The program continues executing and reaches the instruction at PC value 180H. This 
instruction reads the value of the status register, whose fields are dynamic control signals. The 
values are generated by the instruction at PC value 176H. Either a second copy of the status 
register is utilized to decide the outcome of this instruction, or the component utilized by 
the instruction at 176H is duplicated. Thus, dynamic control signals are handled by selective 
duplication of processor components. 
For the purpose of this illustration, let us assume that the branch is taken and control shifts 
back to the instruction at the `LOOP' label. The instruction at 160H is executed again, and 
this time the logic associated with the instruction dependent control caching scheme recognizes 
that the instruction at that particular PC value has been executed before. The control signals 
are compared with the signals from the previous iteration and a mismatch triggers a recovery 
mechanism, just as in the static opcode dependent control signals protection scheme. This is 
done only for the second execution instance, in order to improve fault coverage. In addition 
to this comparison, the control signals of the instruction during this iteration are written into 
the relevant cache. The previous contents, instead of being overwritten, are shifted along, in 
such a way that the third execution instance of the particular instruction has at its disposal 
the history of control signals from the two previous iterations. A TMR based decision is made 
possible from the third iteration onwards with the aid of a moving window of the history of 
the control signals. 
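The three phases described above (use directly, compare, vote) can be sketched in Python as follows. This is a toy behavioural model, not the RTL of the thesis: it assumes entries are keyed directly by PC with unbounded capacity, that the raw (unvoted) control word is what gets written into the history, and that recovery is left to the caller; the class and method names are hypothetical.

```python
from typing import Dict, List, Tuple


class InstructionControlCache:
    """Toy model of the instruction dependent control cache for one stage."""

    def __init__(self) -> None:
        # pc -> [previous value, value before the previous one]
        self.history: Dict[int, List[int]] = {}

    def access(self, pc: int, word: int) -> Tuple[int, bool]:
        """Return (control word to use, mismatch flag) for one execution."""
        past = self.history.setdefault(pc, [])
        if not past:
            # First execution: nothing to compare against, use the word as is.
            used, mismatch = word, False
        elif len(past) == 1:
            # Second execution: detection only, by comparison.
            used, mismatch = word, word != past[0]
        else:
            # Third execution onwards: bitwise 2-of-3 majority vote.
            a, b = past
            used = (word & a) | (a & b) | (word & b)
            mismatch = used != word
        # Shift the moving window: newest value first, keep two values.
        past.insert(0, word)
        del past[2:]
        return used, mismatch


cache = InstructionControlCache()
# Four executions of the instruction at 160H; the third word has one bit upset.
results = [cache.access(0x160, w) for w in (0b0100, 0b0100, 0b0110, 0b0100)]
```

In this run the third access returns the voted word 0b0100 with the mismatch flag set, and the fourth access is again clean because only one of the three values in the window is corrupted.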
It must be noted that the additional hardware involved, such as the gates involved in the
TMR based decision, is also protected against faults due to the inherent fault masking nature
of the component schemes, since it results in erroneous masking only when two components
become faulty at the same time. Further, the additional cache memory instantiated is assumed
to be protected by error correcting codes (ECC). ECC is a common feature in modern RAM
libraries, and is hence not elaborated further in this thesis.
2.3 Opcode Dependent Control Caching 
The majority of the control signals in a microprocessor datapath fall under the static 
instruction dependent category. Opcode dependent control signals form the next biggest set. 
As outlined in a brief sketch in the previous section, the protection for the opcode dependent 
control signals is facilitated by the addition of a distributed storage of control signals indexed 
by the opcode. The term `distributed' here implies that each pipeline stage has a cache which 
stores only the control signals to be processed in that particular stage. Thus, the size of the 
cache in each stage will be dependent on the number of control signals in that stage. 
The opcode of the fetched instruction is processed to yield the address in the opcode control 
cache. This address flows through the pipeline along with the instruction. The control signal 
and the contents read from the cache are compared, and any mismatch implies a fault in the 
control signal. Before the faulty control signal can be used, the pipeline should be frozen. 
This freeze lasts long enough for the transient or upset to die down. This implies a strategic 
placement of the cache contents for each signal such that the fault detection happens in the 
stage prior to the usage of that particular control signal. 
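The lookup and comparison for one stage can be sketched in Python as below. The cache contents, opcode names and signal names are hypothetical placeholders chosen for illustration; the thesis does not list them at this point, and the sketch only reports that a freeze is needed rather than modelling the pipeline itself.

```python
from typing import Dict, List

# Hypothetical contents of the writeback stage opcode control cache.
# Opcode names and signal names are illustrative only.
WRITEBACK_CONTROL_CACHE: Dict[str, Dict[str, int]] = {
    "R_TYPE": {"reg_file_write": 1},
    "STORE": {"reg_file_write": 0},
}


def mismatched_signals(opcode: str, propagated: Dict[str, int]) -> List[str]:
    """Names of control signals that differ from the cached reference."""
    reference = WRITEBACK_CONTROL_CACHE[opcode]
    return [name for name, value in reference.items()
            if propagated.get(name) != value]


def must_freeze(opcode: str, propagated: Dict[str, int]) -> bool:
    """True when any mismatch is seen, i.e. the pipeline should be frozen."""
    return bool(mismatched_signals(opcode, propagated))
```

For example, must_freeze("R_TYPE", {"reg_file_write": 0}) reports a fault, since an R-type instruction is expected to assert the register file write signal in this illustrative table.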
In order to ensure that the opcode itself is free from transient errors, it is treated as 
an instruction dependent control signal and is protected by the static instruction dependent 
control signal scheme outlined in the next section. 
It must be noted that this scheme is not dependent upon the programs operating in loops. 
It offers protection to all the executed instructions. On the downside, there is no fault masking 
and the scheme can't prevent penalty cycles while recovering from faults. 
2.4 Instruction Dependent Control Caching 
A brief sketch of the static instruction dependent control signals protection scheme has 
been already outlined. In this section, we discuss the finer details of the scheme. These include 
the logic associated with determining whether an instruction under consideration is part of a 
loop, and if so, determining the number of times it has been executed before. The history
of control signals can't be stored for the entire program, but only for a limited
number of instructions. How many instructions to keep track of, and how to accommodate
new instructions once all the entries are filled, are open design issues. In other words, the cache
organization for the control cache has to be decided. Another interesting aspect is the design
of the distributed cache structure.
2.4.1 Workload Profiling for Control Caching 
Knowledge of the type of workloads expected to be processed can help in designing efficient
architectures. In the case of the proposed technique, it is all the more relevant because the
amount of spatial redundancy required is also dictated by the workload. The amount of spatial
redundancy refers to the number of instructions for which the history is stored, or in other
words, the number of cache entries.
The profiling consists of recording the trace of the instruction addresses accessed in
each cycle of the processor's operation. A cache simulator takes this trace as its input and
determines the type of cache organization and number of entries which yield the
highest hit rate. The hit rate is directly related to the amount of coverage obtained by the
proposed fault tolerance technique, since the majority function module comes into operation
only when the particular instruction is executed more than two times.
The type of cache structure to use must also be determined.
A direct mapped structure and a fully associative structure are two options lending themselves
to easier design. For a fully associative structure, the replacement policy must also be
chosen. The number of entries needs to be determined for both types of structure. The most
popular replacement policies are Least Recently Used (LRU) and First In First Out (FIFO)
[23]. Cache sizes are dependent upon the complexity of the processor and can range
from as low as 4 entries to as high as 1K entries.
The developed profiling tool determines the hit rates for a given instruction address trace 
for various cache sizes, organizations and replacement policies. Further details are described 
in Chapter 3. 
2.4.2 Cache Architecture Design 
The workload profiling helps in making the decision regarding the number of cache entries
and also the type of cache structure to use. The cache architecture is a distributed one, with
entries in each pipeline stage. This is necessary because the control bits are used
throughout the pipeline. The tag tracking segment of the cache is placed
in the pipeline stage where the instruction is fetched. This enables the succeeding pipeline
stages to efficiently identify the address of the cache table entry.
A high level view of the operation of the scheme for a single bit of a control signal is shown
in Figure 2.3, which depicts the execute stage of a conventional five stage pipelined RISC type
microprocessor [23]. The control signals for each pipeline stage travel along with the data in
the pipeline registers. A particular control signal might be used to select the first operand of 
the ALU from either the register contents or the sign-extended immediate data (both of which 
are stored in the pipeline registers). Assume that a control signal is generated for the select 
line of the multiplexer which selects the ALU inputs, and the instruction is being executed for 
the first time, i.e., it is the first iteration of the loop. The control signal bypasses the majority 
gate and is used directly. In the meanwhile, it is also stored in the first memory element of 
the instruction's entry corresponding to that control signal in the control cache. During the 
second iteration of the loop, the entire action is repeated. However, the control signal stored 
in the first memory element shifts along the shift register arrangement and results in a history 
of the last two control signals. When the loop executes for the third iteration, the multiplexer 
which chooses the control signal opts to take the output of the majority function instead of 
the control input directly. Thus, a decision on the control signal to use is made with the aid of 
the recent history of execution of that instruction. This scheme is guaranteed to work under 
the following conditions: 
• The first two iterations of the loop don't have any control signal errors.
• No two out of three consecutive iterations of the instruction in the loop have an error in
the same control signal.

Figure 2.3 High level view of static instruction dependent control caching concept

Chapter 4 presents further details of the modifications necessary in the RTL model of a 
microprocessor to implement the above technique. 
2.5 Dynamic Control Signal Protection Scheme 
Dynamic control signals are the most difficult to protect due to their inherent unpredictability.
Fortunately, they account for the smallest number of control bits. The scheme to protect these
bits relies on selective duplication of the source. Analysis of an industry standard microprocessor
for embedded applications [29] reveals that only 17% of the control bits are dynamic in
nature. These signals do not originate from the control unit, but are produced as a result of
some operation in the datapath (such as branch or exception related signals, or signals relating
to stalls) or some external source (such as trap and system call signals).
For example, the signal indicating a taken branch might be dependent on the value in 
the status register. Duplicating this status register (or the component writing to the status 
register) yields two values, which can be compared to determine faults (before writing to 
the status register itself), if any. Stall detection logic can again be managed with minimal 
redundancy in the form of duplication of comparators and source signals. External signals can 
be made to propagate on two separate lines and their integrity can be ensured before utilization 
in the datapath. 
At the expense of some penalty cycles, the execution can be frozen till the concerned signals 
match each other. Similar to the scheme outlined in Section 2.3, protection by detection and 
correction is offered to all instructions, irrespective of the looping nature of the program. There 
is no fault masking, and detected faults invariably lead to penalty cycles. However, it must 
be kept in mind that in comparison with existing schemes, the number of penalty cycles for 
protecting the same amount of logic is reduced by a considerable amount. The exact reduction 
in the number of penalty cycles can be determined by evaluating the ratio of the number of 
control signal bits protected by complete fault masking to the total number of control bits. 
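The duplicate-and-compare behaviour, including the penalty cycles spent while the copies disagree, can be sketched in Python as follows. This is only a behavioural illustration under stated assumptions: the two callables stand in for the duplicated hardware sources, the transient is modelled as lasting a fixed number of evaluations, and the names and the max_cycles bound are hypothetical rather than taken from the thesis.

```python
from typing import Callable, Tuple


def make_transient_unit(correct: bool, upset_cycles: int) -> Callable[[], bool]:
    """Source of a dynamic signal that is wrong for its first upset_cycles reads."""
    state = {"cycle": 0}

    def unit() -> bool:
        faulty = state["cycle"] < upset_cycles
        state["cycle"] += 1
        return (not correct) if faulty else correct

    return unit


def evaluate_with_duplication(unit_a: Callable[[], bool],
                              unit_b: Callable[[], bool],
                              max_cycles: int = 8) -> Tuple[bool, int]:
    """Re-read both copies until they agree; return (value, penalty cycles)."""
    for penalty in range(max_cycles):
        a, b = unit_a(), unit_b()
        if a == b:
            return a, penalty
    raise RuntimeError("copies never agreed; the fault does not look transient")


# Copy A suffers a one-cycle transient, copy B is fault free.
value, penalty = evaluate_with_duplication(make_transient_unit(True, 1),
                                           make_transient_unit(True, 0))
```

In this run the first comparison fails, execution is held for one penalty cycle, and the second comparison yields the correct value. Note that comparison alone cannot tell which copy was wrong, which is why the scheme detects and waits rather than masks.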
2.6 Architecture Evaluation by Fault Injection 
The effectiveness of fault-tolerance architectures can be evaluated in two ways, namely, 
injection of physical faults into the target system by, say, exposure of the system to direct 
radiation [24] or fault injection and simulation in the RTL model [25, 26, 27]. While the 
former methodology can be adopted only in the post-silicon phase, the latter approach can 
be used in the design phase itself. In order to determine the effectiveness of the proposed 
architecture in tolerating SEUs / SETs, we perform fault injection and simulation on the RTL 
model. 
The fault injection methodology consists of identifying the fault injection points followed
by replacing each of the nodes with a multiplexer as shown in Figure 2.4. An SEU can cause
a particular node to become stuck at 1, or stuck at 0 or cause the logic level of the signal
to change. Depending on the type of fault to inject, the select lines of the multiplexer can be
set. In case a fault-free simulation is to be performed, the select lines of the multiplexer can
be set to pass the unaltered signal through. The modifications outlined in Figure 2.4
imply that no fault is injected if the select lines of the multiplexer are set to 00, a
transient inversion fault is injected if they are set to 01, a transient stuck-at-0 if they are set
to 10 and a transient stuck-at-1 if they are set to 11. The select lines are decided based upon
the distribution of the different types of faults which are to be injected into the architecture.

Figure 2.4 Fault injection methodology for RTL models

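The behaviour of the injection multiplexer can be summarised by a small Python model. The select encoding follows the description above; the function name inject_fault and the single-bit integer representation of the node are assumptions made for this sketch.

```python
def inject_fault(node_value: int, select: int) -> int:
    """Model of the 4-to-1 fault injection multiplexer for a single-bit node.

    select = 0b00: pass the unaltered signal through (fault-free run)
    select = 0b01: transient inversion of the signal
    select = 0b10: transient stuck-at-0
    select = 0b11: transient stuck-at-1
    """
    if node_value not in (0, 1):
        raise ValueError("node_value must be 0 or 1")
    if select == 0b00:
        return node_value
    if select == 0b01:
        return node_value ^ 1
    if select == 0b10:
        return 0
    if select == 0b11:
        return 1
    raise ValueError("select must be a 2-bit value")
```

A fault injection campaign would then draw the select value for each injection from the chosen fault type distribution and apply inject_fault at the selected node for the duration of the transient.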
Fault injection studies on RTL models of large systems require substantial simulation resources.
Hence, an equivalent, but pessimistic, fault injection strategy was devised on a smaller scale
RTL model and utilized for determining the fault coverage on SPEC2000 benchmarks. Chapter
5 presents further specific details on the fault injection methodology and equivalence.
2.7 Summary 
The chapter presented a brief overview of the proposed fault tolerance technique. It also
outlined the details of the work carried out towards proving the hypothesis that the `Control
Caching' technique would be effective in protecting microprocessor control logic against SEUs
and SETs. These include the development of a software tool to aid in the design phase, RTL
modification to implement the fault tolerance technique and evaluation of the performance of
the modified architecture.
CHAPTER 3. Control Cache Size Determination using Workload Profiling 
The instruction dependent static control caching technique can be implemented in an
architecture only after careful analysis of the type of workloads expected to be input to the
system. This can be done prior to the design stage and a software tool is developed for this
purpose. The tool can be integrated with the SimpleScalar tool set [28] to enable evaluation of
the architecture early in the design phase.
This chapter details the additions to the SimpleScalar tool set and simulation results for
various SPEC2000 benchmarks, assuming that the technique is to be implemented on an
architecture following the Alpha ISA. Further, a sample application is chosen to be run on the
RTL model of the OpenRISC 1200 CPU [29]. The cache size is determined using the developed
tool and is used in the RTL implementations detailed in the rest of the thesis.
3.1 SimpleScalar Modifications
The instruction dependent static control caching technique proposes the addition of a
cache-like structure in each pipeline stage of the processor. This cache can be modeled within
the SimpleScalar tool set. This is done by obtaining the trace of values of the program
counter register as the benchmark runs. The trace is then input to a separate procedure, whose
development is outlined below.
The designed cache structure is either fully associative or direct mapped in nature. For a
fully associative structure, the entire program counter value must be used to tag the cache.
This enables easy identification of the cache entry to use. Further, only the address of the
entry needs to be passed on to the succeeding pipeline stages, and not the program counter
value itself. The tag bits also need not be duplicated in the entries in the other pipeline
stages. In a direct mapped structure, only the bits of the PC which are not used to index into
the cache need to be stored.
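For the direct mapped case, the split of the PC into an index and a stored tag can be sketched in Python as follows. The sketch assumes word-aligned instructions of word_bytes bytes (4 by default) and indexes the cache by the low-order bits of the word address; the function name split_pc and these parameters are illustrative assumptions rather than details given in the thesis.

```python
from typing import Tuple


def split_pc(pc: int, num_entries: int, word_bytes: int = 4) -> Tuple[int, int]:
    """Split a PC into (index, tag) for a direct mapped control cache.

    The byte offset within an instruction word is dropped, the low-order
    part of the word address selects the entry, and only the remaining
    high-order part has to be stored as the tag.
    """
    word_address = pc // word_bytes
    index = word_address % num_entries
    tag = word_address // num_entries
    return index, tag


# Two PCs that are 512 instruction words apart share an entry in a
# 512 entry cache and are told apart only by their tags.
first = split_pc(0x160, 512)
second = split_pc(0x160 + 4 * 512, 512)
```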
These cache-like structures are modeled in a separate procedure wherein the
number of entries in the cache as well as the replacement policy to use (LRU or FIFO) can be
set by the user. The proposed technique offers protection against faults only when a particular
instruction is being executed for the third time or later, and the entries are still in the cache.
However, it can raise an alarm state in the processor to prevent it from committing instructions
from the second iteration onwards. The developed procedure keeps track of the number of such
instructions (the 'hits' in the cache) and also the total number of requests or trace entries. It is
a simple task, then, to determine the percentage of instructions for which the technique could
actually afford protection, and the percentage of instructions which went unprotected.
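A procedure of this kind can be approximated by the following Python sketch. It is not the tool developed in the thesis: it assumes that a 'hit' is any trace entry whose PC is already resident in the cache, that instructions are 4 bytes wide for direct mapped indexing, and that the function names and the toy trace are purely illustrative.

```python
from collections import OrderedDict
from typing import List, Optional, Sequence


def hit_rate_fully_associative(trace: Sequence[int], entries: int,
                               policy: str = "FIFO") -> float:
    """Percentage of trace entries that find their PC already cached."""
    if policy not in ("LRU", "FIFO"):
        raise ValueError("policy must be 'LRU' or 'FIFO'")
    cache: "OrderedDict[int, None]" = OrderedDict()   # oldest entry first
    hits = 0
    for pc in trace:
        if pc in cache:
            hits += 1
            if policy == "LRU":
                cache.move_to_end(pc)       # refresh recency on a hit
        else:
            if len(cache) >= entries:
                cache.popitem(last=False)   # evict the oldest entry
            cache[pc] = None
    return 100.0 * hits / len(trace) if trace else 0.0


def hit_rate_direct_mapped(trace: Sequence[int], entries: int,
                           word_bytes: int = 4) -> float:
    """Same measurement for a direct mapped structure indexed by word address."""
    lines: List[Optional[int]] = [None] * entries
    hits = 0
    for pc in trace:
        index = (pc // word_bytes) % entries
        if lines[index] == pc:
            hits += 1
        else:
            lines[index] = pc               # replace whatever was stored there
    return 100.0 * hits / len(trace) if trace else 0.0


# Toy trace: a three-instruction loop body executed three times.
toy_trace = [0x100, 0x104, 0x108] * 3
lru_small = hit_rate_fully_associative(toy_trace, 2, "LRU")
fifo_large = hit_rate_fully_associative(toy_trace, 4, "FIFO")
direct_small = hit_rate_direct_mapped(toy_trace, 2)
```

Sweeping the number of entries over the powers of two from 4 to 1024 for each organization on a real PC trace would produce a table of hit rates in the style of the results reported in this chapter.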
The details of the simulations performed and the results obtained are presented in the next 
section. 
3.2 Workload Profiling for SPEC2000 Benchmarks 
The SimpleScalar tool set supports two different ISAs, namely, the Alpha ISA and PISA.
For the purpose of evaluating the control cache size requirements, precompiled Alpha binaries
for six different integer benchmark programs were chosen from the SPEC2000 suite. The
SimpleScalar tool set was set to execute in the Alpha ISA mode and 1 billion instructions in
each of the benchmarks were processed after skipping the initialization segment to yield the
protection rate using the proposed technique.
Tables 3.1, 3.2, 3.3, 3.4, 3.5 and 3.6 present the percentage of executed instructions
which managed to make a 'hit' in the cache, or, in other words, were afforded some level of
protection by the control cache, for cache sizes from 4 to 1024 entries and LRU and FIFO
replacement policies for the fully associative structure and the direct mapped structure. A
graphical view of the same results can be obtained from Figures 3.1, 3.2, 3.3, 3.4, 3.5 and
3.6.

Table 3.1 Hit Rates for Direct Mapped and Fully Associative (LRU and
FIFO Replacement Policies) for Various Cache Sizes - bzip

Number of Cache Entries   LRU (%)   FIFO (%)   Direct Mapped (%)
4                         0.48      0.0003     0.021
8                         1.12      1.47       2.92
16                        2.41      26.70      29.09
32                        4.34      37.79      39.66
64                        8.18      47.96      54.46
128                       12.78     65.52      70.88
256                       19.95     93.06      82.91
512                       19.97     98.39      93.1
1024                      56.45     99.99      94.22

Figure 3.1 Variation in hit rates with cache structure and number of cache
entries - bzip
[Plot: percentage of protected instructions versus number of cache entries for the LRU, FIFO
and direct mapped structures]

Table 3.2 Hit Rates for Direct Mapped and Fully Associative (LRU and
FIFO Replacement Policies) for Various Cache Sizes - gzip

Number of Cache Entries   LRU (%)   FIFO (%)   Direct Mapped (%)
4                         2.12      1.29       1.32
8                         3.24      1.61       5.84
16                        5.21      9.14       11.19
32                        9.7       23.54      24.14
64                        18.69     40.99      46.03
128                       21.22     70.43      65.93
256                       21.22     86.79      73.83
512                       21.22     99.97      89.27
1024                      21.22     99.99      96.87

Figure 3.2 Variation in hit rates with cache structure and number of cache
entries - gzip
[Plot: percentage of protected instructions versus number of cache entries for the LRU, FIFO
and direct mapped structures]

Table 3.3 Hit Rates for Direct Mapped and Fully Associative (LRU and
FIFO Replacement Policies) for Various Cache Sizes - vpr

Number of Cache Entries   LRU (%)   FIFO (%)   Direct Mapped (%)
4                         0.86      1.03       1.07
8                         0.87      1.33       1.46
16                        0.88      2.57       5.34
32                        0.98      13.69      19.42
64                        1.07      29.18      32.15
128                       1.53      33.51      44.98
256                       2.26      70.54      57.14
512                       5.18      78         65.86
1024                      9.25      96.92      74.84

Figure 3.3 Variation in hit rates with cache structure and number of cache
entries - vpr
[Plot: percentage of protected instructions versus number of cache entries for the LRU, FIFO
and direct mapped structures]

Table 3.4 Hit Rates for Direct Mapped and Fully Associative (LRU and
FIFO Replacement Policies) for Various Cache Sizes - twolf

Number of Cache Entries   LRU (%)   FIFO (%)   Direct Mapped (%)
4                         3.78      2.81       2.81
8                         5.09      2.82       3.08
16                        7.19      4.25       3.87
32                        11.27     6.43       6.93
64                        26.59     12.07      15.1
128                       35.71     24.66      25.09
256                       53.29     42.98      39.93
512                       70.37     61.45      51.89
1024                      91.14     87.21      71.33

Figure 3.4 Variation in hit rates with cache structure and number of cache
entries - twolf
[Plot: percentage of protected instructions versus number of cache entries for the LRU, FIFO
and direct mapped structures]

Table 3.5 Hit Rates for Direct Mapped and Fully Associative (LRU and
FIFO Replacement Policies) for Various Cache Sizes - mcf

Number of Cache Entries   LRU (%)   FIFO (%)   Direct Mapped (%)
4                         2.25      2.2        2.2
8                         2.44      2.22       2.53
16                        2.7       4.24       3.82
32                        3.93      5.13       4.95
64                        14.51     10.39      14.37
128                       30        18.49      29.89
256                       45.02     42.73      41.92
512                       70.71     55.72      56.47
1024                      94.04     81.55      69.6

Figure 3.5 Variation in hit rates with cache structure and number of cache
entries - mcf
[Plot: percentage of protected instructions versus number of cache entries for the LRU, FIFO
and direct mapped structures]

Table 3.6 Hit Rates for Direct Mapped and Fully Associative (LRU and
FIFO Replacement Policies) for Various Cache Sizes - gcc

Number of Cache Entries   LRU (%)   FIFO (%)   Direct Mapped (%)
4                         2.39      2.44       2.8
8                         2.48      5.43       9.73
16                        2.69      25.22      24.42
32                        3.08      49.02      46.83
64                        3.63      56.62      56.38
128                       4.15      65.17      62.01
256                       4.52      72.69      70.27
512                       4.92      78.84      75.36
1024                      7.45      85.91      83.36

Figure 3.6 Variation in hit rates with cache structure and number of cache
entries - gcc
[Plot: percentage of protected instructions versus number of cache entries for the LRU, FIFO
and direct mapped structures]

The results suggest that the FIFO policy is the best replacement policy. In fact, in four out
of the six benchmarks considered, namely, bzip, gzip, vpr and gcc, it outperforms LRU in the
percentage of protected instruction executions by 70% on average. The LRU policy performs
better than FIFO for two benchmarks, twolf and mcf. However, the difference in performance
is not drastic, less than 15% in either case. The direct mapped structure performs consistently
well in all the benchmarks, and provides close competition to the FIFO structure. It can
be safely said, however, that the FIFO policy is the best suited replacement policy for the
control caching technique if a fully associative structure is used. Depending on the amount of
protection required, the number of cache entries can be decided. 512 entries for the control
cache appears to offer good protection for all the benchmarks evaluated.
A caveat to the above analysis is that the technique presented is heavily dependent on the 
ISA of the machine as well as the type of workloads. Some ISAs utilize a large number of 
instructions to perform basic tasks, which could in turn lead to a large number of instructions 
in each basic block. These types of programs will require a large number of entries in the control 
cache in order to ensure coverage or protection for the executed instructions. Another point 
to note is that the final coverage obtained would vary from workload to workload depending 
on the program and control flow structure. Programs which loop a large number of times are 
evidently offered more protection by the technique in contrast to programs with huge linear 
flow of control. 
3.3 Matrix Multiplication in OpenRISC 1200 - A Case Study 
The rest of the thesis details the implementation of the proposed scheme using the
RTL model of OpenRISC 1200, a processor suitable for embedded applications. The amount
of spatial redundancy required is determined under the assumption that the processor is to be
used in an application which involves a number of matrix operations. A program is written to
multiply two 64x64 matrices and compiled into the OpenRISC 1200 ISA. It is then executed
on the RTL model of the processor, and the trace of the program counter is dumped out,
much as was done with the SimpleScalar model. The obtained PC trace is input to the
software tool described earlier in this chapter and the results of the analysis are presented in
Table 3.7 and Figure 3.7.

Table 3.7 Hit Rates for Direct Mapped and Fully Associative (LRU and
FIFO Replacement Policies) for Various Cache Sizes - 64x64 matrix
multiplication program in the OpenRISC 1200 ISA

Number of Cache Entries   LRU (%)    FIFO (%)   Direct Mapped (%)
4                         0          0.004306   0.004306
8                         0          0.005383   0.097967
16                        0          0.125958   0.125958
32                        0          0.130264   0.125958
64                        0.002153   0.131341   0.1615
128                       0.128111   0.235768   0.235768
256                       20.5635    0.235768   54.8617
512                       97.46      99.39      99.39
1024                      99.39      99.39      99.39

Figure 3.7 Variation in hit rates with cache structure and number of cache
entries - 64x64 matrix multiplication in the OpenRISC ISA
[Plot: percentage of protected instructions versus number of cache entries for the LRU, FIFO
and direct mapped structures]

The results again suggest that a 512 entry control cache with a FIFO replacement policy or 
a 512 entry direct mapped structure would be a good choice for the OpenRISC 1200 processor 
running matrix operations. The next chapter details the modifications in the RTL model of 
the processor to incorporate a 512 entry control cache. One version is designed with a fully 
associative structure, while another is designed with a direct mapped structure. 
3.4 Summary 
The chapter detailed the development of an addition to the SimpleScalar tool set to deter-
mine the amount and type of spatial redundancy requirements for a given processor workload 
specification. It also presented the results from utilizing the developed tool for six different 
SPEC2000 benchmarks using the Alpha ISA and also a sample application on the OpenRISC 
1200 processor. It was concluded that a FIFO type or a direct mapped structure would be 
best suited for the distributed control cache for static instruction dependent control signals. 
CHAPTER 4. Control Caching Implementation in the OpenRISC 1200 
The fault tolerance techniques outlined in the previous chapters have to be tested by 
implementation on the RTL model of a microprocessor, in order to evaluate both effectiveness 
and performance. Towards this purpose, a processor from the OpenRISC 1000 family was 
chosen. This family consists of three sets of free RISC processor cores. The architectures 
are all 32/64-bit load and store designs with emphasis on simplicity, effective performance, 
scalability and low power requirements. This makes them ideal candidates for medium and 
high-performance networking, embedded, automotive and portable computing environments. 
In the rest of this chapter, we consider a specific implementation of the OpenRISC 1000 
family of processors, namely, the OpenRISC 1200 processor, and modify it to implement 
the fault tolerance techniques discussed in the earlier chapters. The chapter begins with a
discussion of the motivation for an RTL implementation of the proposed technique. An
overview of the OpenRISC 1200 architecture is then provided, followed by a closer look at the
control datapath of the design. Finally, the architectural modifications are detailed.
4.1 Motivation for RTL Implementation 
The fault tolerance techniques can be implemented as an extension to a high-level archi-
tecture simulator such as the SimpleScalar tool set. In fact, the amount of spatial redundancy 
requirement has been determined by modeling the cache-like structure in SimpleScalar, as was 
outlined in the previous chapter. However, this type of implementation has its shortcomings, 
as detailed below. 
• Techniques implemented in architecture simulators do not imply implementation feasibility
in silicon, which is the eventual goal.
• Architecture simulators are not developed enough to analyze specific VLSI metrics such 
as area overhead or cycle time penalties. 
• RTL models are the closest one can possibly get to the final architecture without going 
through the costly process of actual physical implementation on a chip, and studies 
carried out on this model will be the most accurate. 
The above considerations lead us to analysis of the OpenRISC 1200 architecture and RTL 
model described in the next section. 
4.2 OpenRISC 1200: An Architecture Overview 
The OpenRISC 1200 is a 32-bit scalar RISC processor with a Harvard microarchitecture.
There is a 5 stage integer pipeline, virtual memory support with the help of a memory man-
agement unit (MMU) and also some basic DSP capabilities. There are 53 distinct instruction 
opcodes in the ISA. 
There are the usual direct mapped data and instruction caches. Supplemental units include 
a debug unit, a tick timer, programmable interrupt controller and power management support. 
However, we are more concerned with the internal datapath of the CPU and the actual pipeline 
details, and the rest of this section details these features of the OpenRISC 1200. 
Figure 4.1 shows a high level view of the OpenRISC 1200 CPU. The main components
of the CPU are the instruction unit, the general purpose registers (GPRs), the load-store
unit, the integer pipeline, the multiply-and-accumulate (MAC) unit, the exception unit and
the system unit.
The instruction unit of the CPU implements the instruction fetch stage of the pipeline by
fetching instructions from the memory subsystem and dispatching them to the appropriate
execution unit. Conditional branch and unconditional jump instructions are also executed in
this stage. The architecture also provides for thirty-two 32-bit GPRs. The actual implementation
consists of two synchronous dual-port memories with capacities of 32 words by 32 bits per
word. The load-store unit handles the load/store instructions. It is designed to avoid stalls in
the master pipeline due to stalls in the memory subsystem. It provides for aligned accesses for
faster memory operations. The integer pipeline is the heart of the architecture. Implemented
instruction types include arithmetic operations, compare operations, logical operations and
rotate and shift instructions. All integer operations take one cycle to complete in the execution
stage of the pipeline, except for the multiply instruction which takes three cycles. The
division instruction is not implemented in hardware. The MAC unit is responsible for DSP
operations. The system unit implements all special purpose registers, and the exception unit
handles exceptions such as interrupts, internal errors, system calls and breakpoints.

[Figure 4.1: block diagram of the CPU/DSP core showing the Instruction Unit, Exceptions, System, GPRs, Integer Ex Pipeline, MAC Unit and Load/Store Unit, connected to the instruction MMU and cache and to the data MMU and cache]
Figure 4.1 High level view of the OR1200 CPU

The next section will deal with the control datapath of the CPU in great detail. 
4.3 Control Signals in the OpenRISC 1200 
The control signals are the most essential segment of any datapath. The entire operation 
of the datapath is coordinated by these signals. This section details the control signals in the 
OpenRISC 1200 processor. 
Table 4.1 lists all the control signals of the OpenRISC 1200 datapath and
classifies them as static or dynamic. The bit width of each control signal and
the pipeline stage in which it is utilized are also given. It can be seen that there are
53 control bits in total, spread over 22 distinct control signals. Of these, 44 bits are static in
nature, while 9 are dynamic. Of the 44 static bits, 25 are instruction dependent,
while 19 are opcode dependent. Thus, for the considered processor, 47% (25/53) of the control bits are
protected by the static instruction dependent control caching scheme, while 36% (19/53) are protected
by the static opcode dependent caching scheme. The remaining 17% (9/53) of the bits are protected
by the dynamic control protection scheme.
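
The bit counts quoted above can be cross-checked against Table 4.1 with a few lines of arithmetic. The sketch below is only a check of the totals. The 25/19 split of the static bits is taken from the text, since Table 4.1 does not record it per signal.

```python
# (signal, nature, width in bits) as listed in Table 4.1
SIGNALS = [
    ("branch_op", "dynamic", 3), ("rf_addrw", "static", 5),
    ("rf_addra", "static", 5), ("rf_addrb", "static", 5),
    ("rf_rda", "static", 1), ("rf_rdb", "static", 1),
    ("alu_op", "static", 4), ("mac_op", "static", 2),
    ("shrot_op", "static", 3), ("rfwb_op", "static", 3),
    ("sel_a", "static", 2), ("sel_b", "static", 2),
    ("lsu_op", "static", 4), ("comp_op", "static", 4),
    ("multicycle", "static", 2), ("sig_syscall", "dynamic", 1),
    ("sig_trap", "dynamic", 1), ("no_more_dslot", "dynamic", 1),
    ("ex_void", "dynamic", 1), ("id_macrc_op", "static", 1),
    ("ex_macrc_op", "dynamic", 1), ("except_illegal", "dynamic", 1),
]

static_bits = sum(width for _, nature, width in SIGNALS if nature == "static")
dynamic_bits = sum(width for _, nature, width in SIGNALS if nature == "dynamic")
total_bits = static_bits + dynamic_bits

# Split of the static bits as stated in the text (not derivable from the table).
INSTRUCTION_DEPENDENT_BITS = 25
OPCODE_DEPENDENT_BITS = 19

for label, bits in (
    ("static, instruction dependent", INSTRUCTION_DEPENDENT_BITS),
    ("static, opcode dependent", OPCODE_DEPENDENT_BITS),
    ("dynamic", dynamic_bits),
):
    print(f"{label}: {bits}/{total_bits} = {100 * bits / total_bits:.1f}%")
```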
The next section details the architectural modifications necessary to implement the fault 
tolerance technique. 
4.4 Architectural Modifications 
The RTL model of the OpenRISC 1200 is available along with a GNU toolchain set, which 
enables a designer to test out various modifications at different levels in the processor design 
flow, right from the architecture implementation to compiler modifications. In this thesis, we 
are concerned with modifying the RTL description to add new components to the architecture 
and integrating them with the already existing processor. After modifying the architecture, 
extensive verification was performed to ensure functional correctness. Synthesizability of the 
modified RTL model was also ensured. 
The modifications can be broadly divided into three categories. The first one involves the 
design of the opcode indexed distributed control store, which is fairly straightforward. The 
second involves the duplication of some components which are responsible for the dynamic 
control signals. The third one, and most complex of the three, involves the design of the dis-
tributed cache structure for the instruction dependent static control signals. This modification 
is tackled by dividing the task into two parts. The first one involves the design of a module 
to determine the location of the present instruction's control signals in the distributed control 
cache. The history of the control cache entry can also be ascertained in this module itself. 

Table 4.1 Control signals in the OpenRISC 1200 datapath

S.No.  Control Signal   Purpose                                                  Nature   Width  Utilization Stage
1.     branch_op        Branching related data                                   Dynamic  3      Decode
2.     rf_addrw         Register file write address                              Static   5      Writeback
3.     rf_addra         Register file read address A                             Static   5      Decode
4.     rf_addrb         Register file read address B                             Static   5      Decode
5.     rf_rda           Register file read enable A                              Static   1      Decode
6.     rf_rdb           Register file read enable B                              Static   1      Decode
7.     alu_op           ALU operation                                            Static   4      Execute
8.     mac_op           MAC unit operation                                       Static   2      Execute
9.     shrot_op         Shift/Rotate unit operation                              Static   3      Execute
10.    rfwb_op          Register file writeback related data                     Static   3      Writeback
11.    sel_a            ALU input A selector                                     Static   2      Execute
12.    sel_b            ALU input B selector                                     Static   2      Execute
13.    lsu_op           Load/Store unit operation                                Static   4      Memory Access
14.    comp_op          Comparator operation                                     Static   4      Execute
15.    multicycle       Number of cycles in execute stage                        Static   2      Execute
16.    sig_syscall      System call signal                                       Dynamic  1      NA
17.    sig_trap         Trap signal                                              Dynamic  1      NA
18.    no_more_dslot    Delay slot data dependent upon branch                    Dynamic  1      NA
19.    ex_void          Voids the execution depending on hazards                 Dynamic  1      NA
20.    id_macrc_op      MAC operation related data                               Static   1      Execute
21.    ex_macrc_op      MAC operation related data dependent upon memory stalls  Dynamic  1      Execute
22.    except_illegal   Exception signal                                         Dynamic  1      NA

The second part is the design and placement of the distributed control cache. The
following two subsections detail each of these parts.
4.4.1 Control Caching Address and History Generation 
As outlined towards the end of the last chapter, there are two options for the type of 
distributed cache to utilize. One is to use a fully associative structure with FIFO replacement 
policy, while the other is to use a direct mapped structure. While the benchmarks seem to
indicate that the FIFO structure would give better coverage than the direct
mapped structure, it is evident that the resource usage of the former would be higher than
that of the latter. Hence, there is a trade-off involved in the decision, the extent of which can be
determined only by RTL implementation. Fortunately, the segment of this module which 
determines the history of the cache entry need not be modified in the two different structures. 
The module for generating the control caching address and history needs to be placed in 
the instruction fetch stage of the pipeline. This is because the actual address of the instruction 
to be fetched is also available in this stage. The same address can be used to index into the 
created module, which can then output the control cache address and the history status for 
the instruction at that particular address. 
Figure 4.2 provides a high level view of a fully associative cache like structure for the 
purpose of control caching address generation. The FIFO structure implementation utilizes a 
Content Addressable Memory (CAM) module in which the entries are arranged in a circular 
queue fashion. A register holds the address of the next location to write to. Once an address 
comes in (which would be the address of the instruction being fetched), the CAM gives details 
as to whether there was a hit, and if so, the address of the hit. If there is a hit, the output 
address is then passed on as the control cache address. Otherwise, the address of the next 
location to write to, which was stored in a register as described earlier, is sent out as the control 
cache address. In this case, the history bits output also indicate that there is no history for 
the current address. This address in the register is also used to write into the CAM, and the 
data to write to would be the input address. The advantage of the FIFO structure is that
all the available storage locations can be taken advantage of, while the disadvantage lies in the
requirement of a CAM module, which is quite costly. Also, all the bits of the input address
need to be loaded into each CAM entry.

[Figure 4.2: block diagram. A Content Addressable Memory organized as a circular queue takes WriteEnable, InputAddress [29:0] (also used as WriteData) and WriteAddress [8:0], and produces Hit and HitAddress [8:0]. A Queue Rear Address register with its update logic supplies QueueRearAddress [8:0]. The module output is ControlCacheAddress [8:0]. Update rule: at the beginning of each cycle, if the reference in the previous cycle is a miss and the input address has changed, Queue Rear Address <- (Queue Rear Address + 1) mod 512.]
Figure 4.2 A 512 entry CAM based FIFO replaced control cache address
generation module

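
The behaviour of the CAM based FIFO module can be sketched in software as below. This is a simplified behavioural model and not the RTL of this thesis: the class and method names are hypothetical, each call to `lookup` stands for one new fetch address (so the queue rear is advanced immediately on a miss rather than at the start of the next cycle), and a Python dictionary stands in for the parallel CAM match.

```python
class FifoCamAddressGenerator:
    """Behavioural sketch of a CAM based, FIFO replaced address generator."""

    def __init__(self, entries=512):
        self.entries = entries
        self.location_of = {}              # stored address -> CAM location
        self.contents = [None] * entries   # CAM location -> stored address
        self.rear = 0                      # queue rear: next location to write

    def lookup(self, address):
        """Return (hit, control_cache_address) for one fetched address."""
        if address in self.location_of:
            return True, self.location_of[address]
        # Miss: the entry at the queue rear is overwritten with the new address.
        location = self.rear
        evicted = self.contents[location]
        if evicted is not None:
            del self.location_of[evicted]
        self.contents[location] = address
        self.location_of[address] = location
        self.rear = (self.rear + 1) % self.entries
        return False, location


# Small 4 entry instance so that replacement is visible.
gen = FifoCamAddressGenerator(entries=4)
for addr in (0x40, 0x41, 0x40, 0x42, 0x43, 0x44, 0x40):
    print(hex(addr), gen.lookup(addr))
```

In the example, address 0x40 hits while it is resident, is evicted once a fifth distinct address arrives, and is then re-installed at the next queue location.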
Figure 4.3 shows the working of a direct mapped cache-like structure implementing the
same requirements as the FIFO CAM structure described above. The direct mapped structure
utilizes some of the least significant bits of the input address to index into the distributed
control cache. To prevent aliasing, a memory module stores the tag bits (all the bits of the
input address other than the least significant bits used for the index). The tag bits from
the indexed location of the memory module are then compared with the tag bits from the
input address. The information as to whether there is a hit or not enables the determination
of the history of the control cache entry. If there is no hit, a signal indicating the
absence of history is sent to the succeeding stages. Otherwise, the appropriate state bits are
transferred to the output. The advantage of a direct mapped structure lies in its use of
very few resources. Not only do fewer bits need to be stored for matching purposes,
the number of comparators needed is also greatly reduced in comparison to the FIFO CAM
structure. The disadvantage lies in the fact that not all available locations can be used for
storage. Also, there is a possibility of thrashing if alternating sets of address references map
to the same locations.

[Figure 4.3: block diagram. Index <- InputAddress [8:0]; InputTag <- InputAddress [29:9]. A tag storage memory is addressed by Index, with InputTag as WriteData and a WriteEnable input; the stored Tag [Index] is compared with InputTag. The module output is ControlCacheAddress [8:0].]
Figure 4.3 A 512 entry direct mapped control cache address generation
module

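
A minimal behavioural sketch of the direct mapped lookup, including the thrashing case mentioned above, is given here. The class and method names are hypothetical, and the model treats the input as the 30-bit word address of Figure 4.3 with the low 9 bits as index.

```python
class DirectMappedAddressGenerator:
    """Behavioural sketch of a direct mapped control cache address generator."""

    def __init__(self, index_bits=9):
        self.index_bits = index_bits
        self.tags = [None] * (1 << index_bits)  # one stored tag per location

    def lookup(self, address):
        """Return (hit, control_cache_address) for one fetched address."""
        index = address & ((1 << self.index_bits) - 1)  # InputAddress[8:0]
        tag = address >> self.index_bits                # InputAddress[29:9]
        hit = self.tags[index] == tag
        if not hit:
            self.tags[index] = tag  # a miss replaces the stored tag
        return hit, index


gen = DirectMappedAddressGenerator()
print(gen.lookup(5))        # first reference: miss at location 5
print(gen.lookup(5))        # same address again: hit
# Two addresses that differ only in the tag bits map to the same location.
for addr in (5 + 512, 5, 5 + 512, 5):
    print(addr, gen.lookup(addr))   # alternating references keep missing
```

The last loop shows thrashing: two addresses 512 words apart keep evicting each other even though the other 511 locations are unused.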
The finite state machine (FSM) drawn up for the maintenance of the history details of 
each cache entry is shown in Figure 4.4. It must be noted that the FSM lies dormant till 
its corresponding address is activated by a write in the cache entry. On the receipt of a reset
signal, the cache entry sets itself in the NO_HISTORY state. When a new address is input,
the fact that there is no history available is sent out in the same cycle. However, as the entry
registers itself in the module, the fact that its first iteration is underway is indicated by the
update of the entry at the writable address to ONE_HISTORY. It is very much possible that
the fetch address remains unchanged for multiple cycles, due to stalls or other reasons.
For this reason, a register holds the address output in the previous cycle. A hit which was
not present in an earlier cycle causes an update in the state, provided the output address is not
the same as the previously stored address. The saturation of the state machine in any of the
states is prevented by the fact that the cache entry can always be replaced by a newly input 
address. 
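
One possible reading of this state machine is sketched below. The transition function is an interpretation of the prose and of the labels in Figure 4.4, not a transcription of the RTL. The numeric encoding of the states is also an assumption, chosen so that only TWO_HISTORY has bit 1 set.

```python
from enum import Enum


class History(Enum):
    NO_HISTORY = 0   # state after reset
    ONE_HISTORY = 1  # first iteration of the instruction is underway
    TWO_HISTORY = 2  # at least two earlier executions have been recorded


def next_state(state, hit, address_changed):
    """Next history state of the cache entry selected in this cycle.

    The state reported to the pipeline in a cycle is the one held before
    this update, so a newly written entry still reports no history.
    """
    if not hit:
        # The location is (re)written with a new address.
        return History.ONE_HISTORY
    if not address_changed:
        # Fetch address held over several cycles (for example a stall).
        return state
    if state is History.ONE_HISTORY:
        return History.TWO_HISTORY
    # TWO_HISTORY is kept until the entry is replaced. A hit in NO_HISTORY
    # should not occur, so the state is simply held in that case.
    return state


state = History.NO_HISTORY
# (hit, address_changed): first fetch, a stall, second fetch, third fetch
for hit, changed in [(False, True), (True, False), (True, True), (True, True)]:
    state = next_state(state, hit, changed)
    print(state.name)
```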
[Figure 4.4: state diagram with states NO_HISTORY, ONE_HISTORY and TWO_HISTORY; transition labels: Reset; Cache location replaced with new entry; Cache hit, and change on input address; Cache hit, but no change on input address]
Figure 4.4 The FSM drawn up for tracking the history of each cache entry 
The placement of the control caching address and history generation module in the instruc-
tion fetch stage of the pipeline results in a lengthening of the critical path and an operating 
frequency penalty of more than 16%. In order to reduce the operating frequency penalty, it was
decided to internally pipeline the instruction fetch stage. In the altered design, the instruction 
fetch stage is divided into two cycles, one for the tag and hit generation, and another for the 
tag and history update. This resulted in an increase in the latency of the processor, but the 
throughput remains the same. 
4.4.2 Control Cache Structure 
The control cache entry address and history generated above are passed down the pipeline 
and utilized in each stage. Each control signal bit to be protected has a structure similar to the 
one outlined in Figure 4.5 placed just before the point of utilization in its own pipeline stage. 
The structure conceptually consists of two planes of memory bits organized in a fashion to 
facilitate flow of a particular bit from the top to the bottom plane. This flow occurs whenever 
the control signal for the instruction at that particular cache address location is written. The
protection is offered by reading the two sets of bits at the particular address, combining it with
the input bits and using a majority function gate shown in Figure 4.6 to generate a new set of
control bits. Depending on the history bits for that particular instruction, either the original
control signal or the newly generated control bits are utilized in the datapath. A signal can
be generated from each structure which could flag an alarm signal to the commit logic of the
processor if the incoming control signal and the one on the top plane do not match with each
other. However, this has not been implemented in the OpenRISC 1200 core due to the absence
of specialized commit logic.

[Figure 4.5: block diagram. Two memory planes, CopyA and CopyB, each with 512 locations (000H to 1FFH), are addressed by ControlCacheAddress [8:0]. Inputs: ControlCacheAddress [8:0], History [1:0], PresentControlBits [3:0]. At the beginning of each cycle, subject to an active write enable: CopyA [ControlCacheAddress] <- PresentControlBits and CopyB [ControlCacheAddress] <- CopyA [ControlCacheAddress]. CopyA [ControlCacheAddress] [3:0], CopyB [ControlCacheAddress] [3:0] and PresentControlBits [3:0] feed the Majority Function Module; History [1] selects the output ProtectedPresentControlBits [3:0].]
Figure 4.5 A 512 entry control caching structure for a 4-bit control signal

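
The two-plane structure and the majority vote can be modelled in a few lines. The sketch below is a behavioural illustration under stated assumptions: the class and parameter names are hypothetical, the `two_history` flag stands for the History [1] select of Figure 4.5, and the two plane updates are applied with the values held before the write, as in simultaneous register transfers.

```python
def majority(a, b, c):
    """Bitwise majority of three equally wide bit vectors."""
    return (a & b) | (b & c) | (a & c)


class ControlCache:
    """Behavioural sketch of the two-plane control cache for one signal."""

    def __init__(self, entries=512):
        self.copy_a = [0] * entries  # top plane: most recent control bits
        self.copy_b = [0] * entries  # bottom plane: bits from the time before

    def access(self, address, present_bits, two_history, write_enable=True):
        """Return the control bits to use in the datapath for this access."""
        voted = majority(self.copy_a[address], self.copy_b[address], present_bits)
        # Without two earlier copies the vote is meaningless, so the
        # incoming bits are passed through unchanged.
        protected = voted if two_history else present_bits
        if write_enable:
            # Bits flow from the top plane to the bottom plane on a write.
            self.copy_b[address] = self.copy_a[address]
            self.copy_a[address] = present_bits
        return protected


cache = ControlCache()
addr = 0x12
cache.access(addr, 0b1010, two_history=False)   # first execution
cache.access(addr, 0b1010, two_history=False)   # second execution
# Third execution with an injected inversion fault on bit 1 of the signal.
print(bin(cache.access(addr, 0b1000, two_history=True)))
# Fourth execution: the faulty copy now sits in the top plane but is outvoted.
print(bin(cache.access(addr, 0b1010, two_history=True)))
```

In this model a single upset is masked in the cycle it occurs, and the corrupted copy it leaves in the top plane is outvoted on the next execution.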
[Figure 4.6: gate-level circuit with inputs In0, In1 and In2 and output Majority (In0, In1, In2)]
Figure 4.6 Majority function implementation using universal gates
4.5 Summary
The chapter presented a brief introduction to the OpenRISC 1000 family of processor cores.
This was followed by a presentation of the motivation behind RTL implementations of fault
tolerance techniques, and then an overview of the architecture of the processor on which the
proposed fault tolerance technique was implemented. Particular focus was put on the control
datapath of the OpenRISC 1200 processor in the next section, and it was shown that the
technique could offer protection for all the control signals. The last section dealt with the
architecture design for the proposed technique in detail, first with a discussion of the address
and history generation, followed by the structure of the distributed control cache.
CHAPTER 5. Fault Injection Strategies for Control Caching Evaluation
Fault injection studies are of utmost importance in evaluating the effectiveness of any fault 
tolerance strategy. They are useful in not only determining the fault coverage of the proposed 
technique, but also in ensuring that the modifications made to the architectures achieve the 
desired objectives. While extensive verification may prove that the architecture works on 
expected lines in the presence of the additional hardware, it can't prove that the architecture 
manages to protect the processor against faults, unless fault injection studies are carried out. 
This chapter presents the fault injection model considered for the OpenRISC 1200 processor.
It then proposes a simulation strategy for the RTL model. The need for an
equivalent RTL model for fault injection is brought out. The fault injection on the RTL
model is described and its fault masking behavior is shown to be a subset of the behavior of
the equivalent RTL model.
5.1 Fault Injection Model 
The fault injection model is dependent on the fault model for which the fault tolerant 
hardware is designed. The control caching technique assumes a SEU model for the SER.
In this case, at any particular instant of time, only one particle hit can happen, and thus,
only one particular signal can be affected. However, the duration of the upset can extend 
over multiple clock cycles. The frequency with which these upsets occur is dependent on the 
environment in which the processor is placed. The duration is dependent on the clock frequency 
of the design under consideration. 
A characteristic of SEUs is that they are random events and thus may occur at unpre-
dictable times. In this thesis, we focus on the fault model called upset or transient bit-flip. 
Even though a SEU can cause transient stuck-at-0 or stuck-at-1 faults as outlined earlier, their
probability is much lower than that of the transient bit-flip. Further, transient bit-flips are the
easiest to check for effects in fault injection simulations since they closely match real faulty
behavior [30].
The next section presents details of the simulation strategy utilizing the fault model de-
scribed above. 
5.2 Simulation Strategy for RTL Models 
Fault injection strategies for RTL models need to consider the aspects mentioned below. 
• Distribution of faults in the time domain 
• Number of faults to inject in the simulation 
• Nature of faults to inject 
• Architecture locations to inject faults in the RTL model 
All the above aspects are directly dependent on the fault injection model under consideration. 
The distribution of faults in the time domain is random in nature, as it is a characteristic 
of SEUs. The number of SEU faults to be injected in a given simulation is dependent on a 
variety of factors such as the considered hadron flux [31], the estimated vulnerable area of the 
chip, the probability that a hadron hit would cause a SEU and the duration of the simulation. 
Typical values of the number of faults to inject in a simulation would be around 1 in every 10 
million simulation clock cycles. Equation 5.1 gives an expression for the number of faults to 
inject in a particular RTL simulation. 
\[ F_{Num} = \frac{C \, p \, A \, N_f}{f_{op}} \]    (5.1)
where
F_{Num} is the number of faults to inject in the simulation,
C is the number of clock cycles of simulation,
f_{op} is the frequency of the simulation clock in Hz,
p is the probability that a hadron hit is effective in causing a SEU,
N_f is the hadron flux in hits/cm^2/second and
A is the estimated vulnerable area of the chip in cm^2.
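
Equation 5.1 is simply the expected number of effective hits during the simulated time C / f_op. A short sketch of the calculation follows. The function name and every numeric value in the example are hypothetical placeholders, chosen only so that the result lands at the "about 1 fault per 10 million cycles" order of magnitude mentioned above. They are not measured data.

```python
def faults_to_inject(cycles, clock_hz, seu_probability, flux, area_cm2):
    """Evaluate Equation 5.1.

    cycles          -- C, number of simulated clock cycles
    clock_hz        -- f_op, simulation clock frequency in Hz
    seu_probability -- p, probability that a hadron hit causes a SEU
    flux            -- N_f, hadron flux in hits/cm^2/second
    area_cm2        -- A, estimated vulnerable chip area in cm^2
    """
    simulated_seconds = cycles / clock_hz
    return seu_probability * flux * area_cm2 * simulated_seconds


# Placeholder values for illustration only (not physical data).
print(faults_to_inject(
    cycles=10_000_000,
    clock_hz=100e6,
    seu_probability=0.1,
    flux=100.0,
    area_cm2=1.0,
))
```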
As mentioned in the previous section, it is best to inject inversion faults, although provision 
can be easily made in the simulations to account for transient stuck-at faults also. The most 
important decision would then be the locations to inject faults in. It would be tempting to
inject faults in a large number of locations. However, the fault tolerance technique
to be evaluated needs to be kept in mind. Suppose a new type of ECC technique were to
be considered for evaluation. It would make sense to inject faults in the memory segment of 
the system and test the technique's effectiveness, rather than the datapath or control. This is 
all the more important because RTL simulations of huge systems are very resource intensive. 
Hence, the fault injection locations must be judiciously chosen. 
5.3 Fault Injection in the OpenRISC 1200 
The OpenRISC 1200 processor involves a lot of HDL modeling, but we are in essence 
concerned only with the core datapath of the processor. This involves the processor's integer 
pipeline and the instruction fetch unit. Further, the proposed fault tolerance strategy involves 
the control signals of the datapath, and hence, to take maximum advantage of the fault injection 
simulation resources, injected faults are targeted only at the control lines. From the earlier 
chapter detailing the OpenRISC 1200 architecture, 53 control signal lines were identified for 
fault injection. The details of the simulations are presented in the following subsection. 
5.3.1 Fault Injection on RTL Model 
There are two fault tolerant architectures designed for evaluation, one with the CAM 
FIFO structure for the static instruction dependent control cache, and the other with the 
direct mapped structure. It was observed that the CAM structure took much more time for 
simulation in comparison with the direct mapped structure. Both the architectures with the 
fault tolerant additions were simulated to ensure functional correctness. Since each simulation
took a huge amount of time to complete, and a justified fault coverage value can only be
obtained with a reasonable number of simulations, fault locations in the time domain were
targeted to occur when the signals were in their vulnerable state. Multiple control signals, but
not all the 53, were targeted; this subsection details the simulations on the writeback
control signal for the register file as a case in point.

[Figure 5.1: simulation waveform]
Figure 5.1 Fault injection causing program error despite Control Caching

The writeback signal for the register file is needed for operation in the writeback stage 
of the pipeline. A faultinject signal was used to activate an inversion fault on this line in
a particular cycle number during which the signal is active. Figure 5.1 shows a simulation
waveform in which the fault has been activated in the execution of an R-type operation in the
first iteration of a software loop. As expected, the history state for the instruction is not set
to a value which would offer protection. Hence, the register file does not get updated with the 
necessary value, and so the execution is termed to be faulty. On the other hand, Figure 5.2 
shows the same instruction being targeted some iterations later. In this case, the history is 
set to a protection enabled state, and the majority function module ensures that the control 
signal finally delivered to the register file is the correct value despite the injected fault. 
The above process was repeated for many control signals at various points in the time 
domain. Sometimes, the faults in the control signals get automatically masked (like the effect
of a fault on the shrot_op signal in a load instruction), and hence the processor doesn't fall
into a faulty state even though the scheme did not have enough history available to offer protection.
It can be intuitively seen that the scheme offers protection in a particular cycle if the history
state for that instruction is set to TWO_HISTORY, and it can flag an alarm for a processor's
commit stage if the history state is ONE_HISTORY. It must be noted, however, that the 
history state not being TWO_HISTORY doesn't mean that the processor will automatically 
fall into a faulty state in the presence of a fault. Hence, it can be said that the fault coverage 
obtained by the consideration of the history state of the instruction in the present pipeline stage 
and current cycle would be a superset of the fault coverage obtained by actually checking for 
the state of the processor after every fault injection simulation. Due to the simulation overhead 
involved, it would be better to go ahead with a pessimistic estimate of the fault coverage by
consideration of the history state signal. The next subsection details the simulation of the
history generation RTL model to determine the fault coverage.

[Figure 5.2: simulation waveform]
Figure 5.2 Fault injection protected by Control Caching scheme

5.3.2 Fault Injection on History Generation RTL Model 
There are two different RTL models available for the generation of the history state signal 
for the current instruction. One of them is the FIFO based CAM module, while the other is 
the direct mapped structure. The history state output of the module is related to the address 
currently input. The address trace of some of the SPEC 2000 benchmarks for the Alpha ISA, 
generated for determining the nature and amount of spatial redundancy outlined earlier, was 
input in the testbench for the RTL model. One hundred random fault injection cycles were chosen
and the history state was recorded in each cycle. A history state of TWO_HISTORY was 
termed to afford complete masking, while a history state of ONE_HISTORY was termed to 
offer fault detection but no masking. The simulation of the history generation RTL model was 
carried out for both 512 and 1024 entries. 
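
The sampling procedure described above can be illustrated in software. The sketch below is not the testbench used in this thesis. It assumes a hypothetical per-cycle list of history state names, samples random cycles from it, and classifies each sample the way the text does. The function name, the seed and the synthetic state stream are made up for the example.

```python
import random
from collections import Counter


def estimate_coverage(history_per_cycle, samples=100, seed=0):
    """Sample random cycles and classify them by recorded history state.

    Returns percentages of sampled cycles that are masked (TWO_HISTORY),
    detected only (ONE_HISTORY) or unprotected (anything else).
    """
    rng = random.Random(seed)
    cycles = rng.sample(range(len(history_per_cycle)), samples)
    counts = Counter(history_per_cycle[c] for c in cycles)
    label = {"TWO_HISTORY": "masked", "ONE_HISTORY": "detected"}
    result = {"masked": 0.0, "detected": 0.0, "unprotected": 0.0}
    for state, n in counts.items():
        result[label.get(state, "unprotected")] += 100.0 * n / samples
    return result


# Synthetic state stream for illustration only.
stream = ["NO_HISTORY"] * 100 + ["ONE_HISTORY"] * 200 + ["TWO_HISTORY"] * 700
print(estimate_coverage(stream))
```

With only 100 samples per run, the resulting percentages carry noticeable sampling noise, which is worth keeping in mind when comparing configurations.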
The next chapter details the results obtained by the simulation of the history generation 
RTL models. They are found to closely correlate with the results obtained from the simulation 
of the SimpleScalar additional modules. 
5.4 Summary 
The chapter presented a brief overview of fault injection strategies for RTL models and 
discussed them in particular relevance to the OpenRISC 1200 processor. The motivation to 
obtain equivalent RTL models for the purpose of evaluation of the fault coverage metrics was
also presented, along with the methodology of evaluation.
CHAPTER 6. Architecture and Performance Evaluation 
The control caching structure provides fault tolerance at the cost of some overhead in the 
form of extra storage requirements. The extent of extra storage requirements must be justified 
and the various metrics must be evaluated. The previous chapters outlined the development 
of the technique, and its implementation as well as steps taken for evaluation. The evaluation 
of the proposed technique can be broadly based on two categories, namely, the fault tolerance 
metrics and the VLSI metrics. This chapter presents the theoretical analysis of the technique's 
metrics and provides results from the implementation carried out. 
6.1 Fault Tolerance Metrics 
The most important fault tolerance metric is fault coverage. It refers to the percentage of 
instructions which are actually protected against faults. Other metrics include the percentage 
of false positives, where the technique flags an error when there is none. This section details 
the various fault tolerance metrics for the control caching techniques. 
6.1.1 Fault Coverage 
The fault coverage metric is first analyzed with respect to the instruction dependent sta-
tic control signal control caching technique. This technique provides fault coverage only for 
instructions which are executed more than once. Faults are completely masked only if the 
instruction is executed more than two times. Thus, if there are N executions of a loop, com-
plete fault masking is provided for (N — 2) iterations, and the technique can detect an error 
for any one of (N — 1) iterations. In case of a specialized commit logic in the processor, and 
with compiler support, the technique can be extended to detect an error in any iteration of 
a loop. However, this is an ideal situation in which there are an infinite number of locations 
available to cache the control signals. The finite size and the type of the caching structure mean
that the fault coverage actually provided is lower than this ideal value. Further,
the extent of looping and size of the loops in terms of number of instructions in the considered 
program would definitely play a very big role in the fault coverage obtained. 
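
The ideal-case figures above amount to simple fractions of the N executions. The helper below, with a hypothetical name, computes them under the stated assumption of an unbounded control cache.

```python
def ideal_loop_coverage(n):
    """Ideal masking and detection fractions for n executions of a loop.

    Assumes an unbounded control cache, so no entry is ever replaced:
    (n - 2) executions are fully masked and (n - 1) are detectable.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    masked = max(n - 2, 0)
    detectable = max(n - 1, 0)
    return masked / n, detectable / n


for n in (1, 2, 10, 100):
    masked, detectable = ideal_loop_coverage(n)
    print(f"N={n}: masked {masked:.0%}, detectable {detectable:.0%}")
```

Both fractions approach 100% as N grows, so in the ideal case the cost of the two unprotected executions matters mainly for short loops.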
The previous chapter outlined the fault injection methodology on the RTL model and also 
explained how the fault coverage figures were obtained. Four different control caching config-
urations were considered as detailed below. The RTL models of each of these structures were 
simulated and the SPEC 2000 benchmark traces were input to determine the fault coverage 
with 100 random fault injection time points. 
• 512 entry CAM based FIFO structure 
• 512 entry direct mapped structure 
• 1024 entry CAM based FIFO structure 
• 1024 entry direct mapped structure 
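The difference between the two structures can be sketched with a toy trace-driven model. This is only an illustration under stated assumptions and not the RTL model used for the reported results: addresses are word addresses, the history of an entry is modelled as a counter saturating at two stored copies, no faults are injected, and all function names are hypothetical.

```python
from collections import OrderedDict


def classify(history: int) -> str:
    # history = number of earlier copies of this instruction's signals in the cache
    if history >= 2:
        return "mask"
    if history == 1:
        return "detect"
    return "none"


def simulate_direct_mapped(trace, entries):
    """Direct mapped: the low address bits select the entry, the rest form the tag."""
    tags = [None] * entries
    hist = [0] * entries
    counts = {"mask": 0, "detect": 0, "none": 0}
    for addr in trace:
        index, tag = addr % entries, addr // entries
        if tags[index] != tag:               # miss: the previous occupant is evicted
            tags[index], hist[index] = tag, 0
        counts[classify(hist[index])] += 1
        hist[index] = min(hist[index] + 1, 2)  # store this execution's signals
    return counts


def simulate_fifo(trace, entries):
    """Fully associative lookup with first-in, first-out replacement."""
    cache = OrderedDict()                    # insertion order doubles as FIFO order
    counts = {"mask": 0, "detect": 0, "none": 0}
    for addr in trace:
        if addr not in cache:
            if len(cache) >= entries:
                cache.popitem(last=False)    # evict the oldest inserted entry
            cache[addr] = 0
        counts[classify(cache[addr])] += 1
        cache[addr] = min(cache[addr] + 1, 2)
    return counts


conflict_trace = [0, 4] * 5            # two instructions that share one direct mapped entry
long_loop = [0, 1, 2, 3, 4] * 3        # loop body one instruction larger than the cache
print(simulate_direct_mapped(conflict_trace, 4), simulate_fifo(conflict_trace, 4))
print(simulate_direct_mapped(long_loop, 4), simulate_fifo(long_loop, 4))
```

In this toy model the direct mapped structure loses all coverage when two instructions alias to the same entry, while the FIFO structure loses all coverage when a loop body is slightly larger than the cache, because every entry is evicted just before it is reused.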
[Chart: "Fully Associative FIFO Replacement Control Caching - 512 Entries". Vertical axis: Percentage of Instructions (0% to 100%). Horizontal axis: Benchmark (gzip, bzip, gcc, twolf, vpr, mcf). Series: Complete Detection & Masking; Fault Detection, No Masking; No Detection.]
Figure 6.1 Fault coverage in a 512 entry CAM based FIFO replaced control
cache
Figure 6.1 presents the results obtained for a 512 entry CAM based FIFO structure for the
six benchmarks considered. The bzip benchmark shows the highest complete fault masking, with
71% of the instructions protected, while mcf shows the lowest, with only 31%. Averaged over
all the benchmarks, complete fault masking was provided for 52% of the instructions.
Considering fault detection alone, gzip shows the highest coverage, with faults detectable
in 99% of the instructions, while mcf again shows the lowest, with only 46% of instructions
secured. Averaged over all the benchmarks, fault detection was provided for 79% of the
instructions.
[Chart: "Direct Mapped Control Caching - 512 Entries". Vertical axis: Percentage of Instructions (0% to 100%). Horizontal axis: Benchmark (gzip, bzip, gcc, twolf, vpr, mcf). Series: Complete Detection & Masking; Fault Detection, No Masking; No Detection.]
Figure 6.2 Fault coverage in a 512 entry direct mapped control cache
Figure 6.2 presents the results obtained for a 512 entry direct mapped structure for the six
benchmarks considered. The bzip benchmark shows the highest complete fault masking, with 91%
of the instructions protected, while twolf shows the lowest, with 38%. Averaged over all the
benchmarks, complete fault masking was provided for 63% of the instructions. Considering
fault detection alone, bzip shows the highest coverage, with faults detectable in 94% of the
instructions, while twolf shows the lowest, with only 52% of instructions secured. Averaged
over all the benchmarks, fault detection was provided for 72.5% of the instructions.
[Chart: "Fully Associative FIFO Replacement Control Caching - 1024 Entries". Vertical axis: Percentage of Instructions (0% to 100%). Horizontal axis: Benchmark (gzip, bzip, gcc, twolf, vpr, mcf). Series: Complete Detection & Masking; Fault Detection, No Masking; No Detection.]
Figure 6.3 Fault coverage in a 1024 entry CAM based FIFO replaced control
cache
Figure 6.3 presents the results obtained for a 1024 entry CAM based FIFO structure for the
six benchmarks considered. The gzip benchmark shows the highest complete fault masking, with
68% of the instructions protected, while mcf shows the lowest, with 45%. Averaged over all
the benchmarks, complete fault masking was provided for 59% of the instructions. Considering
fault detection alone, bzip shows the highest coverage, with faults detectable in 99% of the
instructions, while mcf shows the lowest, with only 82% of instructions secured. Averaged
over all the benchmarks, fault detection was provided for 91.5% of the instructions.
[Chart: "Direct Mapped Control Caching - 1024 Entries". Vertical axis: Percentage of Instructions (0% to 100%). Horizontal axis: Benchmark (gzip, bzip, gcc, twolf, vpr, mcf). Series: Complete Detection & Masking; Fault Detection, No Masking; No Detection.]
Figure 6.4 Fault coverage in a 1024 entry direct mapped control cache
Figure 6.4 presents the results obtained for a 1024 entry direct mapped structure for the
six benchmarks considered. The bzip benchmark shows the highest complete fault masking, with
92% of the instructions protected, while mcf shows the lowest, with 49%. Averaged over all
the benchmarks, complete fault masking was provided for 69% of the instructions. Considering
fault detection alone, gzip shows the highest coverage, with faults detectable in 96% of the
instructions, while mcf shows the lowest, with only 71% of instructions secured. Averaged
over all the benchmarks, fault detection was provided for 82% of the instructions.
Taking into consideration both the masking requirement as well as the fault detection 
requirements, a 1024 entry direct mapped structure appears to be a good trade-off, particularly 
when considering the hardware overhead of the CAM-based FIFO structure, discussed in a later 
subsection. 
It must be noted that the above values represent a lower bound on the fault coverage. In
practice, the generally high locality of program execution and the natural masking of
faults would make the effective fault coverage much higher.
For the opcode dependent control caching and dynamic control signal protection schemes,
there is no fault masking, but fault detection and correction are provided for all
instruction executions.
Because the control signals of the Alpha and of the OpenRISC processor (on which the RTL
implementation outlined in the next section is performed) are similar in nature, and both
processors are RISC designs, it is fair to assume that the relative proportions of the
different types of control signals are approximately the same. Under this assumption,
complete fault masking is available for 83% of all control signals (if opcode dependent
control signals are also cached along with the instruction dependent control signals in the
instruction dependent control caching scheme) for 69% of instruction executions on average.
Fault detection with correction covering all the control signals follows from the fact that
opcode dependent control caching and dynamic control protection are available for 100% of
all instruction executions for 53% of all control signals, while the remaining 47% of the
control signals are protected for 82% of all instruction executions. Let $N_{Opcode}$ and
$N_{Dynamic}$ denote the percentages of opcode dependent and dynamic control signals,
$N_{InstrnDep}$ the percentage of instruction dependent control signals,
$NumInstr_{OpcodeDynamicDep}$ the percentage of instructions for which the opcode dependent
and dynamic control signals are protected, and $NumInstr_{InstrnDep}$ the percentage of
instructions for which the instruction dependent control signals are protected. The net
fault coverage $F_{coverage}$ can then be computed as shown in Equation 6.1.
$$F_{coverage} = \frac{(N_{Opcode} + N_{Dynamic})\,NumInstr_{OpcodeDynamicDep} + N_{InstrnDep}\,NumInstr_{InstrnDep}}{100} = \frac{53 \times 100 + 47 \times 82}{100} = 92.07\% \tag{6.1}$$
Note that the rounded percentages substituted above evaluate to 91.54%, which differs from
the reported 92.07% by about half a percentage point.
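The weighted sum in Equation 6.1 can be checked with a few lines of code. The sketch below is only an illustration: the function name is hypothetical, and the inputs are the rounded percentages quoted in the text (53% of control signals protected in 100% of executions, 47% protected in 82% of executions).

```python
def net_fault_coverage(groups):
    """Weighted fault coverage.

    groups: list of (share_of_control_signals_pct, protected_executions_pct) pairs.
    """
    total_share = sum(share for share, _ in groups)
    if abs(total_share - 100.0) > 1e-9:
        raise ValueError("signal shares must sum to 100%")
    return sum(share * protected for share, protected in groups) / 100.0


coverage = net_fault_coverage([(53, 100), (47, 82)])
print(f"{coverage:.2f}%")  # 91.54% with the rounded inputs
```

With these rounded inputs the result is 91.54%, slightly below the 92.07% reported with Equation 6.1; both values round to the 92% quoted elsewhere in the thesis.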
6.1.2 Miscellaneous Metrics 
Most fault tolerant architectures occasionally raise an alert when there is no cause for
alarm. This is termed a 'false positive'. In the control caching technique, there is no
scope for a false positive. There is, however, a possibility that the same control signal
of the same instruction is affected in two consecutive iterations of a loop. In this case,
the first iteration would be protected by the signals stored earlier, but the following
iterations would be affected until the erroneous signal is overwritten in the history.
Given the low probability of SEUs, this situation is very unlikely.
Another related metric of interest is fault sensitivity. The proposed technique does not
alter the fault sensitivity of any of the nodes. However, if some of the nodes are found to
have a higher fault sensitivity, the technique can be applied to those nodes with a higher
priority.
6.2 VLSI Metrics 
All fault tolerance techniques have associated overheads. The control caching technique
also involves overheads in the form of the extra storage required for the cache. These
memory bits consume some power and can also affect the critical path of the design. This
section presents a detailed analysis of the overheads associated with the static control
signal protection techniques.
6.2.1 Area Overhead 
The control caching technique involves the placement of a distributed cache like structure 
throughout the pipeline of the processor. Denoting the number of control bits to be protected 
by N and the number of control cache entries by K, we can determine the extra amount of 
storage required. The storage essentially is divided into three segments, the major contri-
bution coming from the cache itself. The other components of the overhead are the storage 
requirements for the tag bits and the history bits. 
The fact that a TMR structure is used for the control cache indicates that the number of
memory elements required for the control cache, $M_{ControlCache}$, is given by Equation 6.2.
$$M_{ControlCache} = 2KN \tag{6.2}$$
The number of memory elements required for the storage of the history, $M_{HistoryState}$,
is given by Equation 6.3.
$$M_{HistoryState} = 2K \tag{6.3}$$
The number of memory elements required for the storage of the tag bits in the direct mapped
structure, $M_{TagBitsDM}$, is given by Equation 6.4.
$$M_{TagBitsDM} = K(30 - \log_2 K) \tag{6.4}$$
On the other hand, the number of memory elements required for the storage of the tag bits in
the FIFO structure, $M_{TagBitsFIFO}$, is given by Equation 6.5.
$$M_{TagBitsFIFO} = 30K \tag{6.5}$$
Considering the opcode dependent control caching next and denoting the number of opcode
dependent control signals by $N_{op}$ and the number of distinct opcodes to index into the
cache by $K_{op}$, the memory overhead without the ECC is given by Equation 6.6.
$$M_{OpcodeControlCache} = K_{op} N_{op} \tag{6.6}$$
The complete memory overhead for the direct mapped structure is given by Equation 6.7
and is the sum of the individual overheads.
$$M_{OverheadDM} = M_{ControlCache} + M_{HistoryState} + M_{TagBitsDM} + M_{OpcodeControlCache} = K(2(N + 1) + 30 - \log_2 K) + K_{op} N_{op} \tag{6.7}$$
The complete memory overhead for the FIFO structure is given by Equation 6.8 and is the
sum of the individual overheads.
$$M_{OverheadFIFO} = M_{ControlCache} + M_{HistoryState} + M_{TagBitsFIFO} + M_{OpcodeControlCache} = K(2(N + 1) + 30) + K_{op} N_{op} \tag{6.8}$$
Analysis of the OpenRISC 1200 reveals that there are 53 distinct opcode combinations. 
Using Equation 6.7, it can be determined that a 512 entry direct mapped structure for control 
caching in the OpenRISC 1200 would require 37.5 kbits, while a 1024 entry cache based on 
the same configuration would take up 73 kbits. On the other hand, for a CAM based FIFO, 
Equation 6.8 suggests that a 512 entry structure would require 42 kbits of memory, while 1024 
entries would require 83 kbits. In comparison to the huge sizes of modern processor caches, 
the area contributed by less than a hundred kilobits will not be significant. 
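The storage expressions of Equations 6.2 to 6.8 can be evaluated with a small helper. This is a sketch under stated assumptions: the function name is hypothetical, and the values used below for the number of protected control bits (N = 25) and opcode dependent signals (N_op = 16) are placeholders chosen only for illustration, since this section does not state them; K_op = 53 is the opcode count quoted above.

```python
def overhead_bits(K, N, K_op, N_op, structure):
    """Total extra memory elements for a K-entry control cache.

    structure: "direct_mapped" or "fifo".
    """
    if K <= 0 or K & (K - 1):
        raise ValueError("K must be a power of two")
    log2_K = K.bit_length() - 1           # exact log2 for powers of two
    control_cache = 2 * K * N             # two stored copies of N control bits per entry
    history = 2 * K                       # two history bits per entry
    if structure == "direct_mapped":
        tag = K * (30 - log2_K)           # index bits are implied by the position
    elif structure == "fifo":
        tag = 30 * K                      # fully associative: full tag is stored
    else:
        raise ValueError("unknown structure")
    opcode_cache = K_op * N_op
    return control_cache + history + tag + opcode_cache


# N and N_op are illustrative placeholders, not values taken from the thesis.
for K in (512, 1024):
    dm = overhead_bits(K, N=25, K_op=53, N_op=16, structure="direct_mapped")
    fifo = overhead_bits(K, N=25, K_op=53, N_op=16, structure="fifo")
    print(K, (fifo - dm) / 1024, "kbit extra for FIFO tags")
```

Whatever the values of N and N_op, the two structures differ only by the K log2 K index bits saved in the direct mapped tags: 4608 bits (4.5 kbit) for 512 entries and 10240 bits (10 kbit) for 1024 entries. Taking 1 kbit as 1024 bits, this agrees with the differences between the quoted totals (42 versus 37.5 kbit, and 83 versus 73 kbit).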
6.2.2 Operating Frequency Penalty 
The additional operations in the proposed scheme mostly take place in parallel with the
existing operations of the datapath. Hence, it is highly likely that most of the additions
do not lie on the critical path of the design. Analysis of the RTL implementation revealed,
however, that the addition of the history and tag tracking mechanism in the fetch stage
reduced the operating frequency. To counter this, the fetch stage was internally pipelined
so that every instruction spends two cycles in it, with the tag tracking and update designed
to execute over those two cycles. This affects the latency, but not the throughput, of the
processor.
On the other hand, if the protected control signals are needed at the beginning of a
pipeline stage to proceed with the operations, some delay is added to the path, as discussed
below. The discussion refers to the instruction dependent control caching scheme only; a
similar discussion applies to the opcode control caching scheme. First, the control cache
must be accessed and the stored history read. The time for this is denoted by
$t_{CCacheAccess}$. This data is input to the majority function module, which has at most
two gate delays, denoted by $t_{MajorityFunction}$. There is a multiplexing delay of
$t_{Mux}$ involved in selecting whether the output of the majority function module or the
original signal itself is used. Thus, a possible overhead of $t_{CCacheOverhead}$, given by
Equation 6.9, is added to the critical path timing.
$$t_{CCacheOverhead} = t_{CCacheAccess} + t_{MajorityFunction} + t_{Mux} \tag{6.9}$$
In most processors, these delays are not as significant as the delays involved in the rest 
of the operations performed in the pipeline stages. In the case that these delays severely 
hamper the operating frequency of the design, the control cache segment of each signal can be 
moved to the stage prior to the pipeline stage in which it is being utilized. Thus, the delays 
involved would be masked by the operations in the previous stage. However, the control signal 
is exposed to SEUs at the pipeline register storing the protected value. 
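The majority function and multiplexer on this path can be sketched in software as follows. The sketch models control signals as integer bit vectors; the function names and the `use_vote` select input are hypothetical, and the condition that drives the select line (assumed here to be "two stored copies are available") is an assumption rather than a detail taken from the RTL.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: one level of AND gates followed by one OR level."""
    return (a & b) | (b & c) | (a & c)


def protected_signal(current: int, stored_1: int, stored_2: int, use_vote: bool) -> int:
    # Multiplexer: forward the voted value when history is usable,
    # otherwise pass the freshly generated signal through unchanged.
    return majority(current, stored_1, stored_2) if use_vote else current


golden = 0b1001
upset = golden ^ 0b0010                    # SEU flips one bit of the current signal
print(bin(protected_signal(upset, golden, golden, use_vote=True)))
print(bin(protected_signal(golden, golden ^ 0b0100, golden, use_vote=True)))
```

Both calls return the golden value: the vote masks a single upset regardless of whether it hits the current signal or one of the stored copies, which matches the two-gate-delay majority structure described above.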
6.2.3 FPGA Synthesis Results 
Three different RTL models were synthesized for a Xilinx Virtex FPGA using the Xilinx ISE
7.1 synthesis tool. Table 6.1 summarizes the results. Compared with the non-fault tolerant
version of the processor, the slice utilization goes up by 19% for the CAM based FIFO model
and by 16% for the direct mapped model. The flip-flop usage goes up by 10% for the CAM based
structure, while it is only 3% more for the direct mapped structure. The penalty in
operating frequency is just 6% in the direct mapped case, while it is as much as 21% in the
CAM based FIFO case. This is because large CAMs cannot be implemented efficiently in FPGAs;
high performance CAMs are possible in ASIC implementations. It appears that, in spite of
being theoretically superior, CAM based FIFO replaced control caches are less suitable for
practical implementation than direct mapped structures.
Table 6.1 FPGA Synthesis Results for OpenRISC 1200
RTL Model (Fault tolerance hardware)    Gate Count    Operating frequency
OR1200 (No fault tolerance)             1,140,805     83.15 MHz
OR1200 (1024 entry CAM based FIFO)      1,610,570     66.02 MHz
OR1200 (1024 entry direct mapped)       1,544,827     78.07 MHz
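The relative changes implied by Table 6.1 can be re-derived directly from the tabulated gate counts and frequencies. The sketch below uses only the table values; the dictionary layout and function name are hypothetical. Note that the gate count is a different measure from the slice and flip-flop utilization figures quoted in the text, so its relative increase is not expected to match them.

```python
BASELINE = {"gates": 1_140_805, "mhz": 83.15}
VARIANTS = {
    "1024 entry CAM based FIFO": {"gates": 1_610_570, "mhz": 66.02},
    "1024 entry direct mapped": {"gates": 1_544_827, "mhz": 78.07},
}


def relative_change_pct(new: float, base: float) -> float:
    """Signed change of `new` relative to `base`, in percent."""
    return (new - base) / base * 100.0


for name, result in VARIANTS.items():
    gates = relative_change_pct(result["gates"], BASELINE["gates"])
    freq = relative_change_pct(result["mhz"], BASELINE["mhz"])
    print(f"{name}: gate count {gates:+.1f}%, operating frequency {freq:+.1f}%")
```

The frequency changes come out at about -20.6% and -6.1%, consistent with the 21% and 6% penalties stated in the text.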
6.3 Summary 
This chapter presented an analysis of the various metrics for evaluating the effectiveness
of the control caching technique. The fault tolerance metrics were discussed first, and
the fault coverage results were detailed. This was followed by a theoretical analysis of the
area and cycle time overhead of the technique. The chapter concluded with the presentation
of the FPGA synthesis results.
CHAPTER 7. Summary and Future Work
An architectural technique, 'Control Caching', has been outlined in the presented research. 
It provides fault tolerance for one of the most difficult to protect segments of the microprocessor, 
namely, the control lines, against SEUs. This chapter summarizes the results and compares the 
scheme with related work. An outline of possible future work is detailed, and the conclusions 
of the thesis are discussed. 
7.1 Related Schemes and Comparisons 
A fault-tolerant architecture for the control logic of small microcontrollers is described in
[22]. It assumes that the control logic is implemented as an FSM, and implements a coding
scheme for protecting the state registers. This type of scheme is not applicable to the
control logic of large, complex microprocessors.
Another scheme suggested is the duplication of the microcode ROM unit, or at least seg-
ments of it, and a lookup based on the opcode of the instructions. However, this scheme suffers 
from some fundamental shortcomings in the sense that it is unable to protect control signals 
of present day RISC processors where the control signals can vary dependent upon the data 
dependencies. For example, the select line of the multiplexers at the input of the ALU can 
choose the forwarding logic output in case of a data dependency. These types of program 
dependent control signals are not captured by the suggested scheme, which fails to identify 
the temporal aspects of the program when concentrating on the spatial redundancy aspect of 
fault tolerance. 
The area of fault tolerance for control logic in microprocessors is dealt with in detail in
[21]. The technique outlined in that paper also relies on programs running in loops.
However, the scheme suffers from shortcomings in the following respects.
• The XOR operation involved in [21] could possibly cancel out errors in two control signals
of the same instruction in different pipeline stages. Thus, a comparison against the stored
signature in the signature cache would yield no mismatch when, in fact, the execution
of the instruction is possibly incorrect.
• A single upset in the signature cache can cause all the executions of the instruction
corresponding to that signature entry to flag an error when there is no problem with
the execution. This implies that the logic responsible for the re-execution of the
instructions will be triggered unnecessarily.
• A memory bit upset in this technique will never be eliminated unless the control cache
entry is replaced. Thus, for a loop executing, say, 1000 times, if a memory bit upset occurs
in the first few iterations, there is no scope of recovery for the rest of the iterations.
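The first shortcoming can be illustrated with a small sketch. The exact signature construction of [21] is not described here, so the sketch assumes, purely for illustration, that the signature is the bitwise XOR of the control words of one instruction across its pipeline stages; the function name and the sample control words are hypothetical.

```python
from functools import reduce
from operator import xor


def xor_signature(stage_words):
    """Fold the per-stage control words of one instruction into a single signature."""
    return reduce(xor, stage_words, 0)


golden = [0b1010, 0b0110, 0b0011]          # fault-free control words, one per stage
flip = 0b0100
faulty = [golden[0] ^ flip, golden[1] ^ flip, golden[2]]   # same bit upset in two stages

# The two upsets cancel in the XOR, so the signatures are identical.
print(xor_signature(golden) == xor_signature(faulty))

# Comparing each signal independently exposes both corrupted stages.
print([stage for stage, (g, f) in enumerate(zip(golden, faulty)) if g != f])
```

Under this assumption the signature comparison reports no mismatch although two stages carry wrong control signals, whereas a per-signal comparison identifies both.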
The `Control Caching' technique can counter the above anomalies at the cost of some extra 
storage requirements. 
• Errors in multiple control signals are not coupled together, since the technique affords 
protection for them independent of each other. 
• SEUs in the extra memory bits are also protected in the scheme, because the entire 
scheme works on the assumption that at least two of the three bits involved in the TMR 
module are correct. 
• There is absolutely no possibility of the technique flagging more than two false positives,
because the cache entries corresponding to a particular instruction get updated every
iteration, and memory bit upsets in the cache get overwritten.
• Given the frequency of SEUs, the proposed technique will ensure that there is no need 
for executing instructions again, while [21] will detect an error and flag the necessity for 
re-execution irrespective of the point of occurrence of the fault in the time domain. Thus, 
penalty in the time domain is almost non-existent in the proposed technique. 
7.2 Further Work
There are quite a number of avenues to extend the solutions provided in this thesis, as 
detailed below. 
• The presented technique will require some minor architectural modifications for imple-
mentation on multi-issue superscalar processors. 
• Recovery schemes can be tried out with compiler support for processors involving spe-
cialized commit stages. 
• Set associative caching structures can be explored to determine if higher fault coverage 
is provided. 
7.3 Conclusions 
A technique comprised of three component schemes has been presented to protect the 
control signals of the datapath of a microprocessor, after classifying each of them as being either 
static or dynamic in nature. Dynamic control signals are protected by selective duplication of 
datapath components. The scheme to protect static control signals involves the placement of a 
distributed cache to store the control signals of the instructions. Instruction dependent control 
signals are protected if they belong to instructions being executed as part of looping structures. 
Opcode dependent control signals are protected in all iterations. A direct mapped structure for 
the cache has been found to be most effective taking both the coverage and resource utilization 
into consideration. The technique was implemented on the OpenRISC 1200 processor and 
FPGA synthesis was performed. Fault coverage metrics for both 512 and 1024 cache entries 
have been determined, and the technique has been shown to protect 92% of all instruction 
executions with minimal area and cycle time overheads. A host of advantages are offered by 
the proposed technique in comparison with the previous related work. The implementation of
the technique is simple, and the overall overhead is lower since penalty cycles are avoided
due to the absence of the need for instruction re-execution. The components of the scheme are
also self-correcting. Transient faults of longer durations are also handled effectively since the
cache entry of a particular instruction is updated only once per loop iteration. The scheme can
also handle errors in the control signals of the same instruction in various pipeline stages. The
various avenues available for extending the technique have also been analyzed and discussed.
BIBLIOGRAPHY 
[1] R.C. Baumann. Soft errors in commercial semiconductor technology: Overview and scaling
trends. In the IEEE 2002 Reliability Physics Tutorial Notes, Reliability Fundamentals, pp.
121.01.1-121.01.11, April 2002.
[2] Premkishore Shivakumar, Michael Kistler, Stephen W. Keckler, Doug Burger, and Lorenzo
Alvisi. Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational
Logic. In the Proc. of the International Conference on Dependable Systems and Networks
2002, Bethesda, MD, pp. 389-398, June 2002.
[3] M. Rebaudengo, M. Sonza Reorda, M. Violante, Ph. Cheynet, B. Nicolescu, and R. Velazco.
Coping with SEUs/SETs in Microprocessors by means of Low-Cost Solutions: A
comparative study and Experimental results. In the IEEE Transactions On Nuclear Science,
Vol. 49, No. 3, pp. 1491-1495, June 2002.
[4] P. Hazucha, C. Svensson, and S.A. Wender. Cosmic-Ray Soft Error Rate Characterisation
of a Standard 0.6µm CMOS Process. In the IEEE Journal of Solid-State Circuits,
Vol. 35, No. 10, pp. 1422-1429, 2000.
[5] Y. Tosaka, H. Kanata, S. Satoh, and T. Itakura. Simple Method for Estimating
Neutron-Induced Soft Error Rates Based on Modified BGR Model. In the IEEE Electron Device
Letters, Vol. 20, No. 2, pp. 89-91, 1999.
[6] C.L. Chen, and M.Y. Hsiao. Error-correcting Codes for Semiconductor Memory Applications:
A State of the Art Review. Reliable Computer Systems - Design and Evaluation,
pp. 771-786, Digital Press, 2nd edition, 1992.
[7] D. Mavis, and P. Eaton. Soft Error Rate Mitigation Techniques for Modern Microcircuits.
In the Proc. of the 40th Annual Reliability Physics Symposium, Dallas, TX, pp. 216-225,
April 2002.
[8] P. Cheynet, B. Nicolescu, R. Velazco, M. Rebaudengo, M. Sonza Reorda, and M. Violante.
Experimentally Evaluating an Automatic Approach for Generating Safety-Critical Software
with Respect to Transient Errors. In the IEEE Transactions on Nuclear Science,
Vol. 47, No. 6, pp. 2231-2236, December 2000.
[9] P.P. Shirvani, N. Saxena, and E.J. McCluskey. Software Implemented EDAC Protection
Against SEUs. In the IEEE Transactions on Reliability, Vol. 49, No. 3, pp. 273-284,
September 2000.
[10] A. Benso, S. Chiusano, P. Prinetto, and L. Tagliaferri. A C/C++ source-to-source
compiler for dependable applications. In the Proc. of the International Conference on
Dependable Systems and Networks 2000, New York, NY, pp. 71-78, June 2000.
[11] K.H. Huang, and A.J. Abraham. Algorithm Based Fault-Tolerance for Matrix Operations.
In the IEEE Transactions on Computers, Vol. 33, pp. 518-528, December 1984.
[12] Z. Alkhalifa, V.S.S. Nair, N. Krishnamurthy, and A.J. Abraham. Design and Evaluation of
System-level Checks for On-Line Control Flow Error Detection. In the IEEE Transactions
on Parallel and Distributed Systems, Vol. 10, No. 6, pp. 627-641, June 1999.
[13] G.A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D.I. August. SWIFT: Software
Implemented Fault Tolerance. In the Proc. of the 3rd International Symposium on Code
Generation and Optimization (CGO), San Jose, CA, pp. 243-254, March 2005.
[14] T.J. Slegel, et al. IBM's S/390 G5 microprocessor design. In the IEEE Micro, Vol. 19,
No. 2, pp. 12-23, March/April 1999.
[15] Compaq Computer Corporation. Data integrity for Compaq Nonstop Himalaya servers.
1999.
[16] E. Rotenberg. AR-SMT: A Microarchitectural Approach to Fault Tolerance in
Microprocessors. In the Proc. of the 29th Annual International Symposium on Fault-Tolerant
Computing, Madison, WI, pp. 84-91, June 1999.
[17] Todd M. Austin. DIVA: A Reliable Substrate for Deep Submicron Microarchitecture
Design. In the Proc. of the 32nd Annual International Symposium on Microarchitecture,
Haifa, Israel, pp. 196-207, November 1999.
[18] Seongwoo Kim, and Arun K. Somani. SSD: An Affordable Fault Tolerant Architecture for
Superscalar Processors. In the Proc. of the 2001 Pacific Rim International Symposium on
Dependable Computing, Seoul, Korea, pp. 27-34, December 2001.
[19] J.B. Nickel, and A.K. Somani. REESE: a method of soft error detection in microprocessors.
In the Proc. of the International Conference on Dependable Systems and Networks 2001,
Göteborg, Sweden, pp. 401-410, July 2001.
[20] Joydeep Ray, J.C. Hoe, and Babak Falsafi. Dual use of superscalar datapath for
transient-fault detection and recovery. In the Proc. of the 34th Annual International
Symposium on Microarchitecture, Austin, TX, pp. 214-224, December 2001.
[21] Seongwoo Kim, and Arun K. Somani. On-Line Integrity Monitoring of Microprocessor
Control Logic. In the Proc. of the 19th IEEE International Conference on Computer
Design, Austin, TX, pp. 011t-0~1, September 2001.
[22] E. Cota, F. Lima, S. Rezgui, L. Carro, R. Velazco, M. Lubaszewski, and R. Reis. Synthesis
of an 8051-Like Micro-Controller Tolerant to Transient Faults. In the Springer Science
Journal of Electronic Testing, Vol. 17, Issue 2, pp. 149-161, April 2001.
[23] D. A. Patterson and J. L. Hennessy. Computer Architecture, A Quantitative Approach.
Morgan Kaufmann, 1996.
[24] J. Karlsson, P. Folkesson, et al. Application of Three Physical Fault Injection Techniques
to Experimental Assessment of the MARS Architecture. In the Proc. of the 5th IFIP
Working Conference on Dependable Computing, DCCA-5, Urbana-Champaign, IL, pp.
150-153, September 1995.
[25] Seongwoo Kim, and Arun K. Somani. Soft Error Sensitivity Characterization for
Microprocessor Dependability Enhancement Strategy. In the Proc. of the International
Conference on Dependable Systems and Networks 2002, Bethesda, MD, pp. /16-/~~8, June
2002.
[26] Pradip A. Thaker, Vishwani D. Agrawal, and Mona E. Zaghloul. Register-Transfer Level
Fault Modeling and Test Evaluation Techniques for VLSI Circuits. In the Proc. of the
International Test Conference 2000, Atlantic City, NJ, pp. 910-9.~9, October 2000.
[27] Charles R. Yount, and Daniel P. Siewiorek. A Methodology for the Rapid Injection of
Transient Hardware Errors. In the IEEE Transactions on Computers, Vol. 45, Issue 8,
pp. 881-891, August 1996.
[28] D. Burger and T. Austin. The Simplescalar tool set, version 2.0. Technical Report 1342,
University of Wisconsin, Madison. Computer Science Dept., 1997.
[29] D. Lampret, OpenRISC 1200 Specification, OPENCORES.ORG., URL:
http://www.opencores.org/cvsget.cgi/orlk/or1200 (Date retrieved: 20th October
2005)
[30] M. Rebaudengo, M. Sonza Reorda, and M. Violante. Analysis of SEU Effects in a
Pipelined Processor. In the Proc. of The 8th IEEE International On-Line Testing
Workshop (IOLTW'02), Isle of Bendor, France, pp. 112-116, July 2002.
[31] C. Constantinescu. Neutron SER Characterization of Microprocessors. In the Proc. of the
International Conference on Dependable Systems and Networks 2005, Yokohama, Japan,
pp. 754-759, June 2005.
ACKNOWLEDGEMENTS 
I would like to gratefully acknowledge my advisor, Professor Arun Somani, for his support 
and encouragement during my graduate studies. I was always inspired by his enthusiasm for 
the field. His questions and observations provided valuable guidance for the research presented 
here. I would also like to thank Professor Akhilesh Tyagi and Professor Gurupur Prabhu for 
kindly agreeing to serve on my thesis committee. I am also grateful to Professor Manimaran 
Govindarasu for his comments and suggestions when the thesis work was still in its infancy. 
This research was partially supported by NSF grant number 0311061 and the Jerry R. 
Junkins Endowment at Iowa State University. I would like to thank NSF and Iowa State 
University for the support. 
The work presented in this thesis would not be in its final shape if it were not for my 
colleague, Viswanathan Subramanian, in my research group. I am indebted to him for his 
invaluable help in simulating and debugging the design. I am also thankful to Mikel Bezdek 
and Mercado Ramon for their help during the initial segment of the work. I also thank 
Mahadevan Gomathisankaran and Natarajan Viswanathan for their help in taking this thesis 
closer to perfection. 
Finally, I would like to thank all my friends who have been by my side during my stay at 
Ames. In no particular order, my thanks go to Veerendra Allada, Swamy Ponpandi, Srikanth 
Kirthivasan, Anantharaman Kalyanaraman, Srinivas Neginhal, Samarth Shetty, Srivatsan
Balasubramanian, Kavitha Balasubramanian, Vidya Iyer, Satyadev Nandakumar and Anupreet
Kaur for their support and encouragement which made it possible for me to complete and 
defend my thesis with confidence. 
