Soft error sensitivity characterization for microprocessor dependability enhancement strategy by Kim, Seongwoo & Somani, Arun K.
Electrical and Computer Engineering 
Conference Papers, Posters and Presentations Electrical and Computer Engineering 
2002 
Soft error sensitivity characterization for microprocessor 
dependability enhancement strategy 
Seongwoo Kim 
Iowa State University 
Arun K. Somani 
Iowa State University, arun@iastate.edu 
Recommended Citation 
Kim, Seongwoo and Somani, Arun K., "Soft error sensitivity characterization for microprocessor 
dependability enhancement strategy" (2002). Electrical and Computer Engineering Conference Papers, 
Posters and Presentations. 176. 
https://lib.dr.iastate.edu/ece_conf/176 
Abstract 
This paper presents an empirical investigation on the soft error sensitivity (SES) of microprocessors, 
using the picoJava-II as an example, through software simulated fault injections in its RTL model. Soft 
errors are generated under a realistic fault model during program run-time. The SES of a processor logic 
block is defined as the probability that a soft error in the block causes the processor to behave 
erroneously or enter into an incorrect architectural state. The SES is measured at the functional block 
level. We have found that highly error-sensitive blocks are common for various workloads. At the same 
time soft errors in many other logic blocks rarely affect the computation integrity. Our results show that a 
reasonable prediction of the SES is possible by deduction from the processor's microarchitecture. We 
also demonstrate that the sensitivity-based integrity checking strategy can be an efficient way to improve 
fault coverage per unit redundancy. 
Keywords 
microprocessor chips, circuit reliability, fault tolerant computing 
Disciplines 
Digital Communications and Networking | Systems and Communications 
Comments 
This is a manuscript of a proceeding published as Kim, Seongwoo, and Arun K. Somani. "Soft error 
sensitivity characterization for microprocessor dependability enhancement strategy." In Proceedings 
International Conference on Dependable Systems and Networks, pp. 416-425. IEEE, 2002. DOI: 10.1109/
DSN.2002.1028927. Posted with permission. 
Soft Error Sensitivity Characterization for Microprocessor Dependability
Enhancement Strategy
Seongwoo Kim and Arun K. Somani
Department of Electrical and Computer Engineering
Iowa State University, Ames, IA 50010-3060, USA, {skim,arun}@iastate.edu
1. Introduction
Microprocessors are used in a variety of applications
from small calculators to multi-million dollar servers. As
we become more dependent upon microprocessor-based
systems, increasing attention is paid not only to the processors’
computing throughput but also to their dependability. Even
if a microprocessor is shipped with no design errors or man-
ufacturing defects, unstable environmental conditions can
generate temporary hardware failures. These failures, called
transient faults, cause the processor to malfunction during
operation time. The major sources of transient faults are
electro-magnetic interference, power jitter, alpha particles,
and cosmic rays. Studies in [1], [2], [3] show that the vast
majority of detected errors originate from transient faults.
An incorrect state of a processor component manifesting
the transient fault is referred to as a soft error. The soft
error can propagate to other components. Even a single-bit
error may eventually lead the processor to computation fail-
ure. Therefore, managing the soft error is a critical problem
to solve to fully realize dependable computing.
Advances in VLSI technology have reduced circuit di-
mensions and improved processor performance dramati-
cally, yet these advances are offset by an increased vulner-
ability to soft errors. Careful design practices, new fabrica-
tion materials with better fault-immunity and circuit-level
techniques may lower the soft error rate, but they are not sufficient
to prevent the errors from occurring. External disturbances
cannot be completely blocked by enhanced shielding
and packaging. Therefore, we attempt to design the proces-
sors to be tolerant of harmful impacts caused by soft errors.
Unfortunately, research in the area of processor dependabil-
ity has been generally limited to a few critical applications.
A wide choice of fault tolerant processors should be made
available for diverse applications.
Understanding the processor’s behavior in the presence
of soft errors has a fundamental value in devising fault tol-
erant techniques. Fault injection methodologies can be used
for that purpose [4], [5]. Faults are intentionally created in
the processor with special software and/or hardware tools
and the operations are monitored. Ideally, the criteria of de-
termining the protection requirement for a processor should
include soft error rate and actual failure observation anal-
ysis, but mostly they are approximated. Even though rea-
sonably accurate information is available to microprocessor
manufacturers, it is generally not available to researchers for
a long time. Soft error studies on commercial products have
been rare in academia because of limited access to detailed
processor models and/or experimental equipment. Therefore,
it is not easy to find comprehensive knowledge of soft
error characteristics for various microprocessors.
This paper presents a case study of soft error characterization
using picoJava™-II, which is a microprocessor core
developed by Sun Microsystems. The register transfer level
(RTL) model of the picoJava-II became publicly available in
1999 to enable different groups of researchers to study, ex-
tend, and improve this commercial product. Our major ef-
fort here is to gain a good insight into the picoJava-II core’s
behavior in a faulty environment. We employ a software
simulated fault injection method and observe how susceptible
the core is to transient faults while executing programs.
We believe that these results will be useful for fault
tolerant microprocessor design. We also identify and characterize
dominant factors that affect the processor’s sensitivity
to the faults. This kind of investigation yields an
understanding that designers can use to significantly
reduce the overhead for integrity checking and maximize
protection capability. This paper also substantiates
such a synergetic case and provides a guideline for low-cost
dependability enhancement.
2. Fault Injection into Microprocessors
Transient faults can be injected into a microprocessor
in many ways. Each method offers different controllabil-
ity over the time and location of the fault injection, level
of perturbation to the processor, and simulation time and
cost requirement. Commonly used hardware methods are
pin-level injection [6], [7], heavy-ion radiation [8], [9], and
electro-magnetic disturbances [10], [11]. Recently, non-
destructive laser has also been introduced [12], [13]. All of
these methods closely imitate real fault situations, but they
are usually expensive and applicable only after the phys-
ical chip is available. On the other hand, software fault
injection is a low-cost method that can also be applied to
programs and operating systems [14], [15]. Software
methods can be classified into two classes. The first
class is the software-implemented method, where the processor
state or programs are modified during compile or run
time and the injection takes place on real hardware. The
other class is the simulation-based method, where the processor,
workload, and fault injections are all modeled in software
simulation. In general, the latter is more flexible than
the former as it provides better controllability of fault injec-
tion and observability of system behavior. In this paper, we
use a simulation-based approach.
Several tools have been developed to automate fault in-
jection experiments and analyze the observations made dur-
ing the experiments [4], [5]. GOOFI in [16] is another
object-oriented injection tool that is designed to be portable
to different platforms. The efficiency of diverse fault in-
jection tools is compared in [17]. An advanced tool re-
duces simulation time by conducting more than one injec-
tion simultaneously, and also supports event handling mech-
anisms and multiple system/fault models. Since fault simu-
lation space is so large, it is always very challenging to ob-
tain accurate behavior analysis in an acceptable time frame.
Therefore, a proper fault injection tool and technique need to
be selected after careful examination for each target processor.
In our study, we adapt existing techniques, but suitably
modify them for the picoJava-II core and our experimental
environment. This is explained in detail in Section 4.
Emulated fault models affect the fault manifestation.
The most common method to model the transient fault is
logic inversion, where each fault flips some logic signals
temporarily. A study in [18] indicates that error behav-
ior modeling is dependent on workload and hence various
workloads need to be considered. However, there may also
be common characteristics over different workloads that can
be used in fault tolerance design. For example, SimR2K, a
32-bit RISC tested in [19], exhibited very similar sensitivity
patterns when faults were injected for several workloads.
More importantly, the effects of transient faults strongly
vary with processor architecture and possibly fault injec-
tion methodology. In [20], a jet engine controller called
HS1602 was upset by current and voltage transients. The re-
sults show that faults in the arithmetic unit are most likely to
propagate and result in logic failure. In another experiment,
the RTL model of the IBM RT PC was injected with single-
cycle inverted transient faults in [21]. About 60 to 70% of
injected faults were overwritten. The study also reports that
the attributes of the workload such as instruction types and
control flow structures are good indicators of error behav-
ior. Nevertheless, this claim has not been fully verified and
the hardware organization might have more direct impacts.
Another software modeled 32-bit RISC, called TRIP, was
tested using VHDL in [22]. The fault injection was per-
formed by toggling the value of randomly chosen internal
state element bits. While 34% of faults were overwritten at
run-time, only 23% of faults were effective, i.e., the faults
resulted in processor failure. It should be noted that pro-
cessors are capable of masking out some faults without any
intended fault protection mechanism.
Other systems or processors investigated for error be-
havior include MC6809E with heavy-ion radiation and
power supply disturbance [9], MC68000 with device-level
simulation [23], SPARC1 system with physical injection
[24], MC88100 with combination of software-implemented
and simulation-based fault injection [25], DP32 [26] and
MC68302 [27] with VHDL simulations, and MARK2 with
simulated stuck-at and open-line faults [28]. All these stud-
ies support the fact that each processor has a distinct level
of sensitivity to soft errors, and therefore, a new design re-
quires separate dependability evaluations and may be engi-
neered for lower sensitivity.
Once a soft error occurs in a logic block of a processor,
its propagation nature is mainly defined by the architecture
and workload of the processor. On the other hand, how often
the soft error occurs, i.e., the transient upset rate, is affected by
the fabrication process and circuit technology. More upsets
mean higher probabilities of soft error occurrence. In [29],
the same heavy ion was individually radiated into three units
of an ERC32 processor, and upset rates were different be-
cause the units employed diverse circuit types. Errors were
observed mostly in the register file and some in the combi-
national logic. Circuits of the integer unit were more sus-
ceptible to the ions than those of floating point and memory
control units. Another radiation testing on 486DX4 micro-
processors [30] shows that different implementations of a
common processor architecture result in susceptibility vari-
ation. When six 486DX4 processors from AMD and Intel
were bombarded with radiation beams, AMD’s chips were
more susceptible than Intel’s. In the experiment, the feature
size of AMD’s was smaller than Intel’s.
This paper focuses on what impacts soft errors make on
the computation (in picoJava-II core) rather than how often
faults generate the soft errors in the processor. Thus, our
study is independent of implementation and process tech-
nology. The purpose is to obtain susceptibility indices for
various parts to create a base for the development of archi-
tectural solutions.
3. PicoJava-II Microprocessor
The picoJava-II is a microprocessor core uniquely designed
for directly executing Java byte-code instructions defined
by the Java™ Virtual Machine (JVM) in hardware,
and it is used for cost-sensitive embedded applications [31].
The core consists of a six-stage RISC pipeline and instruc-
tion folding capability (a process of loading and execut-
ing an instruction in a single cycle). Chips based on the
picoJava-II are well suited for a wide range of information
appliances such as digital set-top boxes, Internet TVs, per-
sonal digital assistants, etc. As the use of such products in-
creases, their malfunctions due to transient faults may cause
serious loss of time, money, or even worse. Dependabil-
ity will become a more important quality measure for these
products. Therefore, sufficient estimation and enhancement
of dependability are imperative.
The picoJava-II core we use in this study is a soft in-
tellectual property (IP) as opposed to hard IP under the
Sun community source licensing program [32]. The soft
IP described in Verilog, a hardware description language
for logic design, specifies the detailed implementation of
the picoJava-II core and completely simulates the real chip.
To make the chip more suitable for a particular application,
some features of the core can be reconfigured. However, we

Figure 1. Block diagram of the picoJava-II.
A full description of the picoJava-II core architecture is
presented in [32]. Figure 1 illustrates our floor plan for an
implementation of the picoJava-II. The area ratios among
functional unit blocks (FUBs) are approximated by using
picoJava-II synthesis results from [33]. When the area is
measured in terms of 2-input NAND equivalent gates, more
than 69% of the chip is devoted to cache memory arrays
and control logic. The FUBs of the processor core are
(1) integer unit (IU); (2) instruction cache unit (ICU); (3)
instruction cache tag (ITAG); (4) instruction cache RAM
(ICRAM); (5) data cache unit (DCU); (6) data cache tag
(DTAG); (7) data cache RAM (DCRAM); (8) floating point
unit (FPU); (9) stack manager unit (SMU); (10) power-
down, clock, and scan unit (PCSU); (11) bus interface unit
(BIU); and (12) external memory controller (MEMC).
The IU decodes and executes instructions from an in-
struction buffer (I-buffer). It forwards floating point in-
structions to the FPU and communicates with the DCU for
data. The IU consists of a 32-bit ALU and shifter, a mul-
tiply/divide unit, a microcode ROM implementing multi-
cycle instructions, registers, a 64-entry stack cache (SC),
and trap generation/dependency checking/forwarding logic.
The ICU controls a 16-KB direct-mapped instruction cache
(I-cache) and the I-buffer. It fetches and dispatches instruc-
tions to the decode unit of the IU. The ITAG contains 1024
19-bit tags. Each tag corresponds to its associated line of
4 words in an ICRAM. The ICRAM holds instructions in a
RAM of 2048 entries by 64 bits. The DCU handles sourcing
and sinking of data from/to a data cache (D-cache) for load
and store instructions. The D-cache consists of a 1024 × 24-bit
DTAG and a 4096 × 32-bit DCRAM, configured for 16-
KB with 16-byte lines, 2-way set associative, write-back, and
write-allocate. The FPU executes the floating point instruc-
tions. The SMU handles overflow and underflow conditions
of the SC. The PCSU integrates power management, clock
generation, system reset, scan, and test. The BIU is the in-
terface between the core and external world via the MEMC.
The MEMC is the interface between the BIU and external
devices such as memory and I/O.
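
The cache parameters quoted above can be cross-checked with a little arithmetic. The following is only a sanity-check sketch; it assumes that a "word" is 32 bits (4 bytes), and the variable names are illustrative rather than taken from the picoJava-II model.

```python
# Cross-check of the cache geometry quoted in the text.
# Assumption (not stated explicitly in the paper): a "word" is 32 bits (4 bytes).

KB = 1024
WORD_BYTES = 4

# I-cache: ICRAM is 2048 entries x 64 bits; each ITAG tag covers a line of 4 words.
icram_bytes = 2048 * 64 // 8
icache_line_bytes = 4 * WORD_BYTES
icache_lines = icram_bytes // icache_line_bytes

# D-cache: DCRAM is 4096 entries x 32 bits, with 16-byte lines and 2 ways.
dcram_bytes = 4096 * 32 // 8
dcache_lines = dcram_bytes // 16
dcache_sets = dcache_lines // 2

print(icram_bytes // KB, icache_lines)               # 16 1024
print(dcram_bytes // KB, dcache_lines, dcache_sets)  # 16 1024 512
```

Both RAMs come out at 16 KB, and the line counts agree with the 1024 tag entries quoted for the ITAG and the DTAG.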
4. Experimental Setup
We conducted a set of transient fault injection simula-
tions on the RTL model of the picoJava-II processor. This
model has been synthesized, validated, and optimized by
several commercial vendors and universities. Based on a
fault model we defined, the faults were simulated entirely
in software while each application program was running.
Our fault model basically follows the conventional signal
inversion approach, but it is not restricted to a single-bit
failure. The probability of a fault occurrence is uniformly
distributed over operation time and logic location. A fault
in a logic block is manifested as a logic value toggle of its
output signals, from 0 to 1 or from 1 to 0. Multiple signal
bits can be corrupted by a single fault.
In our experiments, fault injection locations (FILs) in the
processor are determined on a minimal logic block basis.
Each FUB shown in Figure 1 is divided into sub-blocks af-
ter examining error propagation paths with test generation
rules. If logic elements in a FUB have common fault effects
on an output signal of the FUB, i.e., fault equivalent logic,
they are grouped together and treated as a single FIL. Con-
sequently, a FIL is a logic sub-block responsible for produc-
ing an output signal of a FUB, and a fault in the FIL means
a soft error in the corresponding output signal.
The fault injections are carried out through software in-
terfaces appended to the processor. Our interface is placed
at each output port of the FUB, and it takes the output signal
and produces a simulated output signal under the fault sim-
ulation control. This simulation control is another software
module that allows us to control the fault type and injection
timing and duration. The total number of error bits per in-
jection is chosen randomly with a minimum value of 1. The
fault mask randomly specifies bit positions for the signal in-
version. Although the injection interfaces are embedded in
the processor simulator, they are completely transparent to
any other component and active only in fault simulation.
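
The interface described above can be pictured as an XOR stage on each FUB output. The sketch below is a minimal illustration, not the authors' Verilog interface: the function names are hypothetical, and drawing the number of error bits uniformly from 1 to the signal width is an assumption (the text only says the count is random with a minimum of 1).

```python
import random

def make_fault_mask(width, rng):
    """Pick a random number of error bits (at least 1) and random bit positions.
    Assumption: the count is uniform over 1..width."""
    n_bits = rng.randint(1, width)
    positions = rng.sample(range(width), n_bits)
    mask = 0
    for p in positions:
        mask |= 1 << p
    return mask

def inject(signal, mask, fault_active):
    """Simulated output: masked bits are inverted while the fault is active;
    otherwise the signal passes through unchanged (transparent interface)."""
    return signal ^ mask if fault_active else signal

rng = random.Random(0)            # fixed seed so the sketch is repeatable
mask = make_fault_mask(32, rng)   # e.g. for a 32-bit output signal
clean = 0x0000_1234
faulty = inject(clean, mask, fault_active=True)
```

Because inversion is an XOR, applying the same mask twice restores the original value, which is what makes the interface transparent outside the injection interval.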
Table 1 lists all FILs identified for the picoJava-II FUBs
and their output signals. One can further break down each
FIL into smaller sub-FILs at a lower level of the RTL model
hierarchy, but it does not provide significant advantages in
understanding fault behavior despite the need for a large in-
crease in evaluation time. For example, 32 bit slices for
an address adder in FIL 1 can be separately examined, but
additional knowledge we expect to gain from that is very
small. Each FIL can include both random logic and memory
arrays, or only one of them. When the fault is injected
into the FIL, its function determines the type of soft error
being generated such as a control error, data error, etc.

Table 1. The FILs of picoJava-II FUBs and their output signals.
FIL Name               Bits | FIL Name               Bits | FIL Name               Bits | FIL Name           Bits
  1 iu_addr_e            32 |  29 iu_kill_fpu           1 |  57 dcu_in_powerdown      2 |  85 fpu_data_e       32
  2 iu_br_pc             32 |  30 hold_fpu              1 |  58 dcu_din_e            32 |  86 smu_rf_din       32
  3 iu_brtaken_e          1 |  31 icu_data             56 |  59 dcu_ram_we            4 |  87 smu_rf_addr       6
  4 iu_icu_flush_e        1 |  32 icu_drty              7 |  60 dcu_pwrdown           1 |  88 smu_we            1
  5 iu_psr               32 |  33 icu_vld_d             7 |  61 dcu_stat_addr        13 |  89 smu_sbase        30
  6 iu_shift_d            8 |  34 icu_length_d         28 |  62 dcu_bank_sel          2 |  90 smu_sbase_we      1
  7 iu_data_e            32 |  35 icu_pc_d             32 |  63 dcu_bypass            1 |  91 smu_hold          1
  8 iu_kill_dcu           1 |  36 icu_hold              1 |  64 dcu_tag_in           19 |  92 smu_addr         32
  9 kill_inst_e           1 |  37 icu_din              32 |  65 dcu_stat_out          5 |  93 smu_st            1
 10 iu_special_e          1 |  38 icu_ram_we            2 |  66 dcu_set_sel           1 |  94 smu_ld            1
 11 iu_dcu_flush_e        3 |  39 icram_powerdown       1 |  67 wb_set_sel            1 |  95 smu_data         32
 12 iu_inst_e             8 |  40 icu_addr             14 |  68 dcu_tag_we            1 |  96 smu_na_st         1
 13 iu_zero_e             1 |  41 icu_tag_in           18 |  69 dcu_stat_we           5 |  97 pj_nmi_sync       1
 14 iu_data_in           32 |  42 icu_tag_vld           1 |  70 dcu_addr_out         32 |  98 pj_irl_sync       4
 15 iu_sbase_we           1 |  43 icu_tag_we            1 |  71 smu_stall             1 |  99 pcsu_powerdown    1
 16 iu_optop_din         32 |  44 icu_in_powerdown      1 |  72 smu_data_vld          1 | 100 pj_dcuack         2
 17 iu_optop_int_we       1 |  45 pj_icureq             1 |  73 dcu_smu_st            1 | 101 pj_icuack         2
 18 ret_optop_update      1 |  46 pj_icusize            2 |  74 pj_dcureq             1 | 102 pj_datain        32
 19 iu_smu_flush          1 |  47 pj_icutype            1 |  75 pj_dcusize            2 | 103 pj_data_out      32
 20 iu_rf_dout           32 |  48 pj_icuaddr           32 |  76 pj_dcutype            3 | 104 pj_address       30
 21 iu_smiss              1 |  49 itag_dout            18 |  77 pj_dcuaddr           32 | 105 pj_size           2
 22 iu_smiss_addr        32 |  50 itag_vld              1 |  78 pj_dataout           32 | 106 pj_type           4
 23 iu_smiss_data        32 |  51 ic_hit                1 |  79 tag_dout             19 | 107 pj_tv             1
 24 iu_powerdown_op_e     1 |  52 icram_dout [ICRAM]   64 |  80 dtg_stat_out          5 | 108 pj_data_in       32
 26 iu_rs2_e             32 |  54 dcu_data             32 |  82 hit1 [DTAG]           1 |
 27 fpop                  8 |  55 iu_stall              1 |  83 dcram_dout [DCRAM]   64 |
 28 fpop_valid            1 |  56 dcu_err_ack           3 |  84 fp_rdy_e              1 |

[Diagram labels: initialization and warm-up period; IAS; RTL.]
Figure 2. Fault injection and observation timing diagram.
Four application programs implementing different algorithms
were used as benchmark workloads. Bubble implements
an elementary bubble sort method for sorting 64 integer
numbers. The algorithm involves comparing and exchanging
elements to position them properly in a data structure.
Crypt performs encryption and decryption using IDEA (International
Data Encryption Algorithm) on an array of N
bytes. This Java kernel program is a part of the Java Grande
Forum Benchmark Suite [34]. N was chosen to be 800 in
our study. We also used an optimized Java version of the
well-known Linpack benchmark. Linpack solves an N × N
linear system using LU factorization followed by a triangular
solve. It measures floating point performance and is numerically
intensive. Our problem size was 25 × 25. Queens
finds all the solutions of the Eight Queens problem on an
8 × 8 chessboard using a recursive algorithm. We believe
these programs are representative of frequently used oper-
ations in the applications of the picoJava-II processor. All
benchmarks were tested on both the picoJava-II and other
JVM platforms for a cross-check.
To obtain statistically significant results, a large number
of fault cases must be considered. Thus, minimizing simu-
lation time per injection was necessary for us. With more
simulation control and a slightly lowered level of observa-
tion detail, several faults can be examined during a single
execution. Figure 2 illustrates how we performed the fault
injection and made observation in a targeted FIL on the
picoJava-II. At the beginning of each simulation run, the
RTL simulator invokes another processor simulator, called
the Instruction Accurate Simulator (IAS), which models all
functionality of the picoJava-II by instruction boundaries.
The architectural states of both simulators are identical af-
ter the commitment of every instruction under the fault-
free condition. The primary purpose of the IAS is to serve
as a golden run for monitoring the RTL simulation when
the fault is active. Fault injection points (FIPs) at which
faults are injected into the processor are considered after the
initialization and warm-up period. The FIP1 is randomly
picked from a window of 50 clock cycles starting from the
first basepoint (BP1). At the FIP1, the RTL simulation is
checkpointed, i.e., saving the current state of the simulated
processor, during the first half of the clock period. When
the clock goes low, the fault injection begins and lasts
for a predefined interval, denoted by fault injection duration
(FID). A single clock cycle may be enough for the injected
fault to be latched as a soft error. The longer the fault is
resident in the FIL, the more likely it is to affect the proces-
sor operation. Given this behavior we chose 1 and 10 clock
cycles for short and long FIDs, respectively.
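
The timing parameters above can be turned into an injection schedule. The sketch below is a hedged illustration: the 50-cycle window and the 1- and 10-cycle FIDs come from this paragraph, the 600-cycle base interval is the value reported later in this section, and the warm-up length and function name are hypothetical.

```python
import random

BASE_INTERVAL = 600   # cycles between base points (value reported in the paper)
FIP_WINDOW = 50       # each FIP is drawn from this many cycles after a base point
FIDS = {"short": 1, "long": 10}   # fault injection durations in clock cycles

def fault_injection_points(n_fips, warmup_cycles, rng):
    """Return (base_point, fip) cycle pairs, one per injection.
    warmup_cycles is a placeholder; the paper does not give its length."""
    points = []
    for k in range(n_fips):
        bp = warmup_cycles + k * BASE_INTERVAL
        fip = bp + rng.randrange(FIP_WINDOW)   # uniform within the window
        points.append((bp, fip))
    return points

schedule = fault_injection_points(100, warmup_cycles=5000, rng=random.Random(1))
```

With these values even a long fault ends well before the next base point, so most of each interval remains for the error to propagate or be overwritten.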
After the injection at FIP1, the simulation is monitored
up to the next base point (BP2), and then the architectural
state of the RTL simulator is compared with the state of
the IAS. If the fault has been effective, the states mismatch.
In this case, the RTL simulator rolls back to FIP1 using
the checkpointed state and the IAS is paused until the RTL
reaches BP2 with no fault injection. Otherwise both simu-
lators proceed to FIP2. The same procedure is repeated for
subsequent fault injections. The more FIPs tested for a FIL,
the higher accuracy is achieved in estimation. We chose 100
FIPs during a benchmark run for a single FIL.
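
The checkpoint, compare, and rollback procedure can be sketched with a toy model. Everything below is hypothetical: `ToyCore` stands in for both the RTL simulator and the IAS, its single 8-bit register stands in for the architectural state, and the fault lasts one cycle (FID = 1). It illustrates the control flow only, not the authors' tooling.

```python
import copy

class ToyCore:
    """Stand-in simulator: the architectural state is one 8-bit register."""
    def __init__(self):
        self.cycle = 0
        self.acc = 0

    def step(self, fault_mask=0):
        # The only logic block outputs acc + 1; a fault inverts the masked bits.
        out = (self.acc + 1) ^ fault_mask
        self.acc = out & 0xFF          # bits above bit 7 are never stored
        self.cycle += 1

    def run_to(self, cycle):
        while self.cycle < cycle:
            self.step()

def run_campaign(fips, base_interval, mask):
    """Inject one single-cycle fault per FIP; return the number of effective faults."""
    rtl, golden = ToyCore(), ToyCore()
    effective = 0
    for fip in fips:
        next_bp = (fip // base_interval + 1) * base_interval
        rtl.run_to(fip)
        checkpoint = copy.deepcopy(rtl)    # save the state at the FIP
        rtl.step(fault_mask=mask)          # faulty cycle
        rtl.run_to(next_bp)
        golden.run_to(next_bp)             # fault-free reference run
        if rtl.acc != golden.acc:          # mismatch: the fault was effective
            effective += 1
            rtl = checkpoint               # roll back to the FIP
            rtl.run_to(next_bp)            # rerun with no fault injected
    return effective

print(run_campaign([10, 700], 600, mask=0x01))   # 2
print(run_campaign([10, 700], 600, mask=0x100))  # 0
```

The second call shows a masked fault: the inverted bit lies outside the stored register, so the state at the base point still matches the reference and no rollback is needed.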
In our simulation, every single fault injection created a
soft error in the FIL. The effects of different occurrence timings
of soft errors were examined. We define the soft error sensitivity
(SES) of a FIL as the probability that a soft error in
the FIL causes the processor to behave erroneously or have
an incorrect architectural state. The outcome of each fault
injection at a FIP is classified into four cases: 1) no harmful
impacts are made to the processor; 2) the program running
on the processor hangs; 3) the system crashes before the
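simulation reaches the next base point for the state comparison;
or 4) the simulation continues to the base point, but
the architectural state of the RTL simulator does not match
that of the IAS. Cases 2 to 4 are the harmful outcomes, so the SES is
$\mathrm{SES} = \sum_{i=2}^{4} (\text{number of faults in Case } i) / (\text{number of faults})$. We
measured the sensitivities of all 109 FILs.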
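
As a small worked illustration of the definition, the SES of a FIL can be estimated from the classified outcomes of its injections. The sketch assumes that outcome classes 2 to 4 all count as harmful, which is our reading of the definition; the function name and the sample data are hypothetical and are not measurements from the paper.

```python
from collections import Counter

# Outcome classes: 1 = no harmful impact, 2 = program hang,
# 3 = system crash before the next base point,
# 4 = the run reaches the base point in an incorrect architectural state
#     (our reading of the fourth case).
HARMFUL = (2, 3, 4)

def soft_error_sensitivity(outcomes):
    """Fraction of injected faults whose outcome falls in a harmful class."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return sum(counts[c] for c in HARMFUL) / total

# Hypothetical outcomes for the 100 FIPs of one FIL (illustration only).
example = [1] * 70 + [2] * 5 + [3] * 15 + [4] * 10
print(soft_error_sensitivity(example))  # 0.3
```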
It is important to use a sufficiently large base interval so
that the soft error has enough time to manifest as a notice-
able failure. We set the base interval to 600 clock cycles.
It should be noted that the soft error may not generate any
failure or may be overwritten. Other metrics such as error
latency and propagation pattern are helpful in fault char-
acteristics analysis. However, we focused only on the SES
collection and analysis. It still enables us to obtain a reason-
able but important level of understanding of the picoJava-II.
5. Soft Error Sensitivity of the PicoJava-II
In this section, we present and analyze the SES results
for four workloads. The results illustrated together are
based on the common simulation parameters unless otherwise
specified. We repeated our experiments a sufficient number
of times to achieve high confidence in our simulation.
The error of our SES estimate is less than 0.098 for a 95%
confidence level. The architectural significance of the FILs is
reviewed and their relations with the SES are considered.
We also point out how the SES information can be used
in devising an integrity checking scheme for the
picoJava-II, and evaluate how well the techniques used in
current microprocessors reflect the soft error behavior.
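
The quoted error bound is consistent with the usual normal-approximation interval for a proportion estimated from 100 injections per FIL, taken at the worst case p = 0.5. The paper does not state this derivation, so the sketch below is only a plausible check; the function name is hypothetical.

```python
import math

def worst_case_margin(n, z=1.96):
    """Half-width of the normal-approximation confidence interval for a
    proportion, evaluated at p = 0.5 where the variance p(1 - p) is largest.
    z = 1.96 corresponds to a 95% confidence level."""
    return z * math.sqrt(0.25 / n)

print(round(worst_case_margin(100), 3))  # 0.098
```

For sensitivities close to 0 or 1 the margin is smaller, so 0.098 acts as an upper bound under this assumption.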
Integer unit: Figure 3 depicts the distribution of effec-
tive faults in FILs of the IU. The FID is 1 clock cycle. The
most apparent characteristic in this figure is that only a few
FILs are highly sensitive to the faults. There are many FILs
whose corrupted output signal values do not result in any
erroneous behavior of the processor. We also note that sen-
sitive FILs are mostly common for different programs and
the types of impacts are similar. Clearly, a faulty signal is
effective only if it plays an active role in the processor’s op-
eration. If the signal is not stored, the window when the
processor is vulnerable to the fault is only 1 clock period.
Even if the signal can propagate to other components, there
is still a chance that the error is overwritten before it creates
any failure. The results indicate that many sub-blocks in the
IU infrequently become a critical part of the operation.
Fewer than 10 sub-blocks overall are considerably susceptible.
FIL 1 produces the address of the I-cache and D-cache.
An incorrect address here means that a memory access
goes to a wrong location. On a write, the memory state may
become faulty. On a read, unexpected data or instructions
may be delivered to the processor. Consequently, a soft er-
ror in this sub-block is critical. However, it is only occa-
sionally true because the cache memories are not always
accessed. The erroneous write may occur at an invalid loca-
tion and never get used. Although it is not very likely, data
items or even instructions from multiple locations may be
the same. Thus, reading from any of these locations yields identical
results. In short, there are many conditions that can stop or
trigger the malicious actions of the soft error in FIL 1. We
have observed that the cache address error causes system
crashes when it is effective.
Memory address checking is performed to some extent
in conventional processors. The memory management unit
checks the address range of each access, yet this cannot de-
tect an error within an accessible region. The address gener-
ator, i.e., an adder, can be protected with parity prediction,
or a similar kind of code. Duplication may be too costly
for low-end systems. The frequency of the cache access de-
pends on the program, and it may increase in superscalar
processors. In any case, relative importance of this sub-
block is high, and thus, it is worth covering.
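
Parity prediction for an adder rests on the identity s_i = a_i XOR b_i XOR c_i, so the parity of the sum equals the parity of the two operands XORed with the parity of the carries entering each bit position. The sketch below illustrates the check in software, assuming operands that fit in the given width; the function names are hypothetical, and in hardware the carries would come from the adder (or a duplicated carry chain) rather than being recomputed.

```python
def parity(x):
    """1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1

def add_with_carries(a, b, width=32):
    """Ripple-carry add; returns (sum, vector of carries INTO each bit position)."""
    carry, total, carries = 0, 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        carries |= carry << i
        total |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (carry & (ai ^ bi))
    return total, carries

def check_address(a, b, observed_sum, width=32):
    """True if the parity of the observed sum matches the predicted parity."""
    _, carries = add_with_carries(a, b, width)
    predicted = parity(a) ^ parity(b) ^ parity(carries)
    return parity(observed_sum) == predicted

s, _ = add_with_carries(0x1234, 0x00FF)
print(check_address(0x1234, 0x00FF, s))         # True
print(check_address(0x1234, 0x00FF, s ^ 0x10))  # False: single-bit error detected
```

A single parity bit only catches errors that flip an odd number of sum bits; an even number of flipped bits passes the check.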
FIL 3 is a part of branch logic informing the ICU and the
pipeline that a branch should occur after the instruction in
the execution stage. When it is faulty, the control flow of
the program may change. It can affect the timing of the pipeline
as well. As soon as the execution deviates from the correct
flow, the processor state becomes erroneous, which can lead
the processor to crash within the base interval. As a result,
the SES goes up to 0.86. This sub-block is a critical com-
ponent of the IU. FIL 6 is another critical part as it can af-
fect the program counter and instruction sent to the I-buffer.
Since this sub-block is also concerned with the control flow,
its erroneous behavior is similar to that of FIL 3.
Check code-based protection cannot be easily applied to
protect random logic in FUBs such as FIL 3 and FIL 6. Unlike
memory arrays, the relation between input and output
signals varies dynamically. A simple approach is to duplicate
the logic. The replication can be considered at any level
of logic block hierarchy. FUB-level duplication is the sim-
plest, but area overhead is large. Critical sub-blocks can be
selectively duplicated to lower the overhead.
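
One way to picture selective duplication is as a budgeted selection over sub-blocks ranked by sensitivity. The sketch below is an illustration under stated assumptions, not the paper's procedure: it assumes that faults strike a sub-block in proportion to its area and that duplicating a sub-block costs its own area, so that the SES itself measures coverage gained per unit of redundancy. The function name, block names, and all numbers are made up.

```python
def select_for_duplication(blocks, area_budget):
    """Greedy pick in order of decreasing SES until the area budget is spent.
    blocks: {name: (ses, area)}; returns the chosen names in pick order."""
    ranked = sorted(blocks, key=lambda name: blocks[name][0], reverse=True)
    chosen, used = [], 0
    for name in ranked:
        _, area = blocks[name]
        if used + area <= area_budget:
            chosen.append(name)
            used += area
    return chosen

# Hypothetical sub-blocks: (SES, area in arbitrary units). Not measured data.
blocks = {"A": (0.9, 10), "B": (0.8, 40), "C": (0.05, 5), "D": (0.5, 10)}
print(select_for_duplication(blocks, area_budget=25))  # ['A', 'D', 'C']
```

Block B is skipped despite its high sensitivity because it alone exceeds the budget, which shows why finer-grained sub-blocks give the selection more room to work.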
An error in FIL 8 or FIL 9 can result in an unexpected ter-
mination or continuation of the D-cache access instruction
and its operation. FIL 12 controls the size, type, cacheabil-
ity, and endianness of the D-cache access. FIL 13 signals
the D-cache to fill a particular cache line with 0’s. All these
sub-blocks are critical only when the D-cache is in action.
Among the IU sub-blocks, FIL 15 is the most susceptible
to the error. This is because the earliest entry of the SC is
always corrupted if its output signal is toggled. If this
happens, the process is very likely to be suspended. It should
be noted that the SES of FIL 15 for Bubble is 1. This portion
of the processor should be protected first. FIL 16 has a similar
impact as it is responsible for the top of the stack pointer.
[Plot panel label: Queens (FID = 1 clock)]
Figure 3. The SES of the integer unit (IU) when FID = 1.
low power management. The last critical sub-block FIL 24
may assert a false signal to the PCSU, which eventually puts
the I-cache and D-cache in standby mode. In this case, the
system crashes immediately. Converting standby mode to
Figure 4. Possible logical positions of a FIL.
So far we have inspected relatively critical sub-blocks in
the IU. Their functions are closely connected with essential
steps of instruction execution. They are frequently used or
placed in a logical position such that their malfunctions can
often be realized as a failure. Figure 4 illustrates two logical
positions in which a FIL can be situated relative to other
sub-blocks in the processor. The OR and AND are not binary
logic gates, but rather conceptual blocks that may consist of
several processor components. Figure 4(a) is the case where
the processor behavior is directly governed by the FIL,
independent of other sub-blocks. It has its own impact on the
system behavior. On the other hand, Figure 4(b) represents
the case where the FIL is effective only with the assistance
of other sub-blocks. Unless all the related sub-blocks are
active, the faulty FIL has no effect. FIL 15 is an example of
the first case. If a FIL falls in the second case, its SES
depends on the active period of the other related sub-blocks.
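The dependence on the active period can be made concrete with a small Python sketch. It computes, for a given fault duration, the fraction of possible injection start cycles at which the faulty period overlaps a cycle where every related sub-block is active. The activity trace and cycle counts are invented for illustration, and the result is only an upper bound on the SES, since an overlapping fault need not cause a failure.

```python
from typing import Sequence

def effective_fraction(related_active: Sequence[Sequence[bool]],
                       n_cycles: int, fid: int) -> float:
    """Fraction of fault start cycles whose faulty period of `fid` clocks
    overlaps a cycle in which all related sub-blocks are active.

    An empty `related_active` models the case of Figure 4(a): the FIL acts
    on its own, so every injection can matter.
    """
    all_active = [all(trace[c] for trace in related_active)
                  for c in range(n_cycles)]
    starts = range(n_cycles - fid + 1)
    hits = sum(any(all_active[s:s + fid]) for s in starts)
    return hits / len(starts)

# Hypothetical trace: the related sub-block is active for 2 of 2000 cycles.
trace = [c in (1000, 1001) for c in range(2000)]

print(effective_fraction([], 2000, 1))        # Figure 4(a)-like position
print(effective_fraction([trace], 2000, 1))   # Figure 4(b)-like, FID = 1
print(effective_fraction([trace], 2000, 10))  # Figure 4(b)-like, FID = 10
```

Under these assumptions a longer fault duration raises the chance of overlapping a rare active window, which is consistent with more sub-blocks becoming sensitive as the FID grows.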
In our experiment, there are many IU FILs that seemed
unaffected by the fault injection for both FIDs tested (1 and
10). This can be explained by the functionalities of the sub-
blocks. Non-critical sub-blocks are generally involved in
infrequently occurring operations and their logical positions
are mostly as is the case in Figure 4(b). For example, erro-
neous FIL 14 appears as a failure only if FIL 15 asserts a
write enable signal, which is a rare event.
Although some FILs fall under the case of Figure 4(a), their
errors may not be serious. For example, a faulty FIL 4
flushes or invalidates an I-cache line, but this only produces
an additional memory access. However, not flushing an
I-cache line related to self-modifying code support, or
flushing a dirty line in the D-cache (FIL 11), can alter the
processor state. FIL 30 can stall the FPU for extra cycles,
but the processor maintains the computation integrity.
Similar reasoning applies to other sub-blocks.
Many FILs never exhibit any effective fault in our results,
but this does not mean that they are not required components.
It simply shows that the randomly selected erroneous periods
of the FILs did not overlap with their active cycles, or that
their faults have nothing to do with the correctness of the
computation. When the FID was increased to 10 clocks,
more effective faults were observed. Figure 5 presents the
SES results in that case. A FID of 10 is an extreme example
modeling a strong noise hit that prolongs the circuit’s
recovery time. FILs 4, 10, and 28 start to show their
susceptibility. Interestingly, for FIL 15 with Linpack, the SES
is slightly lower than in the case with a FID of 1. A possible
explanation is that a faulty signal at an earlier cycle
corrupted the first entry of the SC, and then a faulty signal
at a subsequent cycle unintentionally corrected it back to
normal. Different FIDs change the type of failure to a small
degree. In order to check whether a further increase in the FID
makes a difference in non-critical sub-blocks, we injected
them with long-duration faults of 500 clocks for Crypt and
monitored them using an 800-clock base interval. Figure 6
depicts the sensitivities of non-critical sub-blocks in all FUBs.
Many sub-blocks become somewhat sensitive.
When considering a protection plan for the IU, our SES
results can serve as a standard for ranking the sub-blocks in
order of importance. We have found that the sub-blocks with
two opposite characteristics, highly susceptible and seldom
susceptible to the soft error, do not vary with workloads.
Moreover, this holds for faults with a longer FID
as well. From our data, we believe that checking the top
few critical sub-blocks alone enhances the integrity of the
[Plot panel label: Queens (FID = 10 clocks)]
[Plot panel label: Crypt (FID = 500 clocks)]
Figure 6. Non-critical sub-blocks (FID = 500).
in cache memory, which contains the most frequently used
data and covers most of the memory access requests. De-
pending on the logic type of the sub-block, diverse fault
handling techniques can be employed. If we take into ac-
count the design and verification complexity, duplication is
a fair choice for random logic. We claim that a protection
method based on the SES is an efficient way to maximize the
fault coverage with limited resources.
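One simple way to turn an SES ranking into a protection plan is a greedy selection under an area budget, sketched below in Python. The selection rule, the area costs, and the budget are assumptions made for illustration. Only the SES values 1 (FIL 15 for Bubble) and 0.86 (FIL 3) echo figures quoted in the text; the remaining entries are placeholders.

```python
from typing import Dict, List, Tuple

def select_for_protection(blocks: Dict[str, Tuple[float, float]],
                          area_budget: float) -> List[str]:
    """Pick sub-blocks to protect, highest SES first, within an area budget.

    `blocks` maps a sub-block name to (ses, protection_area_cost).
    Sub-blocks with zero SES are never selected.
    """
    chosen: List[str] = []
    remaining = area_budget
    ranked = sorted(blocks.items(), key=lambda kv: kv[1][0], reverse=True)
    for name, (ses, cost) in ranked:
        if ses > 0 and cost <= remaining:
            chosen.append(name)
            remaining -= cost
    return chosen

# Illustrative inputs; costs are in arbitrary area units.
example = {
    "FIL 15": (1.00, 3.0),
    "FIL 3":  (0.86, 2.0),
    "FIL 1":  (0.50, 4.0),   # placeholder SES
    "FIL 14": (0.00, 1.0),   # placeholder SES
}
print(select_for_protection(example, area_budget=6.0))
```

A greedy pass by SES alone ignores how much area each unit of sensitivity costs; ranking by SES per unit of cost would be an equally plausible variant.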
Instruction cache: Figure 7 shows the SES data mea-
sured for the I-cache. For a FIL, the two columns represent
the SES when FID is 1 and 10 clocks, respectively. After
observing similar patterns of the SES over different work-
loads, only the maximum SES of the four is plotted for each
column. There are fewer critical sub-blocks than in the IU. FIL 31
represents instructions stored into the top 7 bytes of the I-
buffer. Its error results in the execution of incorrect instruc-
tions, but the instructions are effective only when their valid
bits are set. In addition, they are removed on a branch or
trap, rendering them inactive. Even though this sub-block is
a memory array, we first injected the faults for a single clock
to model transient faults in its random logic portion. If the
faults occur in the memory cells, the data error stays until
overwritten. This situation was moderately observed with a
FID of 10. As expected, the SES of FIL 31 greatly increases
with a longer fault duration. Errors in FIL 32 cause instruc-
tion state transitions. FIL 38 is critical because it allows
instructions in the ICRAM to be modified. FIL 52 is ac-
tive when its data bytes are transferred to the I-buffer. Any
corruption in the I-buffer is nullified by a flush.
Combining Figures 6 and 7, distinctively non-critical
sub-blocks of the I-cache are FILs 41, 42, 44, 49, and 50.
FILs 41 and 42 are effective only when FIL 43 is active.
This models the case in Figure 4(b). If their errors propa-
gate to the ITAG, they are more likely to appear as a false I-
cache miss, in which the processor fetches instructions from
the memory instead. The situation is similar for FIL 49 and
FIL 50, and therefore, they become hardly sensitive to the
faults. The PCSU shuts off the clock for standby mode only
if both the ICU and DCU signal that they are ready. A false
assertion by FIL 44 alone is automatically masked.
For the I-cache protection, instruction memory arrays
such as the I-buffer and ICRAM should be considered first.
Also, control logic for them needs to be covered before the
ITAG and its control logic. The tag portion of the I-cache is
much less error-sensitive than the data portion. In conven-
tional processors, memory arrays for both tag and data are
often protected with ECCs, but surrounding control logic is
left unprotected. Control logic errors have a short lifetime
unless they are propagated to a storage component, whereas
data errors in memory arrays remain active for a longer pe-
riod. In this context, it is logical to protect memory arrays.
However, highly susceptible control logic is as critical as
instruction memories.
Data cache: The D-cache handles the processor’s data
write requests as well as reads. Accordingly, it requires
more controls and functions with diverse soft error
characteristics. In our results, SES levels of the D-cache are
higher for Bubble and Queens than for the other programs
because their operations are memory-intensive. However, all
SES increases appear only in those sub-blocks that are
error-sensitive for the other programs. Figure 8 is the
maximum SES chart for the D-cache. More FILs seem to be
sensitive here than in the I-cache. The SES level of the
D-cache is directly proportional to its access frequency.
[Plot: x-axis FILs 31-52; series: Max (Bubble, Crypt, Linpack, Queens); group label: ICRAM]
[Plot: x-axis FILs 53-83; series: Max (Bubble, Crypt, Linpack, Queens); group label: DCRAM]
Figure 8. The SES of the D-cache.
neous transition from valid to invalid simply makes the
processor wait for valid data, whereas a transition in the
opposite direction lets unexpected data be delivered to the
IU or SMU. A corrupted data transfer also occurs if FIL
54 or FIL 83 fails. FIL 54 aligns data items read from the
DCRAM (FIL 83) and puts them on the bus. Since the
functions of FIL 54 and FIL 83 are performed in sequence,
their SES levels and patterns are very similar. FIL 56 is the
most sensitive sub-block as it signals a synchronization error,
I/O error, and/or memory error in data access. Like FIL
38 in the I-cache, FIL 59 may modify data in the DCRAM
with a false write enable signal. Other write enable signals
are produced by FIL 68 for tags and FIL 69 for status bits.
FIL 60 acts as a sleep signal to the DCRAM and DTAG.
Improperly disabling the D-cache results in a process
failure. FILs 61 and 70 are the address for the D-cache access.
An address error may be resolved as an access miss or lead
to the other situations that we discussed for FIL 1. Thus,
their sensitivities are close to that of FIL 1. FIL 63 controls
the bypassing data path for non-cacheable loads and stores.
If it selects a wrong path, a memory transaction error occurs.
A DTAG error creates an incorrect hit or miss decision.
The miss rate of the D-cache is relatively high, and therefore
more false hits (erroneous decision changes from miss to
hit) may occur than in the I-cache. Moreover, a false miss
for a dirty line causes stale data to be delivered from the
memory. This is why the DTAG exhibits some error
susceptibility, unlike the ITAG. A bit-toggle at the outputs of
FIL 81 or 82 guarantees a mis-decision. On the other hand, a
faulty FIL 79 still has a possibility of avoiding false hits
because the output signal is 19 bits. We need to note that how
often a fault in a FIL eventually corrupts its output signal is a
different issue from how the output error affects the
computation. The former is mainly related to the upset rate of
the FIL, and the latter is the SES that we investigate here. In
reality, the upset rates for FIL 79 and FIL 81 might be similar.
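The difference between a toggled one-bit decision signal and a toggled bit in a 19-bit tag can be illustrated with a short Python sketch. The tag values are arbitrary, and modeling the fault as a single-bit flip on the tag value entering the comparison is our assumption rather than a description of the picoJava-II tag logic.

```python
from typing import List

TAG_BITS = 19  # width of the multi-bit output mentioned in the text

def is_hit(stored_tag: int, request_tag: int) -> bool:
    """Tag comparison: a hit requires all tag bits to match."""
    return stored_tag == request_tag

def false_hit_positions(stored_tag: int, request_tag: int) -> List[int]:
    """Bit positions whose single-bit flip in the stored tag would turn
    the comparison into a hit."""
    return [i for i in range(TAG_BITS)
            if is_hit(stored_tag ^ (1 << i), request_tag)]

# A miss whose tags differ in exactly one bit: one flip gives a false hit.
print(false_hit_positions(0b101, 0b100))
# A miss whose tags differ in two bits: no single flip gives a false hit.
print(false_hit_positions(0b111, 0b100))
# A toggled one-bit hit/miss signal, by contrast, always inverts the decision.
print(not is_hit(0b111, 0b100))
```

Under this model a single flipped tag bit produces a false hit only when the two tags already differ in exactly that bit, whereas flipping any tag bit of a true hit always produces a false miss.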
When the FID was increased to 500 for testing rarely
sensitive sub-blocks, faults were effective in FILs 64, 71, and
72, but not in FILs 57 and 73, as shown in Figure 6. The
reason for FIL 57 is the same as that for FIL 44 of the ICU.
FIL 73 notifies whether or not a store on an SC write miss is
completed in the D-cache. The SC miss is a rare event. A false
signal in one direction induces more wait cycles, and a false
signal in the other direction is not very prone to a failure.
The general protection scheme for the D-cache can be
the same as for the I-cache in that memory arrays are checked
with an ECC and at least the critical control sub-blocks are
covered. For a write-back D-cache as employed in picoJava-II,
up-to-date data items may be available only in the cache.
Therefore, more integrity checking needs to be considered for
the D-cache than for the I-cache. A write-through policy is
most commonly used when parity protection is used. In the
case of error detection in a cache line, simply invalidating the
line then accomplishes error recovery.
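Recovery by invalidation under a write-through policy can be sketched as follows. The class below is a toy model written for illustration, with one parity bit per cached word and a hypothetical `inject_bit_flip` helper standing in for a soft error; it does not describe the picoJava-II D-cache, which is write-back.

```python
from typing import Dict, Tuple

def parity(x: int) -> int:
    """Return 1 if x contains an odd number of 1 bits, else 0."""
    return bin(x).count("1") & 1

class WriteThroughCache:
    """Toy write-through cache with a parity bit per cached word."""

    def __init__(self, memory: Dict[int, int]) -> None:
        self.memory = memory
        self.lines: Dict[int, Tuple[int, int]] = {}  # addr -> (data, parity)
        self.recoveries = 0

    def write(self, addr: int, data: int) -> None:
        self.memory[addr] = data                 # memory is always up to date
        self.lines[addr] = (data, parity(data))

    def read(self, addr: int) -> int:
        if addr in self.lines:
            data, stored_parity = self.lines[addr]
            if parity(data) == stored_parity:
                return data
            del self.lines[addr]                 # parity error: invalidate
            self.recoveries += 1
        data = self.memory[addr]                 # miss or recovery: refill
        self.lines[addr] = (data, parity(data))
        return data

    def inject_bit_flip(self, addr: int, bit: int) -> None:
        """Model a soft error: flip one data bit, leave the parity bit."""
        data, stored_parity = self.lines[addr]
        self.lines[addr] = (data ^ (1 << bit), stored_parity)

cache = WriteThroughCache(memory={})
cache.write(0x40, 0xABCD)
cache.inject_bit_flip(0x40, 3)
print(hex(cache.read(0x40)), cache.recoveries)
```

With a write-back cache the same invalidation would lose a dirty line, because memory no longer holds a clean copy; that is where an error-correcting code rather than detection alone becomes necessary.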
FPU, SMU, PCSU, BIU, and MEMC: In Figure
9, the maximum SES results for the remaining FUBs are
presented where the FID is 1 and 10. Apparently, the more
floating-point operations, the higher the SES. However, the
FPU here shows very low sensitivities. The reason for that is
as follows. FIL 84 is another example of the case in Figure
4(b) in asserting the start and end signals of floating-point
operations. The output of FIL 85 is valid only for one or
two cycles on completion of a long execution (up to 2000
cycles) of a floating-point instruction. It is quite unlikely for
[Plot: x-axis FILs 84-109; series: Max (Bubble, Crypt, Linpack, Queens); group labels: MEMC, SMU, FPU]
Figure 9. The SES of the remaining FUBs.
the SES of the FPU in the worst situation, we could have
injected the faults when the output is active. In that case,
the SES would be much higher. Separate fault injections
into smaller logic blocks of FIL 85 are needed to understand
error manifestation behavior at its output.
The SMU shows zero SES in all FILs except FIL 92.
It handles spills and fills of the SC and manages execution
pipeline in overflow and underflow conditions. An over-
flow occurs if the top of the stack pointer is smaller than
the bottom of stack pointer. In comparison to other oper-
ations, the overflow frequency is very low. An underflow
can be activated only in response to changes at the top of
the stack pointer initiated by the return instructions, which
seldom occur. For these reasons, faults are hardly effective
in the SMU. FIPs that are randomly selected in our test did
not coincide with the occasional SMU active cycles.
Even if the FID is 500 (Figure 6), FILs 90, 91, 94, and 96
show no change in the SES. When FIL 90 asserts a write
enable, the output of FIL 89 is latched. Since the state of
FIL 89 does not vary often, additional writes by an erroneous
FIL 90 usually have no special impact. FIL 91 generates
pipeline stalls. Holding the pipeline for extra cycles
introduces an operation delay, but not a process failure. FIL 94
requests loads solely on an SC fill or an underflow. FIL
96 tells the DCU that a data request is a non-allocate store.
This has no effect in the case of a D-cache hit, but on a
miss, the data item is directly written to the memory.
Erroneous switching by FIL 96 between caching and non-caching
data cannot alter the correctness of the computation unless it
caches data from a non-cacheable address region. Based on
these observations, the SMU can be the last FUB to consider
for integrity checking. Since the SMU mainly consists of
random logic, the lifespan of errors in the SMU caused by
temporary hardware failures is short.
FIL 97 requests the IU to take a trap when there is an
interrupt by an external device. The trap is, however, not taken
when the interrupt mask bit of the processor state register is
0, making faults ineffective. FIL 99 signals a power-down
to other FUBs. Unless the IU executes a power-down
instruction, indicated by FIL 24, no response is made.
Accordingly, its SES is zero. Erroneously triggering a standby
when the FUBs are not ready leads the process to fail,
whereas not going into the standby mode even at the right
time does not affect normal execution. Considering this
unique characteristic, the power-down control logic needs
protection from the former case only. The PCSU is placed
in a fault-tolerant position like that of Figure 4(b).
As clearly seen in the figure, handshaking signals by
FILs 100, 101, and 109 are error-sensitive during memory
transactions between the BIU and the caches or the BIU
and MEMC. A false bus acknowledgment causes the
processor to hang or crash. Other FILs except FIL 107 are
less sensitive because their faulty signals neither initiate nor
terminate bus transactions. Data errors directly propagate
through memory transactions. Thus, ensuring the integrity
of handshaking activity along with data checking is critical
and qualifies for protection redundancy.
FUB-level SES: Figure 10 visualizes our SES estimate
of the picoJava-II. The average SES of each FUB is
quantized into one of 10 shade levels after weighting the
areas and sensitivities of FILs in the FUB. A higher SES is
Figure 10. Estimated SES visualization.
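The FUB-level estimate described above can be sketched as an area-weighted average followed by quantization. The exact weighting formula and the bin edges are not given in the text, so the version below is an assumption: a plain area-weighted mean, mapped onto 10 equal-width levels. The input values are invented.

```python
from typing import List, Tuple

def fub_ses(fils: List[Tuple[float, float]]) -> float:
    """Area-weighted average SES of a FUB.

    `fils` is a list of (area, ses) pairs, one per FIL in the FUB.
    """
    total_area = sum(area for area, _ in fils)
    return sum(area * ses for area, ses in fils) / total_area

def shade_level(ses: float, levels: int = 10) -> int:
    """Quantize an SES in [0, 1] into one of `levels` equal-width bins,
    numbered 0 (lowest) to levels - 1 (highest)."""
    return min(levels - 1, int(ses * levels))

# Hypothetical FUB: a small, fully sensitive FIL and a large insensitive one.
example_fub = [(2.0, 1.0), (6.0, 0.0)]
average = fub_ses(example_fub)
print(average, shade_level(average))
```

Weighting by area means a small but highly sensitive sub-block can be hidden inside a large insensitive FUB, so a map of this kind complements the per-FIL results rather than replacing them.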
6. Summary
We have conducted fault injection simulations on a commercial
product, picoJava-II, to characterize the soft error
sensitivities of its components. We summarize our analysis
as follows: (1) Many processor components are seldom
susceptible to soft errors, while there are a few components
with a very high SES. The SES of a component is
determined by its architectural function; its logical position,
governing processor behavior directly or in collaboration
with other components; and its active cycles and frequency.
A logic block may have an inherent capability of fault
masking to some degree. (2) Workload variation does not
convert a critical component to a non-critical component in
terms of SES, or vice versa. Although minor changes in the
SES level and the pattern of failures can be induced by
different workloads, the protection order among components,
from first to last, remains unchanged. As a result, the SES is
an effective standard for prioritizing protection and integrity
checking of various processor components to enhance
dependability. (3) Soft errors in control logic generally have a
shorter lifetime than those in the memory arrays. Faults in the
control of program flow and bus transactions are more often
effective than those in other controls. Therefore, protection
can start with memory, which is already common in many
microprocessors, and then include critical random logic.
(4) The sensitivities of many components are fairly predictable
from processor architecture and organization. The amount of
effort to estimate the SES data is adjustable with a varying
degree of accuracy. Different levels in the processor model
hierarchy can be selected for the estimation. This enables the
designer to meet time-to-market requirements while achieving
a resource-efficient increase in processor dependability.
Our study demonstrates that fault injection in RTL mod-
els is a very effective method to improve the efficiency of
the protection techniques in an early development stage.
Acknowledgments: This work was funded in part by
Carver Trust Grants from Iowa State University. The au-
thors would like to thank Jon E. Froehlich for verifying our
simulator and benchmark programs.
References
[1] J. Sosnowski, “Transient fault tolerance in digital systems,”
IEEE Micro, 14(1):24-35, Feb. 1994.
[2] D. Siewiorek and R. Swartz,Reliable Computer Systems:
Design and evaluation, A K Peters, Natick, MA, 1998.
[3] S. Kim and A. K. somani, “Area efficient architectures for
information integrity in cache memories,”ISCA, pp. 246-
255, 1999.
[4] J. Clark and D. Pradhan, “Fault injection: a method for val-
idating computer-system dependability,”IEEE Computer,
28(6):47-56, June 1995.
[5] M. Hsueh, T. Tsai, and R. Iyer, “Fault injection techniques
and tools,”IEEE Computer, 30(4):75-82, April 1997.
[6] S. Chau, “Fault injection boundary scan design for verifica-
tion of fault tolerant systems,”Proc. ITC, pp. 667-682, 1994.
[7] C. Constantinescu, “Teraflops supercomputer: architecture
and validation of the fault tolerance mechanisms,”IEEE
Trans. Computers, 49(9):886-894, Sept. 2000.
[8] J. Karlsson et al., “Using heavy-ion radiation to validate
fault-handling mechanisms,”IEEE Micro, 14(1):8-23, 1994.
[9] G. Miremadi and J. Torin, “Evaluating processor-behavior
and three error-detection mechanisms using physical fault-
injection,” IEEE Trans. Reliability, 44(3):441-454, 1995.
[10] U. Gunneflo, J. Karlsson, and J. Torin, “Evaluation of error
detection schemes using fault injection by heavy-ion radia-
tion,” Proc. FTCS-19, pp. 340-347, 1989.
[11] G. Miremadi et al., “Two software techniques for on-line
error detection,”Proc. FTCS-22, pp. 328-335, 1992.
[12] W. Moreno et al, “First test results of system level fault tol-
erant design validation through laser fault injection,”ICCD,
pp. 544-548, 1997.
[13] J. Samson et al, “A technique for automated validation of
fault tolerant designs using laser fault injection (LFI),”Proc.
FTCS-28, pp. 162-167, 1998.
[14] G. Kanawati et al., “FERRARI: a flexible software-based
fault and error injection system,”IEEE Trans. Computers,
44(2):248-260, Feb., 1995.
[15] P. Cheynet et al., “Experimentally evaluating an automatic
approach for generating safety-critical software with re-
spect to transient errors,”IEEE Trans. Nuclear Science,
47(6):2231-2236, Dec., 2000.
[16] J. Aidemark et al., “GOOFI: generic object-oriented fault
injection tool,” Proc. Conf. Dependable Systems and Net-
works, pp. 83-88, 2001.
[17] V. Sieh et al., “VERIFY: evaluation of reliability using
VHDL-models with embedded fault descriptions,”Proc.
FTCS-97, pp. 32-36, 1997.
[18] E. W. Czeck and D. P. Siewiorek, “Observations on the ef-
fects of fault manifestation as a function of workload,”IEEE
Trans. Computers, 41(5):559-566, 1992.
[19] S. Kim and A. K. Somani, “On-line integrity monitoring of
microprocessor control logic,”ICCD, pp. 314-319, 2001.
[20] P. Duba and R. Iyer, “Transient fault behavior in a
microprocessor-a case study,”ICCD, pp. 272-276, 1988.
[21] E. W. Czeck and D. P. Siewiorek, “Effects of transient gate-
level faults on program behavior,”FTCS, pp. 236-243, 1990.
[22] M. Rimen and J. Ohlsson, “A study of the error behavior of a
32-bit RISC subjected to simulated transient fault injection,”
Proc. Int’l Test Conf., pp. 696-704, 1992.
[23] G. Choi at el., “Fault behavior dictionary for simulation of
device-level transients,”Proc. Conf. Computer-Aided De-
sign, pp. 6-9, 1993.
[24] N. Kanawati et al., “Dependability evaluation using hybrid
fault/error injection,” Proc. Symp. Computer Performance
and Dependability, pp. 224-233, 1995.
[25] J. Guthoff and V. Sieh, “Combining software-implemented
and simulation-based fault injection into a single fault injec-
tion method,”Proc. FTCS-25, pp. 196-206, 1995.
[26] E. Jenn et al., “Fault injection into VHDL models: the
MEFISTO tool,”Proc. FTCS-24, pp. 66-75, 1994.
[27] A. Amendola et al., “Fault behavior observation of a micro-
processor system through a VHDL simulation-based fault
injection experiment,”Proc. Conf. EURO Design Automa-
tion, pp. 536-541, 1996.
[28] D. Gil et al., “Fault injection into VHDL models: analysis
of the error syndrome of a microcomputer system,”Proc.
FTCS-28, pp. 418-424, 1998.
[29] J. Gaisler, “Evaluation of a 32-bit microprocessor wit built-
in concurrent error-detection,”FTCS-27, pp. 42-46, 1997.
[30] C. Kouba and G. Choi, “The single event upset character-
istics of the 486-DX4 microprocessor,”Proc. Radiation Ef-
fects Data Workshop, pp. 48-52, 1997.
[31] H. McGhan and M. O’Connor, “PicoJava: a direct execution
engine for Java bytecode,”IEEE Computer, 32(10):22-30,
Oct. 1998.
[32] “Community source licensing,” Sun Microsystems,
http://www.sun.com/processors/communitysource.
[33] S. Dey et al., “Using a soft core in a SoC design: experiences
with picoJava,”IEEE Design & Test of computers, 17(3):60-
71, July-Sept. 2000.
[34] “The java grande forum benchmark suite,” EPCC,
http://www.epcc.ed.ac.uk/javagrande/javag.html.
