Abstract-With the scaling of complementary metal-oxide-semiconductor (CMOS) technology to the submicron range, designers have to deal with a growing number and variety of fault types. In this way, intermittent faults are gaining importance in modern very large scale integration (VLSI) circuits. The presence of these faults is increasing due to the complexity of manufacturing processes (which produce residues and parameter variations), together with special aging mechanisms. This work presents a case study of the impact of intermittent faults on the behavior of a reduced instruction set computing (RISC) microprocessor. We have carried out an exhaustive reliability assessment by using very-high-speed-integrated-circuit hardware description language (VHDL)-based fault injection. In this way, we have been able to modify different intermittent fault parameters, to select various targets, and even, to compare the impact of intermittent faults with those induced by transient and permanent faults. 
Effects of Intermittent Faults on the Reliability of a Reduced Instruction Set Computing (RISC) Microprocessor
Joaquín Gracia-Morán, J. Carlos Baraza-Calvo, Daniel Gil-Tomás, Luis J. Saiz-Adalid, and Pedro J. Gil-Vicente, Member, IEEE
Index Terms-Fault injection, hardware description languages, integrated circuit reliability, intermittent faults, reduced instruction set computing (RISC) microprocessor. 
0018-9529 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
I. INTRODUCTION
IN RECENT years, the reduction in transistor size has allowed microprocessor speed to increase and microprocessor size and supply voltage to decrease, but at the cost of a higher incidence of faults [1], [2]. This reduction causes a higher rate of transient faults, commonly provoked by temporary environmental conditions such as electromagnetic interference, or cosmic or internal radiation. Radiation may now even affect multiple locations. Also, changes in the manufacturing processes have increased the rate of permanent faults. This type of fault is produced by irreversible physical changes in a chip. Recently, intermittent faults have emerged as a new source of trouble in deep submicron integrated circuits [3], [4]. Historically, intermittent faults were considered a prelude to permanent faults. Wear out processes of an integrated circuit (IC) usually provoke permanent faults, which initially manifest intermittently. Nevertheless, the introduction of new deep submicron technologies makes it necessary to study new causes and mechanisms of intermittent faults.
Over the last few years, the effects of intermittent faults in real systems have been analyzed. The failures produced were monitored to determine the most frequent sources of errors and their manifestation [5] - [10] . However, the long observation time necessary to perform this type of study suggests the use of new techniques to accelerate the fault occurrence.
Fault injection is a common method to assess the reliability of computer systems [11] - [15] . This technique allows a controlled introduction of faults in the system. In this way, the time until the appearance of real faults in the system is shortened. Fault injection techniques can be classified into three main categories [16] : physical (or Hardware Implemented Fault Injection (HWIFI)), software implemented (SWIFI), and simulation-based (SBFI).
Simulation-based fault injection has proven to be a good technique to study the impact of intermittent faults [17] - [20] . It is a useful experimental way to evaluate the dependability of a system during the design phase. An early diagnosis allows saving costs in the design process, avoiding redesigning in case of error, and thus reducing time-to-market.
Two important issues when using simulation-based fault injection are the accuracy of the system model, and the representativeness of the fault models. Regarding this last question, in previous works we have studied some representative causes and mechanisms related to intermittent faults. From this study, we have generated a set of intermittent fault models at logic and register transfer (RT) abstraction levels which can be injected into very-high-speed-integrated-circuit hardware description language (VHDL) models [21] .
The objective of this work is twofold: i) to study the impact of intermittent faults in a reduced instruction set computing (RISC) microprocessor, and ii) to compare the consequences of intermittent faults with the effects caused by permanent and transient faults. To carry out the fault injection experiments, we have used VHDL-based Fault Injection due to its flexibility, as well as high observability and controllability of the model components [16] . This paper complements previous works published by the authors [18] , [21] , where the impact of intermittent faults on a commercial complex instruction set computing (CISC) microcontroller was analyzed.
The paper is organized as follows. Section II describes the fault injection environment, including a summary of the intermittent fault models applied, a brief description of the intermittent fault parameters, and an outline of the VHDL-based fault injection techniques. Section III depicts the fault injection experiments. Section IV includes a selection of the results. Finally, Section V provides some conclusions.
II. FAULT INJECTION ENVIRONMENT
A. Intermittent Fault Models
Transient and permanent fault models have been traditionally well established, whereas modeling intermittent faults is a pending issue [3] . The most popular fault model for permanent faults is stuck-at, while bit-flip is normally used for transient faults [16] .
Intermittent faults occur due to unstable or marginal hardware. They can be activated by an environmental change such as temperature or voltage alterations. Manufacturing residues, process variations, and special wear out processes can also lead to such faults. The introduction of new deep submicron technologies makes it necessary to study new fault causes and mechanisms of intermittent faults. Table I summarizes some representative physical causes and fault mechanisms of intermittent faults, as well as the fault models proposed in every case [21] . The table tries to unify, classify, and relate the different fault sources. It shows intermittent fault models for buses, storage elements, input/output connections and combinational logic. These fault models are defined at logic and RT abstraction levels. More information can be found in [18] , [21] .
B. Intermittent Fault Parameters
Intermittent faults manifest as occasional bursts that typically repeat from time to time, and whose effects are not continuous. Also, intermittent faults occur repeatedly in the same places [8]. The duration of intermittent faults is not constant. Instead, it depends on variable aspects such as the manufacturing process, the environment, and the wear out process. For this reason, the number of times that the fault is active during a burst, the duration of each activation, and the separation between activations have been defined as parameters of the intermittent fault models [22]. Fig. 1 explains the burst parameters.
C. Fault Injection Techniques
Fig. 2 shows the classification of the different VHDL-based fault injection techniques [16]. With simulator commands, it is possible to change, at simulation time, the value or the timing of the signals and variables of the system. Saboteurs and mutants modify the VHDL code of the system by inserting injection components (saboteurs) or activating mutated versions of the existing components (mutants). Although these two techniques are more complex to apply, and introduce more spatial and temporal overhead than simulator commands, they allow injecting more complex fault models. Other techniques extend the syntax and semantics of the VHDL language.
We have injected faults using a tool developed by our research group called VFIT (VHDL-based Fault Injection Tool) [16] . VFIT is able to inject faults automatically applying simulator commands, saboteurs, and mutants techniques. The different injection experiments presented in this paper have been carried out with simulator commands, as all fault models selected can be injected by this technique, and their application implies lower temporal and spatial overhead.
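As a rough illustration of why simulator commands suit intermittent fault models, the sketch below turns a list of activation windows into a time-ordered sequence of force and release actions on one signal. It is only an outline under stated assumptions: the action names, the `InjectionEvent` type, and the signal path are hypothetical, and it reproduces neither VFIT nor the command syntax of any particular simulator.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class InjectionEvent:
    """One scheduled action on a model signal (hypothetical representation)."""
    time_ns: float
    action: str       # "force" or "release" (placeholder action names)
    target: str       # hierarchical signal name in the model (placeholder)
    value: str = ""   # forced logic value; empty for "release"


def stuck_at_events(target: str, value: str,
                    windows: List[Tuple[float, float]]) -> List[InjectionEvent]:
    """Build force/release events for an intermittent stuck-at fault.

    Each (start, end) window is one activation of the burst: the signal is
    forced to `value` at `start` and handed back to the model at `end`.
    """
    events = []
    for start, end in windows:
        if end <= start:
            raise ValueError("each activation window needs a positive duration")
        events.append(InjectionEvent(start, "force", target, value))
        events.append(InjectionEvent(end, "release", target))
    # Sort by time so the list can be replayed in simulation order.
    return sorted(events, key=lambda e: e.time_ns)


if __name__ == "__main__":
    # Two activations of a stuck-at-0 on a made-up bus wire.
    for ev in stuck_at_events("cpu/data_bus(3)", "0", [(100.0, 120.0), (150.0, 160.0)]):
        print(ev)
```

Because the model itself is not edited, the same event list could be replayed against any target signal, which is consistent with the low overhead attributed to simulator commands above.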
III. EXPERIMENTS DESCRIPTIONS
The main purpose of the study presented in this work is to analyze the influence of intermittent faults on the behavior of a RISC microprocessor. The target system is the Plasma microprocessor [23]. It has a 32-bit microprocessor without interlocked pipeline stages (MIPS) architecture with a four-stage pipeline. The VHDL model of Plasma is described at RT and logic abstraction levels.
To exercise the main elements of the microprocessor (memory, registers, buses, arithmetic and logic unit (ALU) and control unit (CU)), the bubblesort sorting algorithm has been used. In this way, we have injected intermittent faults into the storage elements (the register bank, and the random-access memory (RAM)), the buses, and the combinational logic of the ALU and CU. Fig. 3 shows the structure of the Plasma core, and the injection targets.
The eight main injection parameters are explained next.
A. Fault Multiplicity
Due to technology scaling, intermittent faults will likely affect multiple locations [7] . These multiple locations may be adjacent (i.e., neighbor cells in register and memory, neighbor wires in a bus, etc.), or non-adjacent. Thus, we have injected both single, and multiple faults.
During the configuration phase of each experiment, we have considered two aspects.
• The first aspect is the number of faults in non-adjacent locations. We have generated the number of non-adjacent targets using a uniform distribution function over a range bounded by the total number of non-adjacent locations.
• The second aspect, in multiple-bit targets, is the number of adjacent locations. In this case, we have applied a uniform distribution function over a range bounded by the number of adjacent locations.
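The two-step selection described above can be sketched as follows. This is a minimal illustration, not the tool's actual procedure: the lower bound of 1 in both ranges, the function name `pick_multiple_fault_sites`, and the target names are assumptions, since the exact ranges are not recoverable from the text.

```python
import random
from typing import Dict, List


def pick_multiple_fault_sites(targets: Dict[str, int], max_targets: int,
                              max_adjacent: int,
                              rng: random.Random) -> Dict[str, List[int]]:
    """Choose fault sites for one multiple-fault experiment.

    targets      maps a target name to its width in bits.
    max_targets  upper bound on the number of non-adjacent targets (assumed range 1..max).
    max_adjacent upper bound on adjacent bits per target (assumed range 1..max).
    """
    # Step 1: number of non-adjacent targets, uniformly distributed.
    n_targets = rng.randint(1, min(max_targets, len(targets)))
    chosen = rng.sample(sorted(targets), n_targets)

    sites = {}
    for name in chosen:
        width = targets[name]
        # Step 2: number of adjacent bits inside this target, uniformly distributed.
        n_adjacent = rng.randint(1, min(max_adjacent, width))
        first_bit = rng.randint(0, width - n_adjacent)
        sites[name] = list(range(first_bit, first_bit + n_adjacent))
    return sites


if __name__ == "__main__":
    rng = random.Random(2014)
    model = {"data_bus": 32, "address_bus": 32, "reg_bank_r5": 32, "alu_flag": 1}
    print(pick_multiple_fault_sites(model, max_targets=3, max_adjacent=4, rng=rng))
```

A single fault is the special case where both bounds are set to 1.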
B. Fault Types
Intermittent, transient, and permanent faults have been injected.
C. Fault Models
According to Section II.A, three intermittent fault models have been injected:
• [16] , [24] has been injected in combinational targets.
For permanent faults, we have injected stuck-at(0,1), open, and indetermination fault models [16] , [24] .
We have not injected time-related faults (such as the intermittent delay fault model, see Table I ) due to the lack of temporal specifications in the VHDL model. This approach is typical in high level system models, as delays are introduced in the implementation phase, after place and route.
D. Burst Parameters (For Intermittent Faults)
As mentioned in Section II.B, intermittent faults manifest in bursts. Three parameters must be configured to inject them (see Fig. 1 ):
• the burst length,
• the activity time, and
• the inactivity time.
We have generated the values of these three parameters using uniform random distribution functions. The burst length follows a discrete uniform distribution in the range [1, 10].
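To make the three burst parameters concrete, the following sketch draws one burst and returns its activation windows. It is a simplified illustration under assumptions: `T` is taken to be the clock period, the [0.1T, 1.0T] range is the intermediate range mentioned later in the results, each activation gets its own random duration, and the function name `generate_burst` is hypothetical.

```python
import random
from typing import List, Tuple


def generate_burst(rng: random.Random, start: float, t_min: float, t_max: float,
                   max_length: int = 10) -> List[Tuple[float, float]]:
    """Return the (begin, end) window of every activation in one burst.

    Burst length: discrete uniform in [1, max_length].
    Activity and inactivity times: continuous uniform in [t_min, t_max].
    """
    burst_length = rng.randint(1, max_length)
    windows = []
    t = start
    for _ in range(burst_length):
        activity = rng.uniform(t_min, t_max)     # fault is active
        windows.append((t, t + activity))
        inactivity = rng.uniform(t_min, t_max)   # fault is dormant before the next activation
        t += activity + inactivity
    return windows


if __name__ == "__main__":
    T = 40.0  # assumed clock period in ns, for illustration only
    rng = random.Random(1)
    for begin, end in generate_burst(rng, start=1000.0, t_min=0.1 * T, t_max=1.0 * T):
        print(f"active from {begin:8.2f} ns to {end:8.2f} ns")
```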
E. Fault Duration (For Transient Faults)
The values of this parameter have been obtained using uniform random distribution functions in three time ranges: 
F. Injection Instant
We have generated this parameter using a uniform distribution function over the workload duration. 
G. Number of Faults Injected
To obtain a statistically reliable sample, we have injected 1,000 faults per experiment, so that more than 125,000 faults have been injected in total.
H. Measures Obtained
To measure the impact of intermittent faults, we have calculated in every experiment the following percentages.
• Percentage of failures:
(1)
• Percentage of latent errors:
• Percentage of non-effective errors:
Latent errors and failures are detected by comparing the trace of every fault-injected simulation with a golden run (i.e., the trace obtained from simulating without faults). Fig. 4 summarizes the fault syndrome, and the calculated data.
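The comparison against the golden run can be sketched as below. This is an assumption-based outline rather than the tool's implementation: splitting each trace into workload outputs and internal state, and the names `classify_run` and `percentages`, are hypothetical. Each percentage is simply the count of runs in that class divided by the number of injected faults, times 100.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def classify_run(golden_outputs: Sequence, faulty_outputs: Sequence,
                 golden_state: Sequence, faulty_state: Sequence) -> str:
    """Classify one fault-injected simulation against the golden run."""
    if list(faulty_outputs) != list(golden_outputs):
        return "failure"          # the workload result is wrong
    if list(faulty_state) != list(golden_state):
        return "latent"           # result is correct, but internal state differs
    return "non_effective"        # no observable difference


def percentages(classes: Iterable[str]) -> Dict[str, float]:
    """Percentage of failures, latent errors, and non-effective errors."""
    classes = list(classes)
    if not classes:
        raise ValueError("at least one injected fault is required")
    counts = Counter(classes)
    return {name: 100.0 * counts[name] / len(classes)
            for name in ("failure", "latent", "non_effective")}


if __name__ == "__main__":
    golden_out, golden_mem = [1, 2, 3], [0, 0, 0, 0]
    runs = [
        ([1, 2, 3], [0, 0, 0, 0]),   # no effect
        ([1, 2, 3], [0, 1, 0, 0]),   # corrupted cell never read by the workload
        ([1, 3, 2], [0, 0, 0, 0]),   # wrong sorted output
        ([1, 2, 3], [0, 0, 0, 1]),
    ]
    labels = [classify_run(golden_out, out, golden_mem, mem) for out, mem in runs]
    print(percentages(labels))
```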
IV. RESULTS
This section is divided into four parts. Section IV.A analyzes the influence of burst parameters. Section IV.B studies the influence of the injection target. Section IV.C compares the impact of intermittent faults to that of transient and permanent faults. Finally, Section IV.D compares these results with other works where intermittent faults are injected in different microprocessors.
A. Influence of Burst Parameters
1) Influence of the Activity and Inactivity Times:
Fig. 5 represents the impact of intermittent faults in the storage elements. Regarding single faults (Fig. 5(a)), the percentage of failures is very low, while the percentage of latent errors is high. This result is due to two reasons: i) faults in critical registers mainly provoke failures, statistically independently of the number of activations; and ii) faults in memory mostly cause latent errors because faults are injected randomly in all the memory space, and the workload occupies a very small portion of the memory. On the other hand, as the memory is much bigger than the register bank, the overall behavior tends to that of the memory.
The same trend is observed in multiple faults (Fig. 5(b)). In this case, as expected, both percentages (failures and latent errors) are higher, with values of about 8% and 75%, respectively. This behavior is predictable, as multiple faults simultaneously affect several physical locations of the system.
It is important to emphasize that, in both cases (single and multiple faults), no significant changes are observed when varying the activity and inactivity times. That is, neither the duration of the activations nor their separation seems to affect the percentages of failures and latent errors in storage elements.
Fig. 6 shows the effects of intermittent faults in buses. As the activity time grows,
• the percentage of failures grows appreciably;
• the percentage of latent errors decreases, because faults with longer activity times provoke failures rather than latent errors; and
• in general, the system is more affected, as the percentage of non-effective errors decreases.
Regarding fault multiplicity, almost all multiple faults affect the system, provoking failures or latent errors (the percentage of non-effective errors is under 2%). A noticeable increase in failures is observed. For the largest activity times, Fig. 6(b) shows percentages of failures of over 90%.
Last, Fig. 7 illustrates the effects of intermittent faults in the combinational logic. As in buses, increasing the activity time provokes a clear rise in the percentage of failures. In single faults, the percentages of failures and latent errors are smaller than in buses. This result is due to the masking mechanisms existing in combinational logic. As a consequence, the percentage of non-effective errors for combinational logic is larger than for buses. Regarding multiple faults, the values of are similar to those obtained for the buses, whereas the values of for combinational logic are smaller than for buses, except for the longest activity times. Also, as in buses and storage elements, the inactivity time does not have any influence on the results.
In general, we can observe that buses are the most sensitive targets for intermittent faults. Nevertheless, intermittent faults in combinational logic present a non-negligible impact. In multiple faults, and for large activity times, their impact can be similar to that in buses.
On the other hand, intermittent faults in the storage elements provoke mainly latent errors, with a very low percentage of failures. Unexpectedly, the activity time has no influence on the results. This behavior is due to both the absence of masking effects in the propagation and the huge number of cells. It is especially visible in memories, where perturbed cells may not be accessed again by the workload.
Unexpectedly, the inactivity time does not present a significant influence. A deeper analysis has shown that varying the separation between activations does not change the total number of activations in the bursts, because our particular workload is long enough to fit all activations. Nevertheless, in a general case, this parameter is expected to gain importance because: i) it may affect the system behavior, as it can influence the number of activations, and ii) in a fault-tolerant system, the separation between activations can affect the detection and recovery latencies.
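The argument about the workload length can be checked with a few lines of arithmetic. The sketch below counts how many activations of a burst start before the workload ends; the function name and the numeric values are hypothetical and serve only to illustrate the reasoning, using fixed activity and inactivity times for simplicity.

```python
def activations_within_workload(injection_instant: int, burst_length: int,
                                t_active: int, t_inactive: int,
                                workload_end: int) -> int:
    """Count the activations of one burst that begin before the workload ends."""
    count = 0
    start = injection_instant
    for _ in range(burst_length):
        if start >= workload_end:
            break                       # remaining activations fall outside the workload
        count += 1
        start += t_active + t_inactive  # next activation begins after the gap
    return count


if __name__ == "__main__":
    # A long workload fits the whole burst regardless of the separation ...
    for gap in (1, 10):
        print(gap, activations_within_workload(100, 5, 1, gap, workload_end=1000))
    # ... but a large enough separation truncates the burst.
    print(300, activations_within_workload(100, 5, 1, 300, workload_end=1000))
```

With these made-up numbers, separations of 1 and 10 time units both yield all 5 activations, whereas a separation of 300 leaves only 3 inside the workload, which is the general case in which the inactivity time would start to matter.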
As expected, multiple faults are much more harmful than single faults. This behavior is predictable, as multiple faults simultaneously affect several physical locations of the system. Percentages of failures over 90% can be seen for intermittent multiple faults in buses and combinational logic. Also, in these two targets, the activity time notably influences the percentage of failures. In particular, a roughly logarithmic dependency can be appreciated (note that the scale of the activity time is logarithmic).
2) Influence of the Burst Length:
Fig. 8 shows the results obtained when varying the burst length from 1 to 10, with the activity and inactivity times defined randomly in the intermediate range [0.1T-1.0T] using a uniform distribution function. The figure shows the results for single and multiple faults, and for the three targets: storage elements, buses, and combinational logic.
As expected, multiple faults provoke more failures and latent errors than do single faults for all targets.
With respect to the storage elements (Fig. 8(a) & (b)), the percentage of failures presents a nearly constant behavior, with small variations between 6% and 9%. This result is due to several factors.
• Faults affecting critical registers provoke a failure in the very first activations (i.e., for lower values of the burst length), so the total burst length does not matter.
• Faults affecting non-accessed memory cells only cause latent errors, but not failures, even in the presence of multiple activations.
Regarding latent errors, results show that injecting intermittent faults (single or multiple) provokes mainly latent errors. In multiple faults, we can observe a uniform behavior, with the percentage of latent errors varying between 70% and 77%. Regarding single faults, we can observe a gap between burst lengths of 4 and 5, with almost constant values in each interval. In any case, the percentages of latent errors are lower. In short, the burst length does not have much influence in storage elements. As faults directly affect the storage cells, errors that occurred in the very first activations remain latent.
In buses (Fig. 8(c) & (d)), the percentage of failures rises roughly asymptotically, and can be approximated by an exponential dependency. In single faults, it grows up to 39%, while in multiple faults, the percentage of failures rises up to 75%. In this case, the burst length has a clear influence on the percentage of failures. Propagated faults increase the probability of failures because the number of activations in the same bus wires increases. From a certain number of activations, the growth is slower, and tends to stabilize. Concerning latent errors in buses, in multiple faults, the percentage of latent errors decreases as the burst length increases, because larger burst lengths increase the percentage of failures. In single faults, the percentage of latent errors is almost constant.
On the other hand, combinational logic (Fig. 8(e) & (f)) and buses behave similarly, although combinational logic presents lower percentages of failures. In single faults, the percentage of failures grows up to about 28%, and in multiple faults it grows to 73%. In this type of target, the masking mechanisms are stronger, and thus, as a general trend, the percentages of failures are smaller than in buses. As for latent errors, as the burst length increases, their percentage grows slightly in single faults, and decreases in multiple faults. In this last case, as happened in buses, larger burst lengths cause higher percentages of failures. This increase in failures provokes a decrease in latent errors.
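One simple way to see why the percentage of failures would saturate as the burst length grows is a toy probabilistic model, sketched below. It is not the expression fitted in this work: it assumes, purely for illustration, that each activation independently causes a failure with a fixed hypothetical probability `p`, so that the chance of at least one failure in a burst of `n` activations is 1 - (1 - p)^n, which is equivalent to the exponential form 1 - exp(-k n) with k = -ln(1 - p).

```python
def failure_probability(per_activation_p: float, burst_length: int) -> float:
    """Probability of at least one failure in a burst, assuming independent activations."""
    if not 0.0 <= per_activation_p <= 1.0:
        raise ValueError("per_activation_p must be a probability")
    if burst_length < 0:
        raise ValueError("burst_length must be non-negative")
    return 1.0 - (1.0 - per_activation_p) ** burst_length


if __name__ == "__main__":
    p = 0.1  # hypothetical per-activation failure probability
    previous = 0.0
    for n in range(1, 11):
        current = failure_probability(p, n)
        # The gain from each extra activation shrinks, so the curve flattens.
        print(f"burst length {n:2d}: {100 * current:5.1f}%  (+{100 * (current - previous):4.1f})")
        previous = current
```

Under this assumption every additional activation adds less than the previous one, which matches the qualitative trend of fast initial growth followed by stabilization described above.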
B. Influence of the Injection Target
From the previous results, we note several points.
• Intermittent faults in buses are very harmful, as buses are used extensively in the execution of microprocessor instructions.
• Combinational logic is less sensitive, although the impact of intermittent faults can be notable for high values of the activity time and the burst length. The masking mechanisms of this type of logic reduce the percentage of failures.
• Intermittent faults in registers provoke a high percentage of failures, because they store intermediate results when executing an instruction. Instead, faults in memory manifest mainly as latent errors. 
C. Comparison to Transient and Permanent Faults
In this section, we compare the effects of transient, intermittent, and permanent faults in all targets. In these experiments, for intermittent faults has been generated in the intermediate range. Due to their physical nature, the fault duration of transient faults in the storage elements (bit-flip) is not meaningful, and thus has not been specified.
Fig. 9 shows the results obtained in storage elements, where an apparently unexpected trend can be noticed. Transient faults provoke more failures and latent errors than intermittent faults. The reason for this result is that there is a low rate of overwrite operations in the memory cells affected by transient faults (bit-flips), due to both the memory size and the workload behavior. These faults present a de facto infinite duration, thus remaining stored permanently. Notice that the intermittent fault model injected in storage elements is the intermittent stuck-at (see Table I).
On the other hand, Fig. 10 introduces the results obtained in buses, while Fig. 11 presents the results obtained in combinational logic. In these graphs, the results are as expected. That is, transient faults provoke fewer failures than intermittent faults, as a burst of intermittent faults manifests like a sequence of transient faults in spite of having a different origin. As commented before, due to the masking mechanisms inherent to combinational logic, it presents smaller percentages of latent errors than buses. In any case, buses are more sensitive to all fault types, as their percentage of non-effective errors is lower than in the combinational logic.
In all targets, the greatest impact corresponds to permanent faults because of their infinite duration, although similar values are obtained for the longest activity times in intermittent faults. Figs. 10 and 11 also show an important dependency on the fault duration of transient faults, similar to the dependency of intermittent faults on the activity time.
D. Related Work
The present work completes the results presented in [18], [21], where the behavior of an 8051 microcontroller under the influence of intermittent faults is analyzed. Comparing these works with the results presented in this paper, both cores show similar general trends, although some differences have been observed. The Plasma microprocessor is more sensitive to intermittent faults in combinational logic. This result is due to the higher complexity of the Plasma processor in terms of combinational logic (multiplexers, multiplier-divider, and the memory controller). Also, more latent errors have been detected in the Plasma processor, especially caused by faults in the storage modules, mainly because the memory of the Plasma processor is bigger. Table II compares the effects of intermittent faults in the Plasma and 8051 cores, summarizing the impact of the different parameters studied in the previous sections. More results about the 8051 core can be seen in [22], [25].
In [26], the impact of transient and intermittent faults on application programs executed in a model of a simple five-stage pipeline RISC processor is compared. The study shows that transient and intermittent faults present substantial differences in the percentage of crashes (failures) caused in programs. It also verifies the important influence of: i) the origin of the intermittent fault (that is to say, the injection target); and ii) the fault length (in other words, its total duration). The length of an intermittent fault includes all the fault activations. Thus, the results in [26] are similar to those obtained in this paper.
On the other hand, [27] defines a new metric, called IVF (Intermittent Vulnerability Factor), to study the impact of intermittent faults in the internal blocks of microprocessors. For the injection experiments, they use a model of the Alpha 21264, a Digital Equipment Corporation (DEC) RISC microprocessor. The authors arrive at conclusions similar to those presented in this paper: longer activity times or longer bursts provoke more failures; faults in special registers cause a great impact on the system; and intermittent faults provoke more failures in the system than transient faults.
V. CONCLUSIONS
In this work, we have presented a case study of the effects of intermittent faults on the behavior of a RISC microprocessor. The impact of intermittent faults has also been compared with that provoked by transient and permanent faults. The methodology relies on the VHDL-based fault injection technique, which allows a systematic, exhaustive analysis of the influence of different fault parameters. From the study, some general trends can be extracted.
• The activity time is a quite important factor, as increasing the duration of the activations provokes a significant rise in the percentage of failures, especially in buses and combinational logic. A roughly logarithmic growth has been observed. The increase of the activity time is a trend in the intermittent faults caused by aging mechanisms. On the other hand, the inactivity time has not shown any significant effect because the duration of the workload was long enough to fit all activations.
• The burst length has also a notable influence. The percentage of failures grows asymptotically when increasing this parameter. The increase of the burst length is also an expected behavior in the intermittent faults provoked by aging mechanisms.
• Another important factor is the fault spatial multiplicity. Multiple faults provoke a much greater percentage of failures than single faults. This result is an important issue because it is expected that, as the feature size of the manufacturing process reduces in deep submicron technologies, the presence of multiple intermittent faults will grow.
• With respect to the injection target, we have found significant differences. Buses are the most sensitive targets to intermittent faults. On the other hand, the impact of faults in combinational logic is also important, even similar to that in buses when injecting multiple intermittent faults with higher activity times and burst lengths. It is foreseen that this impact will grow as the effect of masking mechanisms gets reduced in deep submicron technologies, provoking an increase in their sensitiveness to intermittent faults. Last, faults in memory provoke mainly latent errors, while faults in registers cause failures, even in the first activations of the intermittent fault.
• In buses and combinational logic, intermittent faults cause a greater percentage of failures than transient faults. Intermittent faults with long activity times present an impact similar to that of permanent faults, which are the most damaging faults. On the other hand, transient faults in storage elements (bit-flips) have shown a greater impact than intermittent faults. The bit-flip fault model lasts essentially forever, as the fault will never disappear unless the faulty cell is overwritten.
From the results obtained in this work, it seems necessary to add mitigation techniques to deliver fast error detection and correction of intermittent faults, mainly in buses and critical registers. In addition, mitigation techniques in combinational logic may be increasingly required, because the reduction of the transistor feature size in deep submicron technology reduces the effectiveness of the inherent masking mechanisms.
