Abstract
Introduction
Dependability analysis [1] has been a concern for Integrated Circuit (IC) designers and manufacturers since erroneous behaviors were first reported in space applications in the mid-1970s. Phenomena such as alpha-particle or heavy-ion strikes may lead to dramatic consequences, and their occurrence increases with technology downscaling [2]. It is thus mandatory to analyze, early in the design flow, the behavior of digital circuits employed in critical applications (avionics, automotive, etc.) when affected by these phenomena [3] [4] [5]. Fault-injection experiments have proven to be one of the most effective approaches for IC dependability evaluation [5] [6] [7] [8] [9]. The international standard IEC 61508 [10], which regulates the requirements of safety-related systems, highly recommends fault injection in all steps of the development process. Nevertheless, setting up a fault-injection environment is not trivial and requires tuning different parameters (e.g. the fault model, the fault list, the workload, the outputs used as readout points, and the way experimental results are interpreted) that can strongly influence the coherency and the meaningfulness of the final results.
Different fault-injection techniques have been proposed and used in the past. They can be grouped into three categories: (i) simulation-based, (ii) software-based, and (iii) hardware-based. Simulation-based fault injection [11] [12] [13] injects faults into a simulation model of the target system. It allows early and detailed dependability analysis and can be applied when a prototype is not yet available. Moreover, it allows modeling virtually any type of fault. However, it is very time-consuming and its effectiveness depends on the accuracy of the system model. Software-based fault injection [6] [7] [14] targets microprocessor-based systems and resorts to modifications of the software executed by the microprocessor to inject faults and to observe their effects. It significantly speeds up the fault-injection process. Finally, hardware-based fault injection uses hardware platforms (mainly FPGA based) to inject faults [15] [16] [17]. It notably speeds up the injection process w.r.t. simulation-based and software-based approaches; nevertheless, it requires a synthesizable model of the system and is sometimes difficult to apply.
This paper exploits coverage-driven functional verification to build an efficient simulation-based fault-injection environment. Coupling fault injection with functional verification makes it possible to overcome some of the drawbacks affecting existing environments, such as poor reusability or adaptability to different designs and/or fault models. Verification components available on the market can be easily reused as workloads to inject faults, obtaining design validation and reliability evaluation at the same time. Moreover, the use of standard verification languages enables an easy and configurable way to model injected faults and, by adopting dedicated techniques to collapse faults in the fault list, to reduce the fault list size and therefore the required injection time. Finally, coverage-driven functional verification allows workloads, operational profiles, fault list, and final measures to be uniquely correlated.
The proposed solution is intended to work with any functional verification EDA tool available on the market. In this paper, we will refer to Specman Elite ® by Cadence™ [18] (Specman for short in the remainder of the paper), together with the IEEE e Standard Verification Language [19], as a reference example. This research work has been performed in the framework of an implementation of a complete flow (analysis and validation with fault injection) to extend Failure Modes and Effects Analysis (FMEA) to Systems on Chip. This flow has then been used during the design of "robust" microcontrollers for automotive applications [20] [21].
The paper is organized as follows: Section 2 introduces the overall architecture of the fault-injection environment, Sections 3 to 7 detail each step required to set up the fault-injection flow, Section 8 gives some experimental results, and Section 9 concludes the paper.
The fault-injector architecture
This section gives an overview of the overall architecture (see Figure 1) of the proposed fault-injection environment. The main idea is to provide high flexibility and to allow reliability assessment at different stages of the design process and at different design levels. The fault injector is basically composed of a fault-free (Golden) and a faulty Device Under Test (DUT), simulated in parallel and sharing the same workload. They can be described using any type of hardware description language (e.g. VHDL, Verilog, etc.) provided with a simulator able to interface with the selected functional verification tool. The fault-injection flow starts from a list of sensible zones (SENS), i.e., sites of the DUT where faults are injected, and a set of observation/diagnostic points (OBSE/DIAG), i.e., sites where the effects of the injected faults are observed. They can be either internal nets or external pins of the DUT. An OBSE is a site where the result of an injection is measured in terms of the difference w.r.t. the corresponding point in the golden DUT. A DIAG is a site where the result of an injection is measured based on the occurrence (or not) of a given event. Typically, DIAGs are outputs of logic blocks inserted to increase the fault tolerance of the system. By monitoring these sites, it is possible to understand the detection/correction capability of those blocks. This initial input (i.e., the list of SENSs, OBSEs, and DIAGs) can be either provided by the user or obtained from a Failure Modes and Effects Analysis (FMEA) [21] [22], if available. The FMEA is a methodology to analyze potential dependability problems early in the development cycle, when it is easier to find solutions. It provides a list of potential failing points, failure modes, and a classification of hazards that is easy to translate into SENSs, OBSEs, and DIAGs. The Environment builder (see Figure 1
Fault model
The choice of the target fault model is a key point in setting up a fault-injection campaign. Real faults are influenced by different factors, such as the target system technology and the environmental working conditions, and they can be classified in many different ways. The flexibility of the IEEE e Standard Verification Language [19] adopted by Specman (or of any other verification language) can be exploited to model faults. It allows the user to describe complex faulty behaviors, not only at the gate level but also at higher abstraction levels (e.g. a glitch on a given data bus signal as a consequence of a given condition on the address bus). This is the first advantage gained from the use of the IEEE e Standard Verification Language together with Specman to build our fault-injection environment. Each fault is modeled with a function called by the fault-injection manager. The function receives different parameters depending on the selected fault model. Figure 2 shows the e code modeling a Single Event Upset (SEU). This simple code waits until the injection event (inj_event) becomes true and then flips the state of the target location inj_port.
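As a language-neutral illustration of the SEU model just described (the paper's own model is written in e and shown in Figure 2), the following Python sketch inverts one bit of a stored value at a chosen injection time. The names flip_bit and run_with_seu, and the representation of a register as a list of per-cycle values, are hypothetical and introduced only for this sketch.

```python
def flip_bit(value: int, bit: int, width: int = 8) -> int:
    """Return `value` with one bit inverted, as a single event upset would do."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside the register width")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def run_with_seu(states, inj_time, bit, width=8):
    """Copy a fault-free trace of register values and corrupt one sample.

    Assumption of this sketch: the trace is a list of per-cycle values and
    only the sample at index `inj_time` is affected by the upset.
    """
    faulty = list(states)
    faulty[inj_time] = flip_bit(faulty[inj_time], bit, width)
    return faulty


golden = [0x00, 0x0F, 0x0F, 0xA5]
faulty = run_with_seu(golden, inj_time=2, bit=0)
print([hex(v) for v in faulty])
```

Applying flip_bit twice to the same bit restores the original value, which is a convenient sanity check for any bit-flip fault model.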
Workload Generator
After selecting the target fault model, another challenge in performing fault-injection experiments is the identification of a meaningful workload to apply to the target DUT during the injections. It is possible to identify two different types of workloads:
• Mission-oriented: the experiments are performed to evaluate the dependability of the DUT when executing its "mission" application. In this case the workload can be either partially or totally fixed (it is the application itself, e.g. a piece of software), and parts of the system may not be considered in the experiments because they are not excited by the application;
• Device-oriented: the experiments are performed to evaluate the dependability of the device, regardless of its mission application. In this case the workload must be generated so as to functionally exercise all the parts of the device.
In the case of device-oriented fault injection, the facilities provided by a functional verification tool (i.e., Specman), and in particular by its test generation engine, again help in generating high-quality test benches. In fact, the goal of workload generation is to come up with a set of patterns able to activate all (or a subset of) the different parts of the target system in different possible ways. This is very similar to the goal pursued in a functional verification flow. For example, in Specman the test generation engine is based on a random generator that can be driven and constrained in a very flexible way, in order to explore corner cases and particularly critical situations. Moreover, it is possible to reuse functional verification test benches written using the IEEE e Standard Verification Language [19] or even different languages (including software). The completeness of the workload is automatically measured by using coverage monitors on sensible zones and observation points, as described in Section 7. A workload is considered complete if all the sensible zones are excited at least once and all the observation point monitors are triggered at least once.
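The completeness criterion stated above can be sketched in a few lines of Python. This is only an illustration under assumed data: the hit counters and the zone and channel names below are hypothetical, standing in for what the coverage monitors would report.

```python
def workload_is_complete(sens_hits, obs_hits):
    """True if every sensible zone and every observation monitor was hit at least once."""
    return all(n >= 1 for n in sens_hits.values()) and all(
        n >= 1 for n in obs_hits.values()
    )


def uncovered(hits):
    """Names of the monitored sites that were never hit, in sorted order."""
    return sorted(name for name, n in hits.items() if n == 0)


# Hypothetical counters collected while simulating one workload.
sens_hits = {"fifo0.ram": 12, "fifo1.ram": 7, "fifo2.ram": 0}
obs_hits = {"channel0": 4, "channel1": 3, "channel2": 0}

print(workload_is_complete(sens_hits, obs_hits))
print(uncovered(sens_hits), uncovered(obs_hits))
```

Listing the uncovered sites, rather than returning only a yes/no answer, indicates where the random generator needs additional constraints.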
Fault List Generator
As already introduced in Section 2, starting from the list of sensible zones it is possible to generate the target fault list for the injection campaign. The idea is to provide the user with two different approaches.
The first possibility is random generation, performed through the Randomizer block of Figure 1. In this case, a subset of the complete set F of possible faults is randomly chosen to compose the target fault list. Each selected fault f ∈ F is identified by its fault location (SENS) and injection time. The random approach reduces the setup time of the fault-injection environment, but it may not lead to optimal results. In fact, this approach uniformly distributes faults over different locations and execution times, regardless of the real importance of the target zone for the behavior of the application. The alternative to random fault generation is an operational-profile-based fault list generation [23]. An Operational Profile (OP) is a collection of information about relevant fault-free system activities. The traced information consists of read/write activities associated with signals or system elements (registers, buses, memory elements, etc.), but it may also include higher-level information, such as the most probable sets of inputs that the system or application is expected to receive. Essentially, the purpose of the operational profile is to better understand the conditions in which the system or the application has to work (the workload), and then to analyze this information to target only faults that may actually lead to errors. This approach makes it possible to compact the fault list and to consider non-trivial faults only. In particular, faults that lead to predictable effects, such as "no effect", are kept in the fault list (they still contribute to the final measures) but are not injected. The achievable fault list reduction factor depends on the workload and on the complexity of the device under test. We can therefore split the fault list generation into three different steps: (i) the OP generation, (ii) the analysis of the OP and the generation of the compacted fault list, and (iii) the optional use of the randomizer to randomly select faults from the compacted list.
This last step is used only if the size of the compacted fault list is still too high for the computational resources. The following subsections will detail the OP generation and the OP analysis steps.
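The random selection step can be illustrated with a short Python sketch, assuming that a fault is simply a (location, injection time) pair as defined above. The function name, the location names, and the fixed seed are hypothetical choices made for this example; the same function could be applied to the full set F or to an already compacted list.

```python
import random
from itertools import product


def random_fault_list(locations, times, size, seed=0):
    """Draw `size` distinct (location, injection time) faults uniformly from F."""
    full = list(product(locations, times))  # the complete set F
    if size > len(full):
        raise ValueError("requested more faults than exist in F")
    rng = random.Random(seed)  # fixed seed so a campaign can be repeated
    return rng.sample(full, size)


locations = ["fifo0.ram", "fifo1.ram", "fifo2.ram"]
times = range(0, 100, 10)
faults = random_fault_list(locations, times, size=5, seed=1)
print(faults)
```

Sampling without replacement guarantees that no fault is injected twice, and seeding the generator keeps the campaign reproducible.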
Operational Profile Generation
An Operational Profile (OP) is an instrument to optimize the execution time of a fault-injection experiment. It consists of a collection of information (log) about relevant fault-free system activities on each potential fault location of the target system when the selected workload is applied. Logged activities include read/write associated with any element of the DUT model (signals, variables, registers, buses, etc.). Information is collected using simulator breakpoints. The operational profiler generates the script to trace the proper signals (including the breakpoint instructions able to log when a location is accessed by the DUT either for a write or a read operation).
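One possible in-memory representation of such a log is sketched below in Python. The record layout (time, location, operation) is an assumption made for illustration only; it is not the actual log format produced by the simulator script.

```python
from collections import defaultdict
from typing import NamedTuple


class Access(NamedTuple):
    time: int       # simulation time of the access
    location: str   # traced element, e.g. a memory word
    op: str         # "R" for read, "W" for write


def build_profile(records):
    """Group logged accesses by location, ordered by simulation time."""
    profile = defaultdict(list)
    for rec in sorted(records, key=lambda r: r.time):
        if rec.op not in ("R", "W"):
            raise ValueError(f"unknown operation {rec.op!r}")
        profile[rec.location].append((rec.time, rec.op))
    return dict(profile)


log = [
    Access(20, "ram[0]", "R"),
    Access(10, "ram[0]", "W"),
    Access(15, "ram[1]", "W"),
]
print(build_profile(log))
```

Grouping by location keeps the later analysis local: each fault location can be examined using only its own access history.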
As an example Figure 3 shows part of a VHDL code representing a FIFO core where we want to trace read and write memory accesses, i.e. operations on the ram variable. It is easy to see that we have two locations to trace: at row 58 (Figure 3) a write of ram occurs whereas at row 66 a read of ram occurs. The simulator commands used to log the two locations are shown in Figure 4. This example is based on Cadence NCSIM simulator; it instruments one breakpoint (stop) for each accessed location.
Figure 3 (recoverable excerpt of the FIFO VHDL code):
    if ((write_enb = '1') and (full = '0')) then
58.   ram(w_ptr) := data_in;
59.   empty <= '0';
60.   w_ptr:= (w_ptr + 1) mod (16
The result of this script is reported in Figure 5.
The information contained in the OP can be efficiently used to collapse the fault list associated with a given DUT and a given list of sensible zones with a consequent reduction of the simulation time. This is possible by introducing additional constraints to avoid the selection of inactive fault locations.
For example, let us consider transient faults (e.g. SEUs). A fault location is sensitive at a given time t_1 if the next operation performed on the same location, at time t_2 > t_1, is a read operation (i.e., the effect of the fault is not overwritten before being used). Therefore, in the case of transient faults, for each fault location any period that ends with a write operation is inactive (no faults need to be injected). Figure 6 shows an example of an OP analysis. The given OP shows that a selected fault location (e.g. a flip-flop) is read at simulation times 20, 30, and 60 and written at simulation times 10, 40, 50, and 70. Using the constraint previously introduced, the only intervals available for injection are those that end with a read: 10 < t_inj < 20, 20 < t_inj < 30, and 50 < t_inj < 60. Injecting at other time instants (e.g. between 30 and 40 or between 40 and 50) would be useless, since the fault would be overwritten by the next write operation. Moreover, to further reduce the size of the target fault list, the user can specify "condition" signals for injection, called "effect condition" and "no effect condition": a fault must be injected in these zones only if the "effect condition" signal is "true". A detailed explanation of the proposed approach can be found in [24]. Finally, users can implement their own collapsing rules based on their fault models.
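The rule for transient faults (a window is worth injecting into only if the next access to the location is a read) can be sketched in Python as follows. The function name and the (time, operation) encoding are hypothetical; the access times are those of the Figure 6 example, with reads at 20, 30, and 60 and writes at 10, 40, 50, and 70.

```python
def sensitive_windows(accesses):
    """Return the (start, end) windows in which a transient fault can still be read.

    `accesses` is a list of (time, op) pairs with op "R" or "W". The window
    between two consecutive accesses is kept only if it ends with a read;
    a window ending with a write is inactive, because the write would
    overwrite the injected fault.
    """
    ordered = sorted(accesses)
    windows = []
    for (t_prev, _), (t_next, op_next) in zip(ordered, ordered[1:]):
        if op_next == "R":
            windows.append((t_prev, t_next))
    return windows


op = [(10, "W"), (20, "R"), (30, "R"), (40, "W"), (50, "W"), (60, "R"), (70, "W")]
print(sensitive_windows(op))  # [(10, 20), (20, 30), (50, 60)]
```

Faults falling outside the returned windows can be recorded as "no effect" without being simulated, which is where the reduction in injection time comes from.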
Fault-Injection Manager
The Fault-Injection Manager performs the actual injection of faults inside the DUT. It resorts to the full controllability and observability of the DUT internal signals provided by the functional verification tool. One fault at a time is injected; based on the fault model, the injection engine stops the simulation, injects the fault (i.e., it executes the verification code that models the injected fault, as described in Section 3), and then resumes the simulation. At the same time, the behavior of the observation points (OBSEs/DIAGs) is logged for later analysis. An interesting improvement is the use of so-called "stop run" timers. They control the behavior of each simulation during the injection campaign in order to reduce the total simulation time. Examples of stop timers are: a timer that stops the simulation if the simulation time exceeds the expected test bench duration; a timer that stops the simulation a given period after the injection of a fault if no activity has been detected on the observation/diagnostic points; etc.
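The two example stop timers can be expressed as a simple decision function. The Python sketch below is an illustration only: the function name, its parameters, and the returned labels are hypothetical and do not correspond to any Specman API.

```python
def should_stop(now, bench_duration, inj_time, last_activity, quiet_period):
    """Return a reason for stopping the faulty simulation, or None to keep running.

    now            current simulation time
    bench_duration expected duration of the test bench
    inj_time       time at which the fault was injected (None if not yet injected)
    last_activity  time of the last event seen on OBSE/DIAG points (None if none)
    quiet_period   how long to wait after the injection for some activity
    """
    # Timer 1: the simulation ran longer than the expected test bench duration.
    if now > bench_duration:
        return "bench_timeout"
    # Timer 2: nothing happened on the observation points after the injection.
    if inj_time is not None and now - inj_time >= quiet_period:
        if last_activity is None or last_activity < inj_time:
            return "no_activity"
    return None


print(should_stop(now=80, bench_duration=100, inj_time=50,
                  last_activity=None, quiet_period=30))
```

Returning a reason rather than a plain boolean lets the result analyzer distinguish runs that were cut short from runs that completed normally.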
Result Analyzer
The result analyzer evaluates the result of the fault injection, i.e., the reaction of the system to the injected faults. The analysis is based on the information collected by the monitors placed on OBSEs and DIAGs (see Section 2). Based on the circuit functionality, fault effects can be classified in two ways: "failure" and "no-effect". A failure can manifest as a "data mismatch" or a "timing alteration" between the golden and the faulty DUT. In the case of no-effect, the error is overwritten or corrected by a fault tolerance mechanism and does not propagate in the circuit. In this case, the use of DIAGs (Section 2) makes it possible to understand whether the circuit really tolerates the fault. In the case of a data mismatch, the faulty DUT produces a wrong output w.r.t. the golden DUT: the error has propagated to the output of the circuit, generating a wrong behavior. In the last situation, i.e., a timing alteration, the circuit produces a correct result but with different timing. Depending on the constraints of the application and on the introduced overhead, this situation may or may not be acceptable.
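The classification described above can be sketched as a comparison between the output traces of the golden and the faulty DUT. In this Python illustration a trace is assumed to be a list of (time, value) events observed on an OBSE; the function name and this trace format are hypothetical.

```python
def classify(golden, faulty):
    """Classify one injection from the OBSE traces of the golden and faulty DUT."""
    if golden == faulty:
        return "no-effect"
    golden_values = [value for _, value in golden]
    faulty_values = [value for _, value in faulty]
    if golden_values != faulty_values:
        return "data mismatch"       # wrong (or missing) output values
    return "timing alteration"       # same values, produced at different times


golden = [(10, 0xA5), (20, 0x3C)]
print(classify(golden, [(10, 0xA5), (20, 0x3C)]))
print(classify(golden, [(10, 0xA5), (20, 0x3D)]))
print(classify(golden, [(10, 0xA5), (25, 0x3C)]))
```

A fuller analyzer would also consult the DIAG monitors, to tell a fault that was actively corrected apart from one that was simply overwritten.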
Another important measure provided by the result analyzer is the so-called coverage. In fault-injection terminology, the coverage is defined as the probability of system recovery when a fault appears; i.e., saying that a system has a fault coverage of 99% means that, over the totality of the injected faults, only 1% resulted in an error or failure. However, given the complexity of modern systems, it is necessary to provide more accurate measures. The traditional coverage is not a real assessment of the system reliability without a correlation with the "accuracy" of the fault list. In general, the smaller the fault list is, the less accurate the final reliability measures are. We therefore introduce the verification concept of functional coverage, defined as a systematic procedure to assess how, and how much, each verification item or specification requirement has been covered by the tests. These verification items are called "coverage items", and they depend on the system architecture and also on the application running on the system itself. We can therefore define a "cross-coverage" (a cross between functional coverage and fault-injection coverage) as a measure of how many times each coverage item has been hit by a fault, independently of whether the final effect of such an event is a failure or not. If at the end of the process a coverage item has not been cross-covered up to a certain threshold, the fault list was not accurate enough. A measure of the cross-coverage is also important to identify critical parts of the system for which the OP algorithm was not accurate enough.
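Both measures can be computed from a list of injection outcomes. The Python sketch below assumes that each injection is recorded as a (coverage item, effect) pair; the function names, the item names, and the threshold value are hypothetical and chosen only for illustration.

```python
from collections import Counter


def fault_coverage(injections):
    """Fraction of injected faults that did not result in a failure."""
    if not injections:
        raise ValueError("no injections to analyze")
    failures = sum(1 for _, effect in injections if effect == "failure")
    return 1 - failures / len(injections)


def cross_coverage(injections, items, threshold):
    """Count fault hits per coverage item, whatever their effect.

    Returns the per-item counts and the sorted list of items hit fewer
    than `threshold` times.
    """
    hits = Counter(item for item, _ in injections)
    counts = {item: hits.get(item, 0) for item in items}
    below = sorted(item for item, n in counts.items() if n < threshold)
    return counts, below


injections = [
    ("fifo0.write", "failure"),
    ("fifo0.write", "no-effect"),
    ("fifo1.read", "no-effect"),
]
items = ["fifo0.write", "fifo1.read", "fifo2.read"]
counts, below = cross_coverage(injections, items, threshold=2)
print(fault_coverage(injections), counts, below)
```

In this example the fault coverage alone would look acceptable, while the cross-coverage reveals that two coverage items were hit fewer times than the threshold requires.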
Experimental Results
As a proof of concept for the proposed tool, we performed a set of experiments on a simple router device, injecting SEUs. The router accepts data packets on a single input port and routes them to one of three output channels: channel0, channel1, or channel2. Each channel includes a buffer used to store the data to be sent to the output. The buffer is implemented as an 8 x 16 FIFO (16 words of 8 bits). As target fault locations we selected the three FIFOs. We performed two sets of experiments, both using the operational-profile-based generation. The first one adopts the collapsing algorithm proposed in Section 5.2, while the second one does not collapse the fault list. Both sets of experiments use the same workload. The aim of the experiments is to show the efficiency of the tool (using the Operational Profiler and the Collapser) in terms of injection time and percentage of fault effects¹. The selected workload is device-oriented and generated using Specman itself. It is important to underline that the components used to generate the workload for the injection campaigns are exactly the same as those adopted during functional verification, allowing high reusability. The resulting workload generates an equal number of packets for each channel of the router. The time overhead introduced by the generation of the operational profile, w.r.t. a fault-free simulation, is equal to 16.3%. The operational profile generation shows that each ram is accessed 3216 times during the simulation.
¹ In a campaign without collapsing the fault list, the percentage of no-effect faults should be higher than with the collapsed fault list.
In order to reduce the simulation time we applied the collapsing algorithm introduced in Section 5.2. Figure 7 sketches the performance of the algorithm in terms of predicted fault effects. For each fault location, the chart shows the number of candidate faults and the relative number of forecasted "no effect" faults. The number of real injections is reduced by roughly 50%. The "unknown effect" category corresponds to the actual fault-injection list. We have 1622 injections for each ram (1622 × 3 = 4866) instead of 3216 injections (3216 × 3 = 9648). The results of the injection campaign with the collapsed fault list are shown in Figure 8.a. The time required to perform the 4866 simulations is about 5 hours. To show the effectiveness of the collapsing procedure, Figure 8.b shows the injection results using the complete fault list (i.e. 3216 injections for each ram). The required simulation time was in this case 11 hours (more than twice the simulation time with the collapsed fault list). Moreover, we obtained a higher number of "no effect" outcomes relative to data mismatch violations. More complex experiments with the tool have been performed during the validation of a real safety-critical system based on a 32-bit RISC processor. Examples of results obtained from these fault-injection campaigns are: 125 s of CPU time for the total operational profile extraction, 464K total lines of operational profile, 6 s of CPU time for the total collapsing, 25K faults after collapsing, and 56.6 s of average single injection time (the workload of this example was very complex).
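The router figures quoted above can be cross-checked with a few lines of Python. The per-ram counts and the campaign durations are taken from the text; the totals and ratios are simply recomputed from them, and the variable names are introduced only for this check.

```python
rams = 3
accesses_per_ram = 3216      # candidate faults per ram (complete list)
injections_per_ram = 1622    # faults left per ram after collapsing

full_list = accesses_per_ram * rams           # 9648
collapsed_list = injections_per_ram * rams    # 4866
reduction = 1 - collapsed_list / full_list

hours_full, hours_collapsed = 11, 5           # approximate campaign durations
speedup = hours_full / hours_collapsed

print(full_list, collapsed_list)
print(f"fault list reduction: {reduction:.1%}")
print(f"simulation time ratio: {speedup:.1f}x")
```

The time ratio comes out slightly above the list-size ratio; since the reported durations are approximate, the two figures are best read as consistent rather than as an exact match.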
Conclusions
This paper presented a fault-injection environment based on the functionalities provided by EDA functional verification tools and languages. The main innovative features of the proposed tool are: the integration of verification and fault-injection methodologies in the same environment; the possibility of working with different description languages and at different abstraction levels; the use of a standard verification language to model faults in a systematic and well-defined way; the use of Operational Profiles to generate effective and non-trivial fault lists; and, finally, the use of coverage concepts to deliver precise measures of the completeness of the fault-injection experiment. Experimental results show the efficiency of the proposed flow when adopting collapsing rules.
