The automotive industry has undergone radical changes driven by several megatrends in recent years. First, as governments push for safer and cleaner transportation, the transition to emission-free vehicles has been put on the agenda of car manufacturers, which has accelerated the electrification of vehicles. Second, as an essential part of the Internet of Things, automobiles will be connected to the outside world, enabling infotainment and advanced driver-assistance systems (ADAS). Last but not least, with the rapid progress in artificial intelligence and machine learning, what seemed utopian until recently, i.e., the development of autonomous vehicles, has become possible and is being pursued by many traditional automotive manufacturers and new players in the market [1]. Driven by these megatrends, most recent innovative features in automobiles are implemented in electronics and software, resulting in a sharp increase in the complexity of the electronic and software components in a vehicle. A modern car can contain over 200 electronic control units (ECUs), several in-vehicle communication networks, and several hundred megabytes of software [2]. An obvious consequence of the increased complexity is an increase in defects and failures in the electronic and software components that can potentially lead to accidents on the road. Therefore, paramount significance has been placed on the functional safety of the electrical and electronic (E/E) systems in the vehicle.
As part of overall vehicle safety, functional safety requires that malfunctions of electronic and software components do not cause harm to vehicle occupants, other road users, or pedestrians. It is usually achieved by making sure that the system can detect the failure of a function, react in a timely manner to mitigate the hazards of the malfunctioning behavior, e.g., by providing emergency reactions or notifying the driver to take action, and eventually transition to a safe state. How to detect and mitigate the hazards depends largely on the specific function that fails and on the environment of the function, including the overall vehicle architecture. As we move into the era of increasing autonomy in automobiles, the requirements of functional safety can change drastically as the level of autonomy increases and the vehicle architecture evolves.
SAE International defines six levels of automation for the driving functions in autonomous vehicles, where level 0 means no automation at all and level 5 means full automation [3]. The automation levels of current driving functions range from level 0 to level 2.5, which means partial automation of certain functions has been enabled. Accordingly, the state-of-the-art safety architectures fall into the category of fail-safe systems: when the function fails, the system should transition to a known safe state so that the failure of the function cannot cause any harm. On many occasions, the system is no longer fully operational in the safe state, and the driver is notified that he or she needs to take action. Functional safety of such automotive E/E systems often relies on the assumption that there is a driver in the vehicle as the last resort and that certain functions are still operated by mechanical systems. However, when the human factor is gradually taken out of the equation as the autonomy level increases and more components of a vehicle are electrified, a fail-safe architecture would fall short of functional safety requirements. Fail-operational architectures, where the system can continue operation in the presence of component failures, will be in demand to fulfill the requirements.
Traditionally structured like a tiered pyramid, the hierarchy of automotive supply chain features vehicle manufacturers at the top, often referred to as original equipment manufacturers (OEMs). OEMs purchase parts or systems from Tier 1 suppliers and integrate them into the vehicles. The parts or systems from Tier 1 suppliers usually fulfill a specific function at the vehicle level, e.g., engine control unit. Underneath Tier 1 suppliers are a wide range of Tier 2 suppliers, who supply components of the systems such as SoCs and accompanying software. Semiconductor companies are often considered as Tier 2 suppliers.
As requirements of functional safety are closely associated with the function at the vehicle level, they were traditionally driven by OEMs or Tier 1 suppliers. In fact, as Tier 1 suppliers usually implemented the systems realizing the function, they played a predominant role in developing the technical safety concept. Dynamics of the automotive supply chain have also been undergoing changes driven by several technological trends. First, as the autonomous driving functions have to orchestrate a wide range of systems in the vehicle, OEMs are getting more involved with technical safety concepts and are more willing to get semiconductor companies involved in the discussion. Second, as SoC integration technologies advance, more and more functions of a system can be integrated into an SoC or a system-in-package (SiP) [4] . It is not too far away from the day when a function at vehicle level can be realized by a single chip. Third, several semiconductor suppliers are also evolving toward system suppliers providing entire solutions as market differentiators. In a nutshell, the hierarchy of automotive supply chain has been flattened [5] , and the lines between the roles and responsibilities of different players have been blurring. Semiconductor companies are playing an increasingly active role in the functional safety development of a system and the knowledge of functional safety is quite sought-after for engineers developing automotive SoCs.
Since the release of its first edition in 2011, ISO 26262 has been the mandate for functional safety of E/E systems on road vehicles. It provides guidelines on the functional safety development flow for automotive E/E systems at the vehicle level throughout their life cycles. As automotive SoCs are part of the E/E systems and their development does not span the whole life cycle of the systems, semiconductor companies need to tailor the ISO 26262 development flow to their own needs. The first edition of the standard does not provide guidelines specific to the development of automotive-grade semiconductors. The second edition, released in 2018, added Part 11, dedicated to guidelines on semiconductors, as functional safety for automotive SoCs has gained a lot of traction recently. Yet, the overall descriptions of ISO 26262 are written from the perspective of OEMs or Tier 1 suppliers rather than that of semiconductor companies.
This article is intended to give a tutorial introduction to functional safety for design, verification, and validation of automotive SoCs for applications in ADAS and future autonomous driving. We review the current practices and discuss challenges in achieving functional safety for semiconductor professionals in the automotive space. By explaining the problems and articulating the challenges in this area, we would like to bring them to the attention of researchers in the semiconductor field, in the hope of soliciting advanced technologies and solutions to the challenges.
What is functional safety
Safety of road vehicles has been the essential concern of the automotive industry since the birth of automobiles. From a contemporary perspective, a safe vehicle comprises the following four elements [6] :
• Road safety concerns reducing accidents by human errors. As of today, 94% of road accidents are caused by human errors [7] . ADAS technologies have been developed to diminish the accidents by human errors, and many car manufacturers are aggressively working on autonomous driving to eventually get the human factor out of the equation.
• Device reliability concerns zero failures of the device. The goal is to improve the manufacturing quality and design robustness to reduce the failure rates of the components.
• Functional safety concerns zero accidents by the failures of systems in automobiles.
• Security concerns preventing the car from being hacked, given that cars will increasingly be connected to each other in the near future.
As we define functional safety as one of the pillars of the overall vehicle safety, we review the difference of functional safety from other pillars of vehicle safety. Hopefully, we can elucidate what functional safety is by elaborating what functional safety is not.
Functional safety is not safety of the intended function
ADAS systems use complex sensors and processing algorithms to perceive the surrounding environment and driving situation in order to assist drivers and thus reduce human errors. The proper situational awareness of an ADAS system is critical to safety of the vehicle. Unlike many well-established systems such as Dynamic Stability Control Systems, unintended behaviors of ADAS systems, due to technological and system shortcomings and/or reasonably foreseeable misuse, can cause hazardous events. Safety of the intended function concerns safety of the use of a specific function (often in ADAS systems), in absence of the faults covered by ISO 26262. For example, cameras are heavily used for object detection in ADAS systems. However, the perception of the cameras could have technological or performance limitations, which should be foreseen before the system is put in use. An example limitation for CMOS cameras is that, when a vehicle gets out of a long dark tunnel, the camera might be saturated for several seconds so that it could not detect objects during this period. In such a scenario, there is no fault or failure in the camera, and therefore, there is no issue with functional safety. However, the performance limitation of the camera could cause hazardous events if the vehicle solely relies on perception from camera to enable driver assistance. Therefore, preventative measures such as using diverse sensors for perception should be in place to avoid such misuse. Safety of intended functions ought to be addressed by ISO 21448, which is under development [8] .
Functional safety is not reliability
Reliability engineering concerns failure mechanisms of components and how to improve manufacturing quality and design robustness to reduce failure rates of the components. The failure rate can be defined as 1) the frequency with which a component fails, expressed in failures per unit of time (often hours for semiconductors), or 2) the total number of failures within a population, divided by the time expended by that population during a particular measurement interval. Safety engineering concerns the failures of systems due to failures of components and how to control and mitigate failure effects of the components to prevent system failures from causing harm. In other words, safety engineering is intended to address the fact that no component can be perfectly reliable. Although reliability does not equate to functional safety, it does impact functional safety in that the failure rate data can influence the safety measures used to control and mitigate the component failures. For example, semiconductor technology scaling could aggravate transient faults and soft errors on the chip, which would drive the safety architecture to encompass more effective safety mechanisms for such failures.
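The second failure rate definition above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the conversion to FIT (failures per 10^9 device-hours, a unit commonly used for semiconductors) is an addition not mentioned in the text above.

```python
def failure_rate_per_hour(failures: int, population: int, hours_per_unit: float) -> float:
    """Failures divided by the total time expended by the population."""
    device_hours = population * hours_per_unit
    if device_hours <= 0:
        raise ValueError("population and hours_per_unit must be positive")
    return failures / device_hours


def to_fit(rate_per_hour: float) -> float:
    """Convert failures per hour to FIT (failures per 1e9 device-hours).

    FIT is an assumed unit for this sketch, not taken from the article.
    """
    return rate_per_hour * 1e9


# Hypothetical data: 2 failures among 1000 devices observed for 1000 hours each.
rate = failure_rate_per_hour(failures=2, population=1000, hours_per_unit=1000.0)
fit = to_fit(rate)
```

With these made-up numbers the population accumulates 10^6 device-hours, so the rate is 2 failures per 10^6 hours.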
Functional safety is not security
Security requires that E/E systems and software components of the vehicles must be resilient against system hacks. Security of a system could be violated by compromising confidentiality and integrity of the assets in the system. Examples of assets in automotive E/E systems include trim values, firmware execution flows, cryptographic keys, and user identity information. Confidentiality requires that an asset cannot be accessed by unauthorized agents. This requirement applies to assets such as user private information. Integrity requires that an asset is protected from unauthorized modifications. This requirement is usually essential for assets serving as the root of trust, e.g., secure boot firmware [9]. Traditionally, security used to be less of a concern for automobiles as security attacks were only feasible when malicious agents had physical access to the automobiles. However, as automobiles are getting highly connected to each other and the external world, remote attacks have become possible.
Safety requires that failures of systems should not cause harmful events, where the failures can be attributed to development bugs or random hardware faults. Security requires that confidentiality and integrity should not be compromised by malicious agents, where the loss of integrity could possibly cause harmful events. A notable and somewhat informal difference is that security concerns failure due to deliberate attacks by malicious agents, whereas safety concerns failure due to unintended faults. The security concerns for automotive E/E systems ought to be addressed in ISO 21434 "Road Vehicles-Cybersecurity Engineering," which is under development [10].
Functional safety is not availability
Availability of systems is often highly desirable, however, not always necessary for maintaining safety of the vehicle. Unavailability of the system function does not always mean a safety violation. For example, if a fault in the ECU is detected during the self-test when the engine is being ignited, the ECU will abort the engine from starting. In this scenario, availability is lost in the presence of the failure of ECU functions. However, the vehicle is still in a safe state, and thus, functional safety is maintained.
ISO26262: Risk-based development approach
IEC 61508 is the basic functional safety standard applicable to E/E systems in all kinds of industries. It covers the safety management, system/hardware design, software design, production, and operation of safety-related E/E systems. ISO 26262 is an adaptation of IEC 61508 as the mandate for functional safety of automotive E/E systems. ISO 26262 defines functional safety as the absence of unreasonable risk due to hazards caused by malfunctioning behavior of E/E systems. ISO 26262 provides a standardized framework to determine the risks, and guidelines on how to manage the development process to reduce the risks to an acceptable level [11]. The management of functional safety spans the lifecycle of a safety-related product, including the concept phase, development phase, and production phase, as shown in Figure 1. As automotive SoCs are part of a safety-related product at the vehicle level, it is beneficial for semiconductor professionals to understand the entire safety lifecycle and the corresponding roles and responsibilities of OEMs, Tier 1 suppliers, and semiconductor suppliers. Therefore, we describe the development activities in each phase of the lifecycle in this section before focusing on automotive SoCs.
Safety product concept

Item definition
The very beginning of the product concept is to define the item. ISO 26262 defines an item as a system or array of systems that implements a function or part of a function at the vehicle level. As it concerns the function at the vehicle level, the item definition is typically done by OEMs. The definition of the item can include descriptions of the function, functional block diagrams, the environmental conditions that the item interacts with, legal requirements, and external measures for risk reduction.
Initialization of safety lifecycle
Based on the item definition, the safety lifecycle is initiated by differentiating between a new development and a modification to an existing item. If it is a modification to an existing item, the activities in the lifecycle are tailored per requirements. In this article, we focus on describing the process for new development.
Hazard analysis and risk assessment
The goal of hazard analysis and risk assessment (HARA) is to quantify the risks of the hazards caused by the failure of the functions. It is usually the responsibility of OEMs. First, all possible malfunctions of the item are defined. The definition of a malfunction usually considers two aspects: 1) the function is not correctly executed when required, and 2) the function is executed when not required. Then, vehicle operating conditions and driving scenarios where malfunctions could occur are considered. Each combination of a malfunction with a vehicle operating condition and driving scenario is treated as a possible hazardous event. For each hazardous event, the associated risk is assessed based on three factors: severity (how much harm?), exposure (how often is it likely to happen?), and controllability (what is the likelihood that the hazard can be controlled?).
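As a minimal sketch of the enumeration step described above, the hazardous events can be generated as the Cartesian product of malfunctions and driving scenarios. The malfunction and scenario strings are hypothetical examples, and a real HARA would discard combinations that cannot occur rather than keep the full product.

```python
from itertools import product


def enumerate_hazardous_events(malfunctions, scenarios):
    """Pair every malfunction with every driving scenario.

    Each (malfunction, scenario) pair is a candidate hazardous event that
    is later rated for severity, exposure, and controllability.
    """
    return list(product(malfunctions, scenarios))


# Hypothetical inputs for illustration only.
malfunctions = ["headlights fail off", "headlights turn on unintentionally"]
scenarios = ["country road at night", "entering a tunnel"]

events = enumerate_hazardous_events(malfunctions, scenarios)
for malfunction, scenario in events:
    print(f"{malfunction} while {scenario}")
```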
Severity concerns the potential harm that the hazardous event can cause to the persons at risk, based on possible injuries that could occur. ISO 26262 defines four severity classes as shown in Table 1 .
Exposure considers either the duration or frequency of the considered situation. Duration is used when the hazardous event occurs due to a sudden malfunction during the situation under consideration, for example, headlights failing off while driving on a country road at night. Frequency is used when a malfunction already exists and the hazardous event occurs only when the situation under consideration arises, for example, headlights failing to turn on when the driver enters a tunnel and tries to switch them on. ISO 26262 defines five exposure classes, whose descriptions range from incredible through very low, low, and medium probability to high probability, as shown in Table 2.
Controllability considers the control of the hazards by the drivers and/or other traffic participants such as pedestrians. ISO 26262 defines four controllability classes as shown in Table 3.
The risk level associated with each combination of hazardous event and driving scenario is determined from these three factors and is expressed as an Automotive Safety Integrity Level (ASIL). Four ASILs are defined by ISO 26262: ASIL A, ASIL B, ASIL C, and ASIL D, where ASIL A is the lowest safety integrity level and ASIL D the highest one. In addition to these four ASILs, the class quality management (QM) denotes that standard quality assurance is sufficient, without requirements to comply with ISO 26262. A rule of thumb for determining the ASIL based on the severity, exposure, and controllability classes is: 1) if any of the three falls into class 0 (i.e., S0, E0, C0), then the result is QM; and 2) otherwise, add up the class numbers of S, E, and C. If the sum is 10, then it is ASIL D; ASIL C if the sum is 9, ASIL B if the sum is 8, ASIL A if the sum is 7, and QM if the sum is less than 7.
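The rule of thumb above translates directly into a small function. The sketch below assumes class ranges S0 to S3, E0 to E4, and C0 to C3 (which make 10 the largest possible sum); the function name is hypothetical, and the normative assignment remains the lookup table in ISO 26262.

```python
def determine_asil(severity: int, exposure: int, controllability: int) -> str:
    """Rule-of-thumb ASIL from S, E, and C class numbers.

    Assumed ranges: S0-S3, E0-E4, C0-C3.
    """
    if not (0 <= severity <= 3 and 0 <= exposure <= 4 and 0 <= controllability <= 3):
        raise ValueError("class number out of range")
    # Step 1: any class 0 means quality management is sufficient.
    if 0 in (severity, exposure, controllability):
        return "QM"
    # Step 2: otherwise the sum of the class numbers selects the ASIL.
    total = severity + exposure + controllability
    return {10: "ASIL D", 9: "ASIL C", 8: "ASIL B", 7: "ASIL A"}.get(total, "QM")


# Hypothetical rating: life-threatening harm (S3), high exposure (E4),
# normally controllable (C2) gives a sum of 9.
print(determine_asil(3, 4, 2))
```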
For each malfunction, an ASIL is assigned based on the highest ASIL of the hazardous events caused by this malfunction under different driving scenarios, and a safety goal is determined for each malfunction along with the assigned ASIL. If multiple safety goals are similar, they can be combined into a single safety goal, and the highest ASIL should be assigned to the combined safety goal.
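Taking the highest ASIL over a set of hazardous events, or over a set of safety goals being combined, can be sketched as a maximum over an ordered list. The names `ASIL_ORDER` and `highest_asil` are hypothetical and used only for illustration.

```python
# Integrity levels ordered from lowest to highest.
ASIL_ORDER = ["QM", "ASIL A", "ASIL B", "ASIL C", "ASIL D"]


def highest_asil(asils):
    """Return the most stringent level in a non-empty collection.

    Raises ValueError for an empty collection or an unknown level.
    """
    return max(asils, key=ASIL_ORDER.index)


# Hypothetical ratings of one malfunction under three driving scenarios.
ratings = ["QM", "ASIL B", "ASIL A"]
safety_goal_asil = highest_asil(ratings)

# Combining two similar safety goals keeps the higher of their ASILs.
combined_goal_asil = highest_asil([safety_goal_asil, "ASIL C"])
```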
Functional safety concept
Functional safety concept is typically done by OEMs or Tier 1 suppliers. Based on the item definition and safety goals, the functional safety concept is specified considering preliminary architectural assumptions. The goal of functional safety concept is to derive functional safety requirements and allocate them to architectural elements. In practice, the functional safety concept follows five steps. First, for each safety goal, safety-related characteristics specific to the function are listed and examined. The characteristics include but are not limited to the following:
• Safe state: The operating mode without unacceptable risks in case of failure of the item.
• Fault tolerance time interval (FTTI): The maximum time span during which faults can be present in a system without the safety goal being violated.
• Warning concept: Measures to pass information to the driver regarding the potentially dangerous condition.
• Emergency operation: When a safe state cannot be directly reached after the detection of the fault, emergency operation can be used to provide safety until the transition to a safe state is reached.
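The FTTI acts as a timing budget for the safety concept. As a hedged sketch, one common way to check it is to require that the time to detect a fault plus the time to react (reach a safe state or enter emergency operation) fits within the FTTI. Splitting the budget into detection and reaction time is an assumption of this sketch and is not spelled out in the list above; all names and numbers are hypothetical.

```python
def meets_ftti(detection_ms: float, reaction_ms: float, ftti_ms: float) -> bool:
    """Check that fault handling completes within the FTTI.

    Assumption: fault handling time = detection time + reaction time.
    """
    if min(detection_ms, reaction_ms, ftti_ms) < 0:
        raise ValueError("time intervals must be non-negative")
    return detection_ms + reaction_ms <= ftti_ms


# Hypothetical budget: 10 ms to detect, 30 ms to reach the safe state,
# against a 50 ms FTTI.
within_budget = meets_ftti(detection_ms=10, reaction_ms=30, ftti_ms=50)
too_slow = meets_ftti(detection_ms=40, reaction_ms=20, ftti_ms=50)
```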
Second, at least one functional safety requirement is derived for each safety goal. Functional safety requirements are regarding the behavior of the item and are independent of implementation. The requirements can include but are not limited to the following aspects:
• Requirements of the functional limit conditions, e.g., the function should not be active within a speed limit.
• Requirements of functional redundancy and/or additional functions to achieve the safety goal.
• Requirements to functions of non-E/E elements, e.g., mechanical hardware.
• Requirements of safety features controlled by the driver, e.g., the capability of manually shutting off the function.
Third, based on the functional safety requirements, a preliminary functional safety architecture is derived by refinement of the functional architecture of the item. It is usually done by taking the functional block diagram from the item definition and expanding it by considering the functional safety requirements. If necessary, new requirements can be added or the existing requirements can be modified. The refinement is an iterative process.
Fourth, functional safety requirements are allocated to architectural elements and ASILs are allocated accordingly. Usually, the ASIL of a functional safety requirement is inherited from the associated safety goal unless there is an ASIL decomposition.
Finally, the functional safety concept needs to go through verification reviews with a certain degree of independence.
Safety product development
Upon the initiation of safety product development, various activities are planned. The planned activities usually include system design and technical safety concept, hardware development, software development, system integration and testing, safety validation, functional safety assessment, and finally, release for production. This section focuses on the development of the product as an item. For each component of the item, similar disciplines could be applied. Also, hardware development and software development generally follow procedures similar to system development. Therefore, we do not elaborate on hardware and software development in this section and will provide more details in the context of automotive SoCs in the next section.
System design and technical safety concept
System design and technical safety concept of an item are typically the responsibilities of Tier 1 suppliers. System design refines the functional view of the architecture with implementation-related technical views, whereas technical safety concept refines the functional safety requirements into technical safety requirements allocated to concrete architectural elements. System design and technical safety concept usually go hand in hand in the following steps.
First, based on the functional block diagrams and functional safety concept, a system design draft is developed with all involved components and all interfaces. This maps the functional view of the architecture into the technical view of the architecture. For example, in the functional architecture, we could have a block called object detection without specifying how it is realized. In system design, it will be specified that the object detection is realized by radar, or lidar or their combination.
Second, technical safety requirements are derived from functional safety requirements. ASILs are inherited from the corresponding functional safety requirements. Also, requirements for hardware architecture metrics and requirements for safety goal violation due to failure rates from random hardware failures are included.
Third, technical safety requirements are allocated to the components. Sometimes, an extension of the system is required to add additional measures, to fulfill the safety requirements, e.g., redundant system elements, or other safety mechanisms. Safety mechanisms refer to functions implemented by the E/E system, or by other technologies, to detect faults or control failures in order to achieve or maintain the safe state.
Finally, the system design and technical safety concept are verified, often through safety analysis. Such analyses include fault tree analysis (FTA), failure modes and effects analysis (FMEA), and failure modes, effects, and diagnostic analysis (FMEDA). It is an iterative process to analyze and optimize the safety system design.
The above procedure applies to the system design at the vehicle level. For each component of the system, similar procedures could be applied for the technical safety concept of the component. In addition, for component-level technical safety concept, there is often a need to allocate technical safety requirements to hardware and software and decide the hardware software interface (HSI).
System integration and testing
The components of a system or subcomponents of a component can either be designed in-house or purchased from Tier 2 or Tier 3 suppliers and then integrated into the system. There are three levels of integration: hardware software integration within a component (often done by component suppliers), integration of components within the system (often done by Tier 1 suppliers), and vehicle integration of the item (often done by OEMs).
Testing must be performed at each level of integration. Various test methods could be used to show the following objectives are achieved:
• Correctness of system safety requirements implementation: An example testing method for this objective is requirement-based testing.
• Correctness of safety mechanism functionality: An example testing method is fault injection testing, which introduces logical or physical faults into the components or systems to invoke the safety mechanisms.
• Correctness and completeness of internal and external interfaces implementation: This could be done by testing the connectivity, compatibility, and timing of the interfaces and checking the consistency of the interface protocols.
• Diagnostic coverage of hardware fault detection mechanisms: This could be done by fault injection testing.
• Level of robustness: This could be done by stress testing, which verifies the correct operation of the system under high operational loads or high demands from the environment, e.g., extreme temperatures.
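To make fault injection testing and diagnostic coverage concrete, the toy sketch below protects an 8-bit word with one even-parity bit (standing in for a safety mechanism), injects bit-flip faults into the stored codeword, and reports the fraction of injected faults that the parity check detects. The parity scheme, the fault list, and all names are assumptions chosen for illustration; real campaigns inject faults into RTL, gate-level netlists, or hardware.

```python
from itertools import combinations

CODEWORD_BITS = 9  # 8 data bits plus 1 parity bit


def parity(value: int) -> int:
    """Return 1 if the number of set bits is odd, else 0."""
    return bin(value).count("1") % 2


def encode(data: int) -> int:
    """Append an even-parity bit to an 8-bit data word."""
    return (data << 1) | parity(data)


def fault_detected(codeword: int) -> bool:
    """Safety mechanism: a valid codeword has an even number of set bits."""
    return parity(codeword) == 1


def inject_bit_flips(codeword: int, positions) -> int:
    """Fault injection: flip the bits at the given positions."""
    for position in positions:
        codeword ^= 1 << position
    return codeword


def measured_coverage(data: int, flips_per_fault: int) -> float:
    """Fraction of injected faults detected by the parity check."""
    golden = encode(data)
    faults = list(combinations(range(CODEWORD_BITS), flips_per_fault))
    detected = sum(fault_detected(inject_bit_flips(golden, f)) for f in faults)
    return detected / len(faults)


single_bit_coverage = measured_coverage(0xA5, flips_per_fault=1)
double_bit_coverage = measured_coverage(0xA5, flips_per_fault=2)
```

In this toy setup every single-bit flip changes the parity and is detected, while every double-bit flip preserves the parity and escapes, which shows why the measured coverage depends on the fault model used for injection.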
Safety validation
Safety validation is performed for the item integrated on representative vehicles to validate the safety goals by testing against the functional safety requirements at the vehicle level. It is intended to provide evidence that all safety measures are effective under intended use, and that the safety goals are correct and fully achieved at the vehicle level. The test cases must consider the safety goals and the functional safety concept. Validation methods include but are not limited to:
• Long-term tests, such as vehicle driving schedules and captured test fleets.
• User tests under real-life conditions, panel or blind tests and expert panels.
Functional safety assessment
Before release of production, functional safety achieved by the product is assessed. Functional safety assessment can be done as an entire single step or as several steps during the development process. The scope of the assessment includes the following:
• Follow-up of recommendations and corrective actions from previous functional safety assessments (if there is any).
• Evaluation of compliance with the work products required by the safety plan.
• Evaluation of the implementation of functional safety processes.
• Review of the appropriateness and effectiveness of the implemented safety measures.
The assessment report contains a recommendation for acceptance, conditional acceptance, or rejection of the achieved functional safety of the product.
Release for production
The release for production is allowed when the safety case and the functional safety assessment are complete and approved.
Production and operation of safety-related products
The management of functional safety does not stop after the release for production. The following scenarios should be managed with proper measures.
Change management
If any change to the item is required during production, potential impact to functional safety should be thoroughly assessed before a change is made.
Customer returns
Field monitoring process should be instrumented to collect data to be analyzed to detect the presence of any functional safety issues. Customer returns should be carefully analyzed to determine if there are any functional safety related issues and their impacts.
Anomalous events
Anomalous events in the production process should be analyzed to evaluate if they can impact the functional safety of the products.
Functional safety of automotive SoCs
In this section, we zoom into the functional safety development of hardware and software in the context of safety-related automotive SoCs. There are two paradigms of how safety-related automotive SoCs are developed. On the one hand, an SoC can be designed in context, i.e., to fulfill a custom order of a Tier 1 supplier for a specific system. In this case, the SoC development starts with taking the system-level technical safety concept as the input to come up with the SoC-level technical safety concept. On the other hand, an SoC can be developed as a safety element out of context (SEooC), meaning that it is not bound to a specific system but can be used in several systems for similar applications. In this case, the SoC development starts with assumptions about the system-level technical safety concept and the system design. As semiconductor companies strive to drive value creation by developing reference designs using their chips for Tier 1 suppliers, the SEooC development paradigm has been gaining popularity lately. SEooC development requires more effort as it tends to make more conservative assumptions about the system.
Note that as automotive SoCs fulfill only part of the item, many major steps (item definition, HARA, and functional safety concept) in the concept phase of the item will not be present in the development of automotive SoCs. Many other steps of the development process of safety-related SoCs are similar, with tailoring, to those of an item described in the "ISO26262: Risk-based development approach" section, except that safety validation is only performed at the vehicle level. Therefore, in this section, we do not repeat the explanation of the process but dive into some technical activities along the process. As the concept of faults is essential for understanding the safety architecture and implementation, we first provide a review of faults in the context of functional safety. Then, we discuss the safety analyses used to drive the development and verification of safety architectures. We also elaborate on common safety mechanisms used in automotive chips and then discuss the verification of safety-related SoCs.
Faults in context of functional safety
The reduction of risks is accomplished by detecting, controlling, or mitigating the faults that could potentially cause the violation of safety goals. ISO 26262 makes a distinction between fault, error, and failure. The abnormal condition that can cause an element or item to fail is a fault. An error refers to the discrepancy between a computed, observed, or measured value or condition and the true, specified, or theoretically correct value or condition. A failure refers to the termination of the ability of an element or an item to perform a function as required. Faults can progress to errors and eventually failures if they are not controlled or mitigated in a timely manner. Note that a failure at the component level could be a fault at the system level.
ISO 26262 concerns the following two types of faults:
• Systematic faults are due to specification or design issues and are manifested in a deterministic way. Systematic faults can occur to software and hardware, and can only be eliminated by improving the development process with measures such as safety analyses and verification. The most common systematic fault is a development bug.
• Random hardware faults occur unpredictably in the lifetime of a hardware component and are due to physical processes such as wear-out, physical degradation or environmental stress. Random hardware faults can be reduced by reliability engineering, but cannot be completely eliminated.
Random hardware faults can be further divided into the following two categories:
• Permanent faults occur and stay until removed or repaired. Examples include stuck-at faults and bridging faults.
• Transient faults occur once and subsequently disappear. Transient faults can appear due to causes such as electromagnetic interference or alpha particles. As the technology node scales down, memory elements such as flip-flops and memory arrays are increasingly vulnerable to transient faults. Examples include single event upset (SEU) and single event transient (SET).
Safety mechanisms are measures and technologies placed in the product to detect, control and mitigate random hardware faults. The overall effectiveness of the safety mechanisms in an SoC safety architecture can be quantified by hardware architectural metrics. Before giving the definitions of the metrics, we first introduce the fault categorization by ISO 26262, as illustrated by the flowchart in Figure 2 . The categorization assumes that there is a safety goal and heavily relies on the judgment of the fault effects with regard to the safety goal.
To categorize a fault, we need to first determine whether the fault is within a safety-related element. If not, the fault is not of concern and can be seen as a safe fault and is not included in the safety analysis. If the element is safety-related, then the question is whether the fault by itself has the potential of violating a safety goal directly in the absence of safety mechanisms. If not, then the fault can be put aside for now as it will be considered later. If the fault can potentially violate the safety goal, then the next question is whether there is any safety mechanism in place. If not, the fault is a single-point fault (SPF), which is generally undesired for a safety architecture. In situations where there are safety mechanisms in place, not all faults can be necessarily covered. If the fault cannot be covered by safety mechanisms, it is categorized as a residual fault.
Faults that are covered by safety mechanisms against direct violation do not violate the safety goals in the presence of those safety mechanisms. Together with the faults that cannot by themselves violate the safety goal (which we put aside previously), they are evaluated based on their potential to violate the safety goal in combination with another independent fault, that is, as a second-order effect. If there is no such potential, the fault is considered a safe fault. If the dual-point fault has the potential to violate the safety goal, then it is evaluated whether there are safety mechanisms to protect against it. If there is no such safety mechanism or the safety mechanisms cannot cover it, the fault is categorized as a latent multiple-point fault (MPFs, Latent), or latent fault for short. Otherwise, it is a detected multiple-point fault (MPFs, Detected). There is no need to consider a fault in combination with two other independent faults, as ISO 26262 treats MPFs of order greater than two as safe faults because their probability of occurrence is extremely low in the automotive space.
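The decision flow described above can be sketched as a small function. This is a minimal sketch, assuming that the answer to each question in the flowchart is already available as a boolean (in practice these answers come from engineering analysis with regard to a specific safety goal). The function, parameter, and enum names are hypothetical and are not taken from ISO 26262.

```python
from enum import Enum


class FaultClass(Enum):
    SAFE = "safe fault"
    SINGLE_POINT = "single-point fault"
    RESIDUAL = "residual fault"
    LATENT_MPF = "latent multiple-point fault"
    DETECTED_MPF = "detected multiple-point fault"


def classify_fault(
    in_safety_related_element: bool,
    can_violate_goal_alone: bool,
    has_mechanism_against_direct_violation: bool,
    covered_against_direct_violation: bool,
    can_violate_goal_with_second_fault: bool,
    covered_against_latency: bool,
) -> FaultClass:
    """Walk the categorization questions in the order given in the text."""
    if not in_safety_related_element:
        return FaultClass.SAFE
    if can_violate_goal_alone:
        if not has_mechanism_against_direct_violation:
            return FaultClass.SINGLE_POINT
        if not covered_against_direct_violation:
            return FaultClass.RESIDUAL
    # Here the fault either cannot violate the goal alone or is covered,
    # so only its second-order (dual-point) effect remains to be judged.
    if not can_violate_goal_with_second_fault:
        return FaultClass.SAFE
    if not covered_against_latency:
        return FaultClass.LATENT_MPF
    return FaultClass.DETECTED_MPF
```

For example, a fault in a safety-related element that can violate the goal alone, with a mechanism in place that does not cover it, is returned as `FaultClass.RESIDUAL`.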
Also, note the difference between a latent fault and a latent defect. A latent defect is a reliability concern and refers to a defect that goes undetected during production testing and manifests itself during the operational life of the device. A latent fault is a safety issue and refers to a fault that does not cause harm by itself but can cause harm in the presence of another independent fault. A latent defect can fall into any fault category in the context of functional safety.
Safety analysis
Safety analysis methods could be used for architecting the SoC safety architecture, identifying system weak points and allocating safety mechanisms. Safety analysis can be initially conducted during the architecture design phase and can be later performed as a means to verify the robustness of the safety architecture implementation. We go over several common safety analysis methods used in automotive SoC development.
Fault tree analysis
FTA is a top-down (deductive) approach that starts with a failure effect to analyze all possible failure causes [12] . It uses the fault tree as a graphical representation of logic combinations of failures. Figure 3 shows an example fault tree for analyzing the failure of a microcontroller unit (MCU). It starts with a top event and breaks it down to basic events using logic gates such as AND and OR. For each safety goal, a fault tree can be drawn, and the top event is usually the hazard that leads to violation of the safety goal. In this example, the top event is that the MCU performs a wrong computation without indication. The top event at the MCU can be attributed to the logical combination of failures of subcomponents of the MCU. The failures of the subcomponents of the MCU, e.g., CORE0, can be further analyzed to trace the causes to their own subcomponents. A basic event is an event that cannot be further analyzed, and it terminates the fault tree at a leaf node. By tracing the causes of failures down to the basic events, safety architects can identify places in the architecture to allocate safety mechanisms to control the basic events. Therefore, FTA could serve as an effective vehicle to drive the development of safety architecture.
After construction of a fault tree, cut set analysis can be performed to identify if there is a single-point failure in the safety architecture [13] . Cut sets are the unique combinations of basic events that can cause the top event. Specifically, a cut set is said to be a minimal cut set if, when any basic event is removed from the set, the remaining events collectively are no longer a cut set. The order of a cut set refers to the number of basic events in the cut set. For example, in Figure 3 , EV1 and EV2 constitute a minimal cut set of order 2. A minimal cut set of order 1 would indicate a single-point failure in the safety architecture, which indicates that safety mechanisms should be added to control it. Although FTA is usually used as a qualitative analysis approach, there is also a quantitative version of FTA, which can calculate the probability of the top event [14] .
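Cut set analysis can be illustrated on a small tree. The following is a minimal sketch, assuming a fault tree encoded as nested `(gate, children)` tuples with basic events as strings. The tree and event names below are hypothetical and do not reproduce Figure 3.

```python
from itertools import product


def cut_sets(node):
    """Return all cut sets of a fault tree node as a set of frozensets."""
    if isinstance(node, str):  # basic event (leaf)
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":  # any child alone causes the event
        return set().union(*child_sets)
    if gate == "AND":  # one cut set from every child is needed
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate: {gate}")


def minimal_cut_sets(node):
    """Keep only the cut sets that have no proper subset that is a cut set."""
    sets = cut_sets(node)
    return {s for s in sets if not any(other < s for other in sets)}


# Hypothetical tree: the top event occurs if the core and its checker both
# fail, or if a clock glitch occurs (with or without a core fault).
TOP = ("OR", [
    ("AND", ["core_fault", "checker_fault"]),
    "clock_glitch",
    ("AND", ["clock_glitch", "core_fault"]),
])

MINIMAL = minimal_cut_sets(TOP)
SINGLE_POINT = [s for s in MINIMAL if len(s) == 1]
```

In this hypothetical tree, the minimal cut set of order 1 (`clock_glitch`) flags a single-point failure, while `core_fault` with `checker_fault` forms a minimal cut set of order 2.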
Failure modes and effects analysis
FMEA is a bottom-up (inductive) approach that focuses on individual parts of the system, how they can fail (failure modes), and the impact of these failures on the system (effects) [15] . FMEA could be complementary to FTA and could be used for cross examination.
Failure modes, effects, and diagnostic analysis
FMEDA is a systematic approach to identify and evaluate failure modes, effects, and diagnostic techniques, and to document the system. In FMEDA of a hardware element, the raw failure rates and failure modes are identified for each component of the element, as well as the failure mode distributions. Then, the failure effect, that is, whether the failure mode has the potential to violate the safety goal, is evaluated. Also, the safety mechanisms with regard to the failure modes and their diagnostic coverage are identified. Based on the aforementioned data, FMEDA quantifies the robustness of the safety architecture by hardware architectural metrics: single-point fault metric (SPFM) and latent fault metric (LFM).
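Suppose the raw failure rate of a safety-related hardware element is $\lambda$. From the fault categorization, we have

$$\lambda = \lambda_{\mathrm{SPF}} + \lambda_{\mathrm{RF}} + \lambda_{\mathrm{MPF}} + \lambda_{\mathrm{S}}$$

where $\lambda_{\mathrm{SPF}}$ is the failure rate associated with single-point faults, $\lambda_{\mathrm{RF}}$ is the failure rate associated with residual faults, $\lambda_{\mathrm{MPF}}$ is the failure rate associated with multiple-point faults, and $\lambda_{\mathrm{S}}$ is the failure rate associated with safe faults. The multiple-point fault rate splits further into a detected part and a latent part, $\lambda_{\mathrm{MPF}} = \lambda_{\mathrm{MPF,D}} + \lambda_{\mathrm{MPF,L}}$. SPFM is defined as

$$\mathrm{SPFM} = 1 - \frac{\sum_{\mathrm{HW,SR}} \left(\lambda_{\mathrm{SPF}} + \lambda_{\mathrm{RF}}\right)}{\sum_{\mathrm{HW,SR}} \lambda} = \frac{\sum_{\mathrm{HW,SR}} \left(\lambda_{\mathrm{MPF}} + \lambda_{\mathrm{S}}\right)}{\sum_{\mathrm{HW,SR}} \lambda}$$

where $\sum_{\mathrm{HW,SR}} \lambda_x$ is the sum of $\lambda_x$ of the safety-related hardware elements to be considered. SPFM quantifies the robustness of the hardware elements against single-point faults and residual faults either by coverage from safety mechanisms or by design (primarily safe faults). A high SPFM implies that the portion of single-point faults and residual faults in the hardware elements is low. Table 4 shows the target values of SPFM for ASIL B to ASIL D. LFM is defined as

$$\mathrm{LFM} = 1 - \frac{\sum_{\mathrm{HW,SR}} \lambda_{\mathrm{MPF,L}}}{\sum_{\mathrm{HW,SR}} \left(\lambda - \lambda_{\mathrm{SPF}} - \lambda_{\mathrm{RF}}\right)} = \frac{\sum_{\mathrm{HW,SR}} \left(\lambda_{\mathrm{MPF,D}} + \lambda_{\mathrm{S}}\right)}{\sum_{\mathrm{HW,SR}} \left(\lambda - \lambda_{\mathrm{SPF}} - \lambda_{\mathrm{RF}}\right)}$$

where $\sum_{\mathrm{HW,SR}} \lambda_x$ is the sum of $\lambda_x$ of the safety-related hardware elements to be considered. LFM quantifies the robustness of the hardware elements against latent faults by coverage from safety mechanisms or by design. A high LFM implies that the portion of latent faults in the hardware elements is low. Table 5 shows the target values of LFM for ASIL B to ASIL D. In addition to SPFM and LFM, the probabilistic metric for random hardware failures (PMHF) could be calculated to evaluate the overall probability of violating the safety goal for the safety architecture at the system level. This is to provide evidence that the residual risk of a safety goal violation due to random hardware failures is sufficiently low. Although PMHF is usually calculated at the system level, a PMHF budget that should not be exceeded is sometimes assumed at the SoC level and provided to system designers for information. PMHF can be calculated by using the failure rates from FMEDA or by quantitative FTA.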
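To make the two metrics concrete, the following minimal sketch computes SPFM and LFM from a list of FMEDA-like rows. It assumes a simplified bookkeeping in which every non-safe failure mode can violate the safety goal directly, a diagnostic coverage value removes part of it from the residual rate, and a second coverage value applies to the remaining multiple-point portion. The class, field names, and all numbers are hypothetical and serve only to exercise the formulas.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    rate_fit: float       # raw failure rate of the failure mode, in FIT
    safe_fraction: float  # fraction that cannot violate the safety goal
    dc_residual: float    # diagnostic coverage against direct violation
    dc_latent: float      # diagnostic coverage against latent faults


def hardware_metrics(modes):
    """Return (SPFM, LFM) for the given safety-related failure modes."""
    total = spf_rf = latent = 0.0
    for m in modes:
        dangerous = m.rate_fit * (1.0 - m.safe_fraction)
        residual = dangerous * (1.0 - m.dc_residual)  # lambda_SPF + lambda_RF
        multiple_point = dangerous - residual         # lambda_MPF
        total += m.rate_fit
        spf_rf += residual
        latent += multiple_point * (1.0 - m.dc_latent)  # lambda_MPF,L
    spfm = 1.0 - spf_rf / total
    lfm = 1.0 - latent / (total - spf_rf)
    return spfm, lfm


# Hypothetical FMEDA rows (values are illustrative only).
MODES = [
    FailureMode("sram_bit_error", 100.0, 0.5, 0.99, 0.90),
    FailureMode("core_logic_fault", 100.0, 0.0, 0.90, 1.00),
]
SPFM, LFM = hardware_metrics(MODES)
```

With these illustrative rows, the residual rate is 0.5 + 10 = 10.5 FIT out of 200 FIT, which gives an SPFM of about 94.75%, and the latent rate is 4.95 FIT out of the remaining 189.5 FIT, which gives an LFM of about 97.4%.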
Dependent failure analysis
Dependent failure analysis (DFA) includes the identification and analysis of possible common cause and cascading failures between given elements, the assessment of their risk of violating a safety goal (or derived safety requirements) and the definition of safety measures to mitigate such risks if necessary. It is intended to evaluate potential safety concept weaknesses and to provide evidence of the fulfillment of requirements concerning independence or freedom from interference.
The dependent failures initiator (DFI) represents the root cause of dependent failures in safety scope. DFA addresses these DFIs, which are not addressable in the standard safety analysis, in a qualitative way. The types of DFIs include the following:
• failure of shared resources
• single physical root cause
• environmental faults
• development faults
• manufacturing faults
• installation faults
• repair faults.
Different measures need to be in place to address different types of DFIs.
Safety design implementation
Automotive SoCs spread through a wide range of products, including MCUs or microprocessors (MPUs), radar front-end monolithic microwave integrated circuits (MMICs), power management integrated circuits (PMICs), system basis chips (SBCs), and various sensors. A wide range of safety mechanisms are used in automotive SoCs depending on the product architectures. In this section, we categorize the safety mechanisms based on how they work and what they are intended for.
Safety mechanisms categorized by error detection methods
Many safety mechanisms rely on the capabilities of detection of faults and errors. There are roughly three types of detection methods, i.e., redundancy, monitoring, and test.
Error detection by redundancy leverages redundant computation or storage to detect whether there is an error in the target function. Three different types of redundancies are commonly exploited.
• Hardware redundancy usually executes the same computation on different hardware modules so that if there is a fault in the functioning module that leads to an error, it is likely to be detected by comparing the results with those of the redundant module. Dual module redundancy (DMR) is commonly used in automotive SoCs, where two processing units execute the same computation task at the same time (in lockstep mode) and their results are compared by a checker module. DMR allows error detection but does not correct errors by itself. The error handling is often done by other parts of the system. Triple module redundancy (TMR) allows both error detection and error correction at an extra cost. TMR is mostly used for SoC-level critical registers, e.g., registers storing trim values, implemented by triple voting flip-flops (TVFs).
• Information redundancy leverages the redundancy of information encoding to detect and sometimes correct the errors. Examples include error correction code (ECC) and parity check [16] . Such safety mechanisms are usually used to protect data communication channels and memory.
• Time redundancy executes the same computation repeatedly, possibly in the same hardware but using different algorithms. By repeating the computation, it is likely to detect if there is a soft error in the computation. In cases where diverse algorithms are used in the computation, it is even possible to detect permanent faults in the hardware as different algorithms might exercise different parts of the same hardware element.
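The detection principles in the list above can be sketched in a few lines. This is a behavioral illustration only, assuming integer words stand in for hardware signals; the function names are hypothetical and do not correspond to any specific SoC implementation.

```python
def dmr_mismatch(main_out: int, redundant_out: int) -> bool:
    """DMR checker: True when the two lockstep results disagree."""
    return main_out != redundant_out


def tmr_vote(a: int, b: int, c: int) -> int:
    """TMR voter: bitwise majority over three copies of a register value."""
    return (a & b) | (a & c) | (b & c)


def even_parity_bit(word: int) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def parity_error(word: int, stored_parity: int) -> bool:
    """Information redundancy: True when the word no longer matches its parity."""
    return even_parity_bit(word) != stored_parity


GOLDEN = 0b1011_0010
FLIPPED = GOLDEN ^ (1 << 3)  # a single bit flip, e.g. caused by an SEU
```

A single flipped bit is detected by the DMR comparison and by parity, and is corrected by the TMR vote when the other two copies are intact. A double bit flip, however, leaves the parity unchanged and escapes a single parity bit, which is one reason ECC is used where stronger coverage is required.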
Note that redundancy is often used in combination with diversity to avoid common cause failure. The diversity can be in time, algorithms, or physical implementation. Repeated execution of the same computation task with different algorithms is an example of diversity in algorithms. In DMR lockstep configuration, when the processing unit is duplicated, the physical layout of the redundant module is usually rotated for physical diversity. Also, in DMR, a delayed lockstep configuration is often pursued for diversity in time. Figure 4 shows a simplified block diagram of the delayed lockstep configuration of DMR. The input data feeds into the main processing unit directly while it is delayed one or two clock cycles before being fed into the redundant processing unit. The output of the main processing unit serves as the output to the system. It is also branched out to go through the delay block before being compared with the output from the redundant processing unit. The amounts of the delays from the two delay blocks should be the same. This configuration protects against common cause failures caused by clock glitches.
Another error detection approach is to continuously or periodically monitor the critical parts or parameters for any anomaly. Monitoring usually assumes that the parts or parameters should behave within a preassumed normal range and flags the behavior that is not in the range. Examples of monitoring include monitoring of supply voltages, currents, clocks, or bus protocol interfaces. Another example is the software watchdog to monitor whether a processor unit hangs.
Test detects faults by running test patterns or programs and comparing the results with precomputed results. The main difference between test and monitoring is that monitoring is usually performed in parallel with the workload execution of the element. However, test is usually run in a test mode where the element is not actively executing workloads. Therefore, test could be destructive of the execution context compared with monitoring. Examples of test as safety mechanisms include logic built-in self-test (LBIST) [17] and memory built-in self-test (MBIST) [18] , which are usually run when the SoC is starting or shutting down. Another example is the loopback test in radar MMICs for testing the communication datapath connectivity [19] .
Safety mechanisms categorized by targeted faults
Safety mechanisms can also be categorized by the types of faults they target. For example, safety mechanisms can be categorized based on whether they are intended to protect against single-point faults or latent faults. The safety mechanisms for single-point faults are the primary safety mechanisms as single-point faults can directly violate the safety goals by themselves. The safety mechanisms based on error detection by redundancy and continuous monitoring often fall into this category. Latent faults are often protected against by test-based safety mechanisms, e.g., LBIST and MBIST.
Test-based safety mechanisms are not always for detecting latent faults though. BIST mechanisms often cannot be used to protect against single-point faults because the duration of BIST is usually longer than the FTTI and, due to their context-destructive nature, they usually cannot be run when the chip is executing workloads. Although this is usually true for MCUs, it is not always true for radar front-end MMICs, thanks to the characteristics of the radar working cycle. In some radar applications, each radar working cycle is divided into a chirping period and an RF-silence period. The MMIC transmits and receives data in the chirping period and is idle in the RF-silence period. Therefore, tests are usually run in the RF-silence period.
In some radar applications, the FTTI is considered to be one to two radar working cycles, which means that the test duration can fit within the FTTI. Moreover, MMICs perform functions as transmitters and receivers of radar waves and then send the captured data to the MCU for processing. The bookkeeping of radar data often occurs on the MCU side. Therefore, the MMICs often can be considered as stateless and the test can be considered nonintrusive to the radar functions.
Another categorization can be based on whether the safety mechanisms are intended to protect against permanent faults or transient faults, though some can protect against both. For example, ECC can detect errors caused by both permanent faults and transient faults. Test mechanisms, usually, can only be used to detect permanent faults as the effects of transient faults might already disappear when the test is running.
Safety verification
Safety verification makes sure the safety requirements are met by the implementation of the safety architecture. Note that the terms verification and validation as defined in ISO 26262 differ from how the semiconductor design community uses them. In ISO 26262, methods of verification include review, walkthrough, inspection, simulation, formal verification, engineering analysis, and so on. Safety validation specifically refers to validation at the vehicle level. In this article, we focus on the presilicon verification (by simulation and formal methods) of the design models regarding the safety aspects of automotive SoCs.
One effective vehicle of functional safety verification is the fault-injection campaign (often called fault injection or fault campaign for short). It injects faults into the design model and observes the fault effects and fault reactions by safety mechanisms at observable locations. The results of fault-injection simulation can be used as evidence to support the diagnostic coverage of safety mechanisms claimed in FMEDA. The goals of fault injections include the following:
• confirming the diagnostic coverage of safety mechanisms
• confirming the diagnostic time interval and reaction time
• confirming the fault effects.
To set up fault-injection campaigns, the following collateral is needed:
• Design model: It could be at the register-transfer level/gate level, or even higher levels (e.g., SystemC).
• Fault sites and fault models: The list of faults can be randomly selected or come from identified critical failure modes.
• Functional stimulus: It should be representative of the workload or use case.
• Observation points: The points where the fault effects and diagnostics should be observed.
Based on the simulation results from fault-injection campaigns combined with expert judgment, the faults can be classified according to the categorization criterion in the "Faults in context of functional safety" section. Based on the fault categorization and diagnostic coverage, FMEDA reports can be updated with more accurate information from the detailed design implementation. We will discuss some of the technical challenges with fault injection in the next section.
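The ingredients listed above (design model, fault list, stimulus, and observation points) can be illustrated with a toy campaign. This is a minimal behavioral sketch, not an RTL or gate-level flow: it assumes a 4-bit storage element protected by one parity bit, followed by an unprotected output stage, and injects stuck-at faults at both sites. All names, the stimulus, and the three result labels are hypothetical.

```python
from collections import Counter

WIDTH = 4  # data bits; bit index WIDTH of the stored word is the parity bit
STIMULUS = [0b0000, 0b0101, 0b0011, 0b0110]  # hypothetical workload


def parity(word: int) -> int:
    return bin(word).count("1") % 2


def force_bit(word: int, bit: int, value: int) -> int:
    """Model a stuck-at fault by forcing one bit to a constant value."""
    return word | (1 << bit) if value else word & ~(1 << bit)


def simulate(stimulus, fault=None):
    """Return [(output_word, alarm)] per cycle; fault = (site, bit, stuck_value)."""
    trace = []
    for word in stimulus:
        stored = word | (parity(word) << WIDTH)       # write data plus parity
        if fault and fault[0] == "mem":
            stored = force_bit(stored, fault[1], fault[2])
        data = stored & ((1 << WIDTH) - 1)
        alarm = parity(data) != (stored >> WIDTH)     # parity checker
        if fault and fault[0] == "out":               # site after the checker
            data = force_bit(data, fault[1], fault[2])
        trace.append((data, alarm))
    return trace


def classify(stimulus, fault):
    """Compare the faulty run against the golden run at the observation points."""
    golden = simulate(stimulus)
    faulty = simulate(stimulus, fault)
    alarms_when_corrupted = [
        f_alarm for (f_out, f_alarm), (g_out, _) in zip(faulty, golden)
        if f_out != g_out
    ]
    if not alarms_when_corrupted:
        return "no_effect"  # output never corrupted under this stimulus
    return "detected" if all(alarms_when_corrupted) else "undetected"


def campaign(stimulus):
    faults = [("mem", b, v) for b in range(WIDTH + 1) for v in (0, 1)]
    faults += [("out", b, v) for b in range(WIDTH) for v in (0, 1)]
    counts = Counter(classify(stimulus, f) for f in faults)
    dangerous = counts["detected"] + counts["undetected"]
    coverage = counts["detected"] / dangerous if dangerous else 1.0
    return counts, coverage
```

Calling `campaign(STIMULUS)` shows two effects discussed in the text: faults in the unprotected output stage corrupt data without raising the alarm, and faults on bits that the stimulus never exercises show no effect, which is why the stimulus should be representative of the real workload.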
Current and emerging challenges
In this section, we identify several challenges in the design and verification of automotive SoCs to achieve functional safety for current and next-generation products.
Tradeoff between safety and PPA
Functional safety is achieved at a cost. The implementation of safety mechanisms inevitably brings along overhead in performance, power, and area (PPA). For example, DMR increases the area of the module by more than 100%. Running LBIST for the entire chip consumes excessive power. The existing safety analyses focus on faults and diagnostic coverage and do not account for the design overhead in a quantitative way. Safety analyses and design space exploration regarding PPA are often conducted as separate processes. Although SoC architects and safety architects are aware of the tradeoff between functional safety and PPA, it remains challenging to analyze this tradeoff systematically to find the optimal architecture. Such analyses can start from the very early stage of architecture definition and be iterated through the design stages. The parameters in conventional design space exploration are already enormous, and adding an extra dimension of functional safety will make the problem even more challenging.
Challenges with fault-injection campaigns
The fault-injection campaign is an effective vehicle for verifying the effectiveness of the safety mechanisms and confirming the diagnostic coverage claimed in safety analysis. There are several major challenges with fault-injection campaigns today.
The fault universe is inherently enormous for modern automotive SoCs. If we use the low-level stuck-at fault models, given the size of today's automotive SoCs, there could be millions of faults in a reasonably sized IP block. With consideration of transient faults, the extra dimension of time makes the fault space even more intractable. Sometimes, tens or hundreds of tests are simulated to make sure the functional context of fault injection is representative of SoCs' real workloads, which makes it more computationally prohibitive.
Practically, manual selection and statistical sampling have been used to reduce the fault space. The limitation of manual selection is that it often requires expert judgment and deep know-how of the specific design. It is also challenging to figure out the probability distribution for manually selected faults for calculating diagnostic coverage. An often used statistical sampling method is sampling based on confidence levels and confidence intervals. The limitation is that it often gives a quite conservative bound when the confidence level is high and the confidence interval is narrow, and therefore, the sample size could be still quite large.
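The sample-size issue can be made concrete with a standard finite-population formula for estimating a proportion, $n = N / \left(1 + e^2 (N-1) / (t^2 p (1-p))\right)$, where $N$ is the size of the fault universe, $e$ is the margin of error, $t$ is the normal quantile for the chosen confidence level, and $p$ is the estimated proportion (0.5 is the conservative choice). The sketch below assumes this formula is the sampling rule in use, which the text does not specify; the function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist


def fault_sample_size(population: int, margin: float, confidence: float,
                      p: float = 0.5) -> int:
    """Number of faults to inject for a given margin of error and confidence."""
    # Two-sided normal quantile, e.g. about 1.96 for 95% confidence.
    t = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    n = population / (1.0 + margin**2 * (population - 1) / (t**2 * p * (1.0 - p)))
    return math.ceil(n)


# Hypothetical fault universe of one million faults.
MODERATE = fault_sample_size(1_000_000, margin=0.01, confidence=0.95)
STRICT = fault_sample_size(1_000_000, margin=0.001, confidence=0.99)
```

Under these assumptions, a 1% margin at 95% confidence needs on the order of ten thousand injections, whereas a 0.1% margin at 99% confidence needs more than half of the million faults, which illustrates why the sample can remain large when the requirements are tight.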
For digital circuits, research on fault simulation techniques has been conducted for at least three decades, and advanced algorithms are commercially available to speed up simulation by simulating thousands of faults concurrently [20] . However, for analog fault simulation, it is still very challenging, if not impossible, to simulate the faults concurrently. There have been works on using sensitivity analysis to speed up the simulation in certain scenarios [21] . However, the general use of concurrent fault simulation in the analog domain is still an open question. For analog and mixed-signal circuits, the current simulation technologies focus on low-level fault models, which limits their use for simulation at a larger scope. Fault modeling at the right level of abstraction can be very helpful in enabling the fault simulation of analog and mixed-signal SoCs.
Formal verification methods have also been brought into fault injection in order to analyze the propagation and detection of faults. There are several limitations. The scalability of formal methods inherently prevents the application to large designs. In addition, if the environmental constraints of the design are not formulated properly, formal methods would often find unrealistic cases, which can become a black hole for engineering debug time.
Diagnostic coverage of test pattern based safety mechanisms
Nowadays, test pattern based safety mechanisms are getting more traction as they require fewer hardware resources and are more flexible. However, the development efforts and complexities of such test patterns can be challenging.
LBIST, traditionally a common test pattern based safety mechanism, shows several limitations for functional safety today. LBIST is intended to protect against latent faults. However, the construction of LBIST patterns is aimed at structural fault coverage and does not consider fault categorization in the context of functional safety. The PPA overhead of LBIST is huge. The usage of the scan chain makes the test time long, and thus, it is often challenging to meet customer requirements. Techniques to reduce test time would often result in excessive power consumption as they tend to increase simultaneous activities on the chip. Therefore, industry has been moving toward self-test mechanisms using functional test patterns as they are more flexible and lightweight and can be developed to target faults of interest. Efforts have been spent on developing software test libraries, which enable the user to run tests not only at reset but also while the application is idle.
In spite of the obvious benefits, development of functional test patterns can be technically challenging. Functional test patterns cannot leverage design-for-testability (DFT) features such as scan chain and, thus, can be limited with regard to the controllability and observability of faults. Due to the lack of functional test generation tools, high engineering efforts could be spent on manually drafting the tests to achieve the desired diagnostic coverage. Research on DFT techniques to help with functional test generation is highly in demand for addressing such challenges.
Safety mechanisms for emerging accelerators
Accelerators occupy an increasing portion of the chip real estate as domain-specific computing has become pervasive. In the automotive space, new accelerators are being designed for vision processing, processing of radar and lidar data, and deep neural network (DNN) inference. Unlike general-purpose components on an SoC, such as CPU, fabrics, and memory, accelerators are designed for computational tasks in a specific domain. Simplistic adoption of traditional safety mechanisms such as DMR to accelerators might not be effective and cost-efficient.
To effectively design the safety mechanisms for emerging accelerators, it will be beneficial to take advantage of the domain-specific nature of accelerators. It calls for thinking of hardware and software altogether and deep understanding of system-level safety mechanisms. Designing effective safety mechanisms for domain-specific accelerators is an open research area and innovations in this area are highly sought after.
Challenges with reaching fail-operational
The trend toward autonomous driving systems calls for future automotive E/E systems to be fail-operational. This requirement might also be passed down to the automotive SoCs, which means the SoCs must be able to continue operating normally or in a degraded mode after a fault. Intuitively, this could often be achieved by redundant computation resources. However, as the automotive market is still a cost-sensitive one, the implication is that such fail-operational behavior should be achieved without incurring unacceptable resource increase.
Virtualization is one possible direction to pursue as it can provide high availability to achieve fail-operational behavior. It requires effective fault localization in hardware and handling the errors at the software levels. Although virtualization has proved successful in cloud computing, there are many open questions in applying it to automotive embedded applications.
Recent research highlights
This section highlights some recently proposed innovations and research works of enabling functional safety for automotive SoCs toward the era of autonomous driving. It is, by no means, a comprehensive review but the intent is to give the readers a flavor of some interesting works in this area.
Enabling application-specific safety mechanisms
As the violation of safety goals is closely related to the function of the item, the awareness of the function will make the safety mechanisms most effective for detecting and controlling faults at the system level. The challenge for automotive SoC development is that the details of the application are not visible, especially if it is an MCU developed as an SEooC. Therefore, it is highly desirable to have configurable and extensible mechanisms in SoCs that can be provided to Tier 1 suppliers to implement protection specific to the application.
A notable recent innovation is the Safety by Software concept, implemented by configurable safety mechanisms such as time-monitored comparator (TMC) and timed multiple-watchdog processor (TMWDP) [22] . TMC works in the context of software lockstep, where the same computational task is performed by two threads of software, most likely with different implementations. One software thread could be accurate and resource consuming and the other thread could produce less accurate results within a range by using fewer resources. Traditionally, the software lockstep would require that two software threads synchronize and compare their results periodically before they can advance, which can hurt performance. TMC improves the performance by using a dedicated hardware monitor to compare the results produced by two software threads within a controlled time interval. Therefore, it makes sure that the results from the two threads are comparable and that the two threads advance within a limited time interval of each other.
TMWDP protects the integrity of the control flow of the application software. The assumption is that the system application developer would know the high-level control flow of the application software. The timed watchdog is a timed state machine translated from the control flow. It checks for incorrect state sequence and incorrect timing of a state sequence and starvation (application stays in a state for too long).
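The three checks named above (state sequence, timing of a transition, and starvation) can be illustrated with a small timed state machine. This is a behavioral sketch under stated assumptions and is not the TMWDP implementation from [22]: the class, its method names, the time units, and the example control flow are all hypothetical.

```python
class TimedWatchdog:
    """Timed state machine that checks an application's control flow."""

    def __init__(self, transitions, initial):
        # transitions maps (from_state, to_state) to an allowed
        # (min_time, max_time) window spent in from_state.
        self.transitions = transitions
        self.state = initial
        self.entered_at = 0

    def checkpoint(self, new_state, now):
        """Application reports entering new_state; return an error or None."""
        window = self.transitions.get((self.state, new_state))
        if window is None:
            return "incorrect state sequence"
        min_time, max_time = window
        elapsed = now - self.entered_at
        if elapsed < min_time or elapsed > max_time:
            return "incorrect timing"
        self.state, self.entered_at = new_state, now
        return None

    def poll(self, now):
        """Periodic check by the monitor; detects a stuck application."""
        deadlines = [hi for (src, _), (_, hi) in self.transitions.items()
                     if src == self.state]
        if deadlines and now - self.entered_at > max(deadlines):
            return "starvation"
        return None


# Hypothetical control flow: READ -> COMPUTE -> WRITE -> READ ...
FLOW = {
    ("READ", "COMPUTE"): (1, 5),
    ("COMPUTE", "WRITE"): (2, 10),
    ("WRITE", "READ"): (1, 5),
}
```

In this sketch the application calls `checkpoint` at each state change, while `poll` is driven by an independent timer so that a hung application is still detected.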
Safety of deep neural networks
With the recent breakthrough in deep learning applications in computer vision, DNNs have been gaining traction on the road from ADAS to autonomous driving. Tremendous research and development efforts have been spent on exploring and deploying DNNs for perception tasks such as pedestrian detection, vehicle tracking, road sign classification, and distance detection. Some have even experimented with using DNNs for end-to-end autonomous driving [23] . Dedicated accelerators have been developed to enable the deployment of DNNs for real-time applications. However, as always, safety concerns for such applications and special accelerators have been a critical question.
The use of DNNs consists of two phases: training and inference. Training refers to the process of finding the optimal model that fits the training samples without losing generality. The model is essentially the whole set of weights associated with neurons in the DNN architecture. The model can be stored on chip and loaded in the application to make prediction of a new data sample, which is referred to as inference.
The safety of DNNs comprises two aspects: safety of the intended function and functional safety. Safety of the intended function concerns the question: what is the safety impact if my DNN model classifies a stop sign as a speed limit sign and how can I mitigate that? Functional safety concerns the question: what is the safety impact if there is a defect in my DNN accelerator that changed its supposed behavior [24] ? Faults can occur during training or inference. As training is often done offline (as part of development), the faults in the training phase can be controlled by thorough verification. The failure effects and mitigation of random hardware faults during the inference phase are of major interest for research.
Recent works have explored the error propagation of faults in modern neural networks and proposed safety measures based on learning from experiments [25] - [27] . These works focus on studying the safety impact caused by architecture and design parameters of the inference engines. Alternatively, the safety impact of training hyperparameters on the inference can also be explored. More specifically, dropout is a recent regularization technique in DNN training to avoid overfitting [28] . The idea of dropout is to disconnect a certain portion of connections between the neural network layers during training so that activation does not rely on only a few neurons. Dropout increases the information redundancy in the model and thus can potentially make the inference more fault resilient. It will be interesting to explore the implications quantitatively.
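Why a single random hardware fault can matter during inference can be seen from the encoding of a weight. The sketch below assumes weights stored as IEEE 754 single-precision values and models a transient fault as one flipped bit; the helper name and the example values are hypothetical, and the sketch does not reproduce the experiments in [25] - [27].

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of a float32 encoding of value."""
    (encoded,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", encoded ^ (1 << bit)))
    return flipped


def dot(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))


# Hypothetical neuron with three weights and inputs.
WEIGHTS = [0.5, -0.25, 0.125]
INPUTS = [1.0, 2.0, 4.0]

GOLDEN = dot(WEIGHTS, INPUTS)
LOW_BIT = dot([flip_bit(WEIGHTS[0], 0)] + WEIGHTS[1:], INPUTS)   # mantissa LSB
HIGH_BIT = dot([flip_bit(WEIGHTS[0], 30)] + WEIGHTS[1:], INPUTS)  # exponent MSB
```

A flip in the least significant mantissa bit changes the result negligibly, while a flip in the most significant exponent bit turns the weight 0.5 into 2^127 and dominates the sum. This asymmetry is one reason the fault effects depend strongly on data types and on where in the network the fault occurs.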
