This paper is part of a project aiming to develop supervision and monitoring devices for embedded systems in airplanes and vehicles. It focuses on the reliability of these systems and establishes a monitoring framework to detect drifts and faults in the behavior of the heterogeneous central processing unit (CPU) and graphics processing unit (GPU) chips powering them. In this work, we use a previously developed incremental model of these chips and associate it with a fault detection algorithm. Estimations from the model constitute the inputs to the diagnosis module. The latter generates alarms in the presence of faults or drifts in the characteristics and features of the System-on-Chip (SoC). The obtained results validate the proposed monitoring algorithm and demonstrate the effectiveness of the fault detection algorithm.
INTRODUCTION
The study of the reliability of airborne or wheeled transportation machinery has focused mainly on the moving parts of the vehicle. With the integration of new smart technologies and the move towards electric energy to power these vehicles, diagnosis and reliability studies needed to be adapted to take into account the multi-engine setups, batteries, and hybrid system control that came along. Moreover, the development of onboard driving assistance devices and autonomous vehicles has motivated the scientific community to develop monitoring algorithms for embedded electronic systems. These Systems-on-Chips (SoCs) are generally embedded in a complex environment, with cycles of heating and cooling related to the operation of the engines, as well as highly variable vibratory conditions, which might cause accelerated aging of these devices compared to the average lifetimes announced by the manufacturers. The purpose of this paper is to develop and test a monitoring scheme for the main components of the SoCs embedded in safety-critical systems, i.e., central processing units (CPUs) and graphics processing units (GPUs).
A majority of the existing works in this field are based on causal models, like the directed graph by Zhang (2005) or the fault tree by Wang et al. (2011), for instance. In such models, for the system to function properly, all of its components have to be fully and correctly operational (Steininger, 2000). However, an enhanced dependency model presented by Cui et al. avoids the shortcomings of the existing works, mainly by allowing multiple faults to be disregarded and eliminated. It does so by including, in its dependency graphical model (DGM), symbol switches that represent mechanisms to disconnect one part from the main body of the model.
This work is a part of the MMCD project supported and funded by the Banque Publique d'Investissement (BPI), to whom we address our thanks.
The review published by Gizopoulos et al. (2011) is a thorough study of online error detection techniques for multicore processors. These approaches are classified into four main categories: redundant execution (Aggarwal and Ranganathan, 2007; LaFrieda et al., 2007; Mukherjee et al., 2002), periodic Built-In Self-Test (BIST) approaches (Shyam et al., 2006), dynamic verification approaches (Austin, 1999; Meixner et al., 2007), and anomaly detection approaches (Wang and Patel, 2006; Li et al., 2008). One of the main conclusions of this review is the success of dynamic verification approaches in detecting transient faults, permanent faults, and design bugs. A more general overview of diagnosis and fault-tolerant techniques can be found in the extensive survey by Gao et al. (2015a,b).
This paper is a follow-up to the work we presented in Djedidi et al. (2017), in which we proposed and validated an incremental interconnected model that describes the dynamics of the frequencies, the voltages, the temperatures, and the power consumption of an ARM-based SoC, the processor type most used in autonomous vehicles. In this work, we use this model and its outputs as inputs to the monitoring algorithms for the early detection of faults and drifts in the system.
By definition, the role of the monitoring subsystem is to flag errors and irregularities found in the surveilled variables and features. It also exploits the incremental structure of the model and the specialized nature of each subsystem (each one estimates only one variable) to detect and isolate faulty components.
The main advantage of the monitoring algorithm proposed hereafter is its reliance on data already provided by the system itself. Thus, it can be deployed on all current and forthcoming SoCs, after model training. Once it is running, one can easily follow its fault indicators to monitor the state of the device, intercept errors, investigate the effect of these errors, and even view wear traits for predictive maintenance planning and remaining useful life calculations.
In the next section, the general proposed monitoring approach is presented. In section 3, we explain the fault detection and isolation (FDI) algorithm, detailing residual generation and evaluation, and then illustrating the decision-making process. Section 4 is dedicated to the presentation and discussion of the obtained experimental results, where we validate the FDI algorithm by analyzing residuals in normal and faulty scenarios. Finally, the last section is a conclusion highlighting the results of this paper.
GENERAL METHODOLOGY
In the introduction, we mentioned monitoring and FDI methods that rely upon, amongst others, built-in tests, redundancy, or verification. The method described hereafter is complementary: it monitors the system to provide early detection of drifts in its functions caused by wear and over-solicitation. Moreover, analysis of these drift phenomena may allow, in addition to the early detection of faults, the study of the life cycle of the system and of the factors accelerating its wear.
Fig. 1 is a diagram of the monitoring algorithm we applied to a heterogeneous SoC. In this algorithm, the universal inputs for both the system and the estimation model are the CPU and GPU loads and the Memory (RAM) Occupation Rate (MOR). The load is defined as the relative busy time of the processor during a sampling period, in percent. As for the MOR, we define it as the ratio of the occupied RAM to its full size. These inputs allowed us to construct an estimation model (see Djedidi et al. (2017)). This model has a modular structure, as a set of interconnected subsystems. In the first set of modules, frequencies per core and, consequently, voltages per core are calculated according to the present computational load. They are then applied to the second set of modules, in which they are used to estimate the power consumption and temperature of the SoC. The detailed modeling process of each of the subsystems, along with their validation and the advantages of this model, is presented by Djedidi et al. (2017).
The aforementioned variables characterize the operating state of the SoC and are used as inputs to the monitoring algorithm (cf. Fig. 1). The algorithm is based on analytical redundancy. In this technique, outputs from the system are compared to those from the estimation model, which in this case is called a reference model. During normal operation, outputs from the system are equal, within a margin of error, to those from the model. If a fault or an error occurs, these outputs diverge from each other. The differences between the two sets of outputs are fault indicators commonly known as residuals. The latter are then processed with signal processing and probabilistic techniques in order to avoid erroneous decisions due to modeling uncertainties.
Although analytical redundancy has been widely used for fault diagnosis in systems without software components, to our knowledge, it has not previously been used to monitor systems with both hardware and software components.
Finally, thanks to the modular structure of the proposed estimation model, each generated residual is associated with clearly identified modules, allowing an easy isolation of faulty subsystems.
RESIDUALS PROCESSING
The monitoring method in this work relies on the processing of residuals, which consists of their generation and evaluation.
Residuals generation
Raw residuals are generated from the difference between measured values and reference values from the model (cf. Fig. 1), and are computed as shown in Equation 1.
Here, $i$ specifies the GPU or the core number of the CPU, while the subscripts estm and meas denote model estimations and measured values, respectively. These residuals are the dimensionless estimation errors and are the appropriate choice for the detection of deviations in the functioning of the system, especially progressive ones, which are the main indicator of degradation.
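As an illustration of residual generation, the following minimal Python sketch computes one dimensionless residual per unit. Since Equation 1 is not reproduced in this text, the normalization by the measured value is an assumption, and the names `raw_residual`, `residuals_per_unit`, `f_meas` and `f_estm`, as well as the numerical values, are hypothetical.

```python
def raw_residual(measured: float, estimated: float, eps: float = 1e-12) -> float:
    """Dimensionless estimation error.

    Assumed form: (measured - estimated) / measured. The exact expression
    of Equation 1 is not given here, so this normalization is a guess.
    """
    if abs(measured) < eps:
        # Avoid dividing by zero; fall back to the plain difference.
        return measured - estimated
    return (measured - estimated) / measured


def residuals_per_unit(measured: dict, estimated: dict) -> dict:
    """One residual per unit i (a CPU core index or the GPU)."""
    return {unit: raw_residual(measured[unit], estimated[unit]) for unit in measured}


# Hypothetical frequency readings in MHz for two CPU cores and the GPU.
f_meas = {0: 1500.0, 1: 1200.0, "gpu": 400.0}
f_estm = {0: 1500.0, 1: 1500.0, "gpu": 400.0}
r_f = residuals_per_unit(f_meas, f_estm)
```

In this toy case only core 1 yields a non-zero residual, because its measured frequency differs from the estimate.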
Residuals evaluation
During normal operation, estimations are theoretically equal to measurements, and residuals are equal to zero. In this case, non-zero residuals would only occur if the outputs of the system deviate from those of the reference model, that is, when an unforeseen event or a problem occurs. On a real system, however, non-zero residuals also occur due to estimation errors. Hence, to avoid false flags, residuals also undergo an evaluation process.
Since the residuals originate from measurements and estimations, we use signal-based approaches to evaluate them. Signal-based methods ignore the origin of the signal (in this case, the residuals) and instead use its statistical and probabilistic properties in normal operation as a reference to make a decision when an abnormal behavior occurs (generating an alarm, for instance) (Djeziri et al.).
Evaluation of the frequency and voltage residuals
These residuals are mainly equal to zero, except for spikes that are due to the estimation lag (Fig. 2a and Fig. 3a). To avoid the false alarms caused by this lag, we set a maximum tolerated delay value $\tau_{\mathrm{ref}}$ from the data, and use it to calculate the normalized residuals $R_f$ and $R_V$ (Equation 2).
Here, $x = \{f, v\}$ stands for either the frequency or the voltage, and $\tau_d^x$ is the value of the current measured delay.
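The delay-based evaluation can be sketched as follows. Equation 2 is not reproduced in this text, so the binary rule used here (an alarm is raised only while a non-zero residual has persisted longer than the tolerated delay) is an assumption, chosen to be consistent with normalized residuals that rise from 0 to 1. The function name, the `tol` parameter and the sample data are hypothetical.

```python
def delay_normalized_residual(raw, sampling_period, tau_ref, tol=0.0):
    """Return a 0/1 sequence: 1 while a non-zero run has lasted longer than tau_ref.

    raw             -- raw residual samples (mostly zero, with spikes)
    sampling_period -- time between two samples, in seconds
    tau_ref         -- maximum tolerated delay, in seconds
    tol             -- magnitude below which a residual counts as zero
    """
    normalized = []
    run_length = 0  # number of consecutive non-zero samples so far
    for r in raw:
        run_length = run_length + 1 if abs(r) > tol else 0
        tau_d = run_length * sampling_period  # current measured delay
        normalized.append(1 if tau_d > tau_ref else 0)
    return normalized


# Hypothetical data: a two-sample lag spike, then a persistent deviation.
raw = [0, 0.2, 0.2, 0, 0, 0.3, 0.3, 0.3, 0.3, 0.3]
R = delay_normalized_residual(raw, sampling_period=0.1, tau_ref=0.25)
```

The short spike never exceeds the tolerated delay and is ignored, whereas the persistent deviation triggers the alarm from its third sample onward.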
Evaluation of the power and temperature residuals
Fig. 4a and Fig. 5a clearly display high-frequency noise in $r_P$ and $r_T$. The average of this noise is close to zero, which indicates that it is mainly due to estimation errors, considering that drifts in characteristics manifest mainly in the average value of the signal. Hence, rather than the raw residuals, the moving average of the residuals is used to generate alarms. The moving average is the mean of the signal value in a window of $n$ samples. Furthermore, for a normally distributed signal, about 99.7% of the population lies within an envelope bounded by two thresholds, namely the mean ($\mu$) plus and minus three times the standard deviation ($\sigma$). Hence, the power residuals become (the averaged temperature residuals and thresholds are obtained using the same formula):
The averaged residuals are then normalized into $R_P$ and $R_T$ as follows:
$x = \{P, T\}$ stands for the power or the temperature.
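The averaging and thresholding steps can be illustrated with the sketch below, which uses only the Python standard library. Several choices are assumptions, as the corresponding equations are not reproduced in this text: the window is a trailing one, the thresholds are learned from raw residuals recorded in normal operation, and the normalized residual is binary (1 outside the envelope, 0 inside). All function names and data are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def moving_average(signal, n):
    """Trailing moving average over the last n samples (shorter at the start)."""
    window = deque(maxlen=n)
    averaged = []
    for value in signal:
        window.append(value)
        averaged.append(sum(window) / len(window))
    return averaged


def thresholds(normal_residuals, k=3.0):
    """Envelope mu -/+ k*sigma learned from residuals in normal operation."""
    mu = mean(normal_residuals)
    sigma = pstdev(normal_residuals)
    return mu - k * sigma, mu + k * sigma


def normalize(averaged, low, high):
    """Binary normalized residual: 1 outside the envelope, 0 inside."""
    return [0 if low <= r <= high else 1 for r in averaged]


# Hypothetical zero-mean noise in normal operation, then a drift of +0.1.
normal = [0.02, -0.02] * 20
drifting = normal + [r + 0.1 for r in normal]

low, high = thresholds(normal)
R_P = normalize(moving_average(drifting, n=4), low, high)
```

With these toy values, the alarm stays at 0 during the noisy but healthy segment and switches to 1 a few samples after the drift begins, once the averaged residual leaves the envelope.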
Isolation of faults
In this work, fault isolation is a direct consequence of the nature of the model. Indeed, since each subsystem produces only one estimate, a fault is first reported by the alarms of the faulty component's subsystem.
After the algorithm detects a fault, the fault can either be isolated or left to propagate so that its effect can be analyzed. For the purpose of this work, faults are isolated by replacing the outputs of the faulty subsystem's model with direct readings, which allows the rest of the subsystems in the model to continue generating outputs that match the measurements.
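The isolation mechanism described above can be sketched on a toy chain of subsystems. The three linear models, their coefficients, and the names `step_with_isolation`, `models`, `measured` and `alarms` are hypothetical and serve only to show how a direct reading bypasses the estimate of a flagged subsystem. The actual model is interconnected rather than a simple chain.

```python
def step_with_isolation(load, measured, models, alarms):
    """Run an ordered chain of subsystem models for one sample.

    Each model consumes the output of the previous subsystem. When a
    subsystem is flagged in `alarms`, downstream models receive its
    measured value instead of its estimate.
    """
    estimates = {}
    upstream = load
    for name, model in models:
        estimates[name] = model(upstream)
        # Isolation: bypass the faulty subsystem's estimate with the reading.
        upstream = measured[name] if alarms.get(name, 0) else estimates[name]
    return estimates


# Toy linear models (illustrative only): load -> frequency -> voltage -> power.
models = [
    ("frequency", lambda load: 20.0 * load),  # MHz
    ("voltage", lambda f: f / 1000.0),        # V
    ("power", lambda v: 2.0 * v),             # W
]
measured = {"frequency": 800.0, "voltage": 0.8, "power": 1.6}

without_isolation = step_with_isolation(50.0, measured, models, alarms={})
with_isolation = step_with_isolation(50.0, measured, models, alarms={"frequency": 1})
```

Without isolation, the frequency error propagates to the voltage and power estimates. With isolation, only the frequency subsystem disagrees with its measurement, which points to it as the faulty one.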
EXPERIMENTAL RESULTS AND VALIDATION

Test of the monitoring algorithm
For experimental validation, we used two test boards: a commercial ARM-based system and a test and development board. The commercial ARM-based system is equipped with a SoC that has a quad-core ARM processor with variable frequencies ranging between 0.3 and 2.45 GHz, and a GPU with frequencies ranging between 200 and 578 MHz. The SoC is covered by the system's 2 GB low-power DDR3 RAM. The test and development board has a single-core ARM-based CPU and 1 GB of RAM. In Fig. 2 and Fig. 3, the addition of the delay thresholds eliminated all spikes, and no false alarms were recorded.
As was the case with the frequency and voltage residuals, Fig. 4 and Fig. 5 display the raw residuals $r_P$, the averaged residuals $r_P^m$, and the normalized residuals $R_P$ in normal operation. They demonstrate the effectiveness of the processing method for both the power and temperature residuals: it prevented the false alarms that the 0.9% of residuals lying outside the threshold envelope would otherwise have raised.
Faulty scenarios
In order to validate the monitoring algorithm, we tested it against two faulty scenarios. The experimental results presented in this section were obtained with data recorded during those scenarios, which were chosen to simulate faults originating from the environment and/or signs of wear of the system.
Environmental faults
Faults caused by the environment generally manifest as overheated or overcooled surroundings. Overheating, the most common of the two, can be caused by a multitude of reasons, ranging from a faulty cooling system to electrostatic charge, and even radiation.
In this scenario, the mobile device was sealed in a waterproof bag and submerged in an 80 °C hot water bath.
Fig. 7. Values of the residuals $r_P$, $r_P^m$ and $R_P$ during the faulty power scenario experiment.
In Fig. 8, measured values start diverging from estimated ones at around t = 178 s, about 20 s after the submersion of the phone in the water. Fig. 9 displays the profiles of the residuals during the overheating. The overheating is detected at around t = 183 s, when the residual $r_T^m$ goes beyond the normal operating envelope. An alarm is then generated ($R_T$ rises from 0 to 1).
CONCLUSION
A data-driven approach is proposed in this paper for the detection and isolation of drifts in the characteristics of embedded electronic SoCs. The method is based on the creation of redundancy through qualitative models of each characteristic, built in an incremental interconnected structure that ensures good isolation of faults.
Drift indicators are then generated by comparing the actual output of the system with the reference output generated by the model. Further residual treatment allows for the algorithm to generate normal operation thresholds. These thresholds, if surpassed, would indicate the presence of a drift in the behavior of the system.
The experimental results obtained on a CPU-GPU SoC show the ease of implementation and the effectiveness of the proposed approach.
