Health Management for Self-Aware SoCs Based on IEEE 1687 Infrastructure
Fault tolerance has always been an important characteristic of dependable and mission-critical systems, such as mainframes, satellites, autonomous vehicles, and so on. However, there is a high price associated with the traditional methods of ensuring fault tolerance such as triple modular redundancy (TMR). Therefore, new methods like health management were employed, which try to intelligently detect and isolate faulty resources. Traditionally, these methods were employed in already complex equipment and could afford extra components to be dedicated to health management tasks [1].
Continued scaling and ever higher integration in electronics allow most components of a system to fit onto a single SoC. Such SoCs are increasingly prone to manufacturing and in-field defects (e.g., due to aging) because of more advanced and finer process technologies. Therefore, health management and self-awareness have become an important topic for SoCs [2], [3].
The rapid emergence of embedded instrumentation as an industrial paradigm and the adoption of the respective IEEE standard 1687 [4] by key players of the semiconductor industry open up new horizons in developing efficient test, debug, and health monitoring frameworks. The IEEE standard 1687, also called Internal JTAG (IJTAG) for short, started as an initiative to standardize access to on-chip embedded instrumentation, such as monitors, sensors, and checkers, as well as design-for-testability (DfT) infrastructure [5]. The IJTAG concept embraces the paradigm of reconfigurable scan networks [6] and has become a very attractive industrial solution for both scan-based manufacturing test and system debug [5]. An extension to IJTAG for system health monitoring and fault management has been proposed in [7] and [8] and further elaborated in [3] and [9], [10], [11]. In contrast to the networks-on-chip (NoCs) used in other approaches [2], [3] for fault information gathering, IJTAG networks are significantly simpler, more robust, and standardized.
Konstantin Shibin and Sergei Devadze
Tallinn University of Technology
Artur Jutman
Testonica Lab OÜ
Editor's note:
Motivated by the need to tolerate faults, this paper presents a complete fault management solution that includes fault detection and categorization, maintaining a map of faults, and modified scheduling and application algorithms for using healthy resources only. As the system maintains fairly sophisticated models of itself regarding faulty and healthy resources, it constitutes a good example of specialized self-awareness.
-Axel Jantsch, TU Wien
Martin Grabmann and Robin Pricken
Technical University of Ilmenau
Failure resilience mechanisms guarantee a system's graceful degradation in the field under the pressure of wear-out by enabling the system to be aware of its own health. One can see failure resilience as a combination of fault tolerance and fault management. It is based on fault tolerance concepts but goes beyond them by localizing and classifying faults into, for example, transient versus permanent and critical versus low-priority ones [7]. Blocks with permanent faults are then either fully isolated or marked as reduced-capacity ones. In this way, a system health map is maintained and used by the operating system (OS) to schedule tasks. As a result, the system becomes self-aware and health management is performed using the gathered information.
In this article, a fault management architecture [9] based on IEEE standard 1687 is discussed and described with implementation details. In the following sections, we describe the architecture in general, fault classification categories and fault handling method. Finally, the feasibility of the approach is demonstrated in a case study where the fault management architecture is integrated into a system running Linux on a Xilinx Zynq SoC.
Fault management architecture
On-line fault management is a methodology that helps an electronic system to acquire, update in a timely manner, maintain, and use up-to-date information about the faults that can occur in various system modules. The main idea behind the fault management system is to help the system be aware of the faults inside itself and thus continue functioning, albeit with some of the modules being nonoperational, instead of going out of order. Depending on the situation, the system can reschedule the tasks to other resources or start using a spare one if one was provisioned. On the hardware level, the architecture is represented by special blocks called embedded instruments that are used to monitor and test functional hardware resources of the system. The software part is tightly integrated with the OS to allow for fault-management-aware process scheduling.
IEEE 1687 IJTAG as a backbone of fault management
IEEE standard 1687 makes it possible to create an efficient and regular network for continuously handling fault detection information as well as for managing test and system resources as a system-wide background process during system operation. This is possible because the IEEE 1687 IJTAG network neither needs nor shares any communication resources with the functional hardware. Being designed as relatively simple hardware with fault tolerance capabilities, it is intended to stay operational and perform system health management and diagnostics even if the functional part of the system is out of order and not responding.
The flexibility of IJTAG networks is based on the use of Segment Insertion Bits (SIBs), which allow the configuration of the network to be changed dynamically by including and excluding its segments [4].
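The effect of SIBs on the active scan path can be illustrated with a minimal Python sketch. It assumes a simplified model in which a closed SIB contributes a single bit to the scan chain and an open SIB additionally includes the segment behind it. The class names (`Sib`, `DataRegister`), the instrument names, and the register widths are illustrative assumptions and are not taken from the standard or from the implementation described in this article.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class DataRegister:
    """Test data register of an instrument (the width is hypothetical)."""
    name: str
    width: int


@dataclass
class Sib:
    """Segment Insertion Bit: one scan bit that gates a sub-segment."""
    name: str
    segment: List[Union["Sib", DataRegister]] = field(default_factory=list)
    is_open: bool = False


def scan_path_length(chain: List[Union[Sib, DataRegister]]) -> int:
    """Return the number of bits currently on the active scan path."""
    length = 0
    for node in chain:
        if isinstance(node, DataRegister):
            length += node.width
        else:
            length += 1  # the SIB bit itself is always on the path
            if node.is_open:
                length += scan_path_length(node.segment)
    return length


# Hypothetical two-level network with three instruments.
sib_a = Sib("A", [Sib("A1", [DataRegister("temp_monitor", 8)]),
                  Sib("A2", [DataRegister("ecc_checker", 4)])])
sib_b = Sib("B", [DataRegister("bist_status", 16)])
network = [sib_a, sib_b]

closed_length = scan_path_length(network)   # only the two top-level SIB bits

sib_a.is_open = True                        # include segment A
sib_a.segment[0].is_open = True             # include the register behind A1
opened_length = scan_path_length(network)   # A + A1 + 8-bit register + A2 + B
```

In this model, the scan path stays short while all SIBs are closed and grows only along the segments that are opened, which is why an access to one instrument does not need to shift through every register in the network.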
The main benefit of using IEEE 1687 IJTAG infrastructure for in-situ fault management is based on considerable reuse of existing test and debug infrastructure and instrumentation later in the field for the new purpose of fault management.
Cross-layer approach
The proposed on-line fault management architecture is intended to be used in systems with CPU and OS where the scheduler can be influenced or modified to take advantage of the information provided by the fault management system. In this article, we continue the research outlined earlier by us in [10] in the direction of generalized cross-layer fault management methodology. The main idea of using IJTAG remains the same, but we now elaborate on system-wide integration that brings all important aspects together, including the handling of fault and health information on all system layers (hardware, OS, and application level). We aim to reach a system-wide health management and improve self-awareness by defining and exploiting the relationship between three layers of a typical SoC:
• hardware, represented by the resources that are used to run the tasks,
• OS and Fault Manager (FM) as its part, and
• applications, which can have different resource requirements and mission criticality.
The general concept is depicted in Figure 1. We assume that the fault management framework operates on the same SoC as the target system itself. Its parts, FM and Instrument Manager (IM), are closely coupled to the functional part of the system: FM is service software that maintains the system health map (HM) and resource map (RM) and exchanges data with IM, which is implemented as dedicated hardware. We assume that the rest of the SoC has an arbitrary structure but contains heterogeneous or identical IP cores, which are prone to degradation but capable of fully or partially replacing one another, hence providing room for graceful degradation.
Hardware layer
The actual error detection takes place in situ in the embedded instruments/monitors. If a resource cannot be automatically recovered from an error, the Fault Management Infrastructure (FMI) should instantly pass an emergency event from the monitor to the OS so that the latter can react immediately [e.g., reschedule the affected task(s)].
In [10], we proposed a flag-based error reporting system where each block or submodule is provided with a dedicated set of status flags indicating the current fault detection status of the respective resource: the F flag for fault detection and the C flag for fault correction. All flags collectively form a hierarchical error indication and propagation structure, the Asynchronous Fault Propagation Network (AFPN), which is tied to the IEEE 1687 network and forwards the status information from the flags to IM. The AFPN represents a significant contribution, as the original IEEE standard 1687 does not consider such mechanisms. The asynchronous signals are hierarchically aggregated using logic gates to produce a signal for IM, which issues an interrupt in case of an uncorrected error.
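The hierarchical aggregation of the flags can be sketched in Python as follows. The sketch assumes that an uncorrected error corresponds to F = 1 and C = 0 and that the per-block conditions are combined with a logical OR on every level of the hierarchy. The exact gate structure of the AFPN is not specified here, so both the `FlagNode` class and the aggregation function are illustrative assumptions rather than the actual hardware design.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FlagNode:
    """A block in the hierarchy with its own F (fault) and C (corrected) flags."""
    name: str
    f: bool = False
    c: bool = False
    children: List["FlagNode"] = field(default_factory=list)


def uncorrected(node: FlagNode) -> bool:
    """True if this block or any block below it reports F = 1 with C = 0."""
    own = node.f and not node.c
    return own or any(uncorrected(child) for child in node.children)


# Hypothetical hierarchy: one corrected fault and one uncorrected fault.
fpu0 = FlagNode("cpu0.fpu", f=True, c=True)      # detected and corrected
cache1 = FlagNode("cpu1.cache", f=True, c=False)  # detected, not corrected
cpu0 = FlagNode("cpu0", children=[fpu0])
cpu1 = FlagNode("cpu1", children=[cache1])
soc = FlagNode("soc", children=[cpu0, cpu1])

irq_requested = uncorrected(soc)  # the signal that would reach IM
```

In this model, a corrected fault in `cpu0.fpu` alone does not raise the aggregated signal, whereas the uncorrected fault in `cpu1.cache` does.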
IM is a hardware module that is responsible for the communication with the instruments through the IJTAG network. We implement IM as a fixed and relatively small finite-state machine, a simplified view of which is shown in Figure 2. The configuration of the network is stored in a special format in a read-only memory (ROM) that is filled with data at system design time, when the IJTAG network configuration becomes known. The ROM consists of N 14-bit words, where N is the total number of SIBs and data registers in the network that need to be addressed by IM.
Besides regular instrument access requests from FM, IM is responsible for reacting to the fault flags set by the instruments and propagated as an asynchronous emergency signal. IM automatically opens the path to the instrument that raised the fault flag and provides the information about its location to FM. However, IM can only provide coarse fault localization in this manner.
OS layer
The core of the fault management architecture is contained in FM, which is a part of the OS kernel and is responsible for maintaining HM and RM, performing fault classification, and communicating with the IM hardware.
Self-Awareness in Systems on Chip 2017
Health map is a linked list data structure (an example is shown in Figure 3) that is central to the architecture and holds the detailed information about the system's resources (blue and light blue boxes), the related instruments (green), and the faults (red) that have occurred previously, together with their classification. HM maintains the statistics of fault occurrences (orange) in resources for a better reliability prediction capability. This information, available through fault classification, helps the OS scheduler to decide which resource is currently more reliable and where the most critical task should be run. Since hardware faults may not disappear after a system restart, HM should not be lost either. HM should be stored in a reliable nonvolatile memory to maintain the prediction capability across power cycles. To facilitate this, the HM structure is organized in a way that is easy to serialize for storage and where new fault detection entries are always appended to the end of the occupied memory (see Figure 3).
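The append-only organization can be illustrated with a small Python sketch that stores fault entries as fixed-size records at the end of a byte buffer and derives per-resource fault statistics from them. The record layout (resource identifier, instrument identifier, classification code, timestamp) and all function names are assumptions made for this example. The actual HM layout is the linked structure of Figure 3 and is not reproduced here.

```python
import struct
from collections import Counter
from typing import List, Tuple

# Hypothetical record layout: resource id, instrument id, class code, timestamp.
RECORD = struct.Struct("<HHBI")


def append_fault(store: bytearray, resource: int, instrument: int,
                 fault_class: int, timestamp: int) -> None:
    """Append one fault entry to the end of the occupied memory."""
    store.extend(RECORD.pack(resource, instrument, fault_class, timestamp))


def read_faults(store: bytearray) -> List[Tuple[int, int, int, int]]:
    """Deserialize all entries, e.g. after a power cycle."""
    return [RECORD.unpack_from(store, offset)
            for offset in range(0, len(store), RECORD.size)]


def fault_statistics(store: bytearray) -> Counter:
    """Count the recorded faults per resource identifier."""
    return Counter(entry[0] for entry in read_faults(store))


# Hypothetical usage: three faults, two of them in resource 1.
health_map = bytearray()
append_fault(health_map, resource=1, instrument=4, fault_class=0, timestamp=100)
append_fault(health_map, resource=2, instrument=5, fault_class=2, timestamp=250)
append_fault(health_map, resource=1, instrument=4, fault_class=0, timestamp=900)
stats = fault_statistics(health_map)
```

Because existing entries are never rewritten, the buffer can be copied to nonvolatile storage as is, and older entries stay valid when new ones are appended.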
Resource map is a data structure in the system memory that holds the information about the currently available (healthy) resources of the system. It should be modified on the fly during system's normal operation, should a fault be detected by an instrument or a diagnostic routine.
Task scheduling is performed by an OS scheduler that takes into account the information from RM. This can be achieved in two ways: either the scheduler is designed/modified to read RM, or it is instructed through a special interface. The scheduler must analyze RM and select the resources to be used for task execution based on:
• subresource availability [e.g., floating point unit (FPU) inside a CPU],
• information in the task descriptor,
• reliability of the resource based on the fault statistics, and
• mission-criticality of the task.
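The selection criteria listed above can be sketched in Python as follows. The data classes and the policy are assumptions made for illustration only: the sketch filters resources by the subresources a task requires, gives a mission-critical task the resource with the fewest recorded faults, and gives a noncritical task the least reliable suitable resource so that the more reliable ones stay free. The real policy depends on the system being designed.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Resource:
    name: str
    healthy_subresources: FrozenSet[str]  # availability information from RM
    fault_count: int                      # fault statistics from HM


@dataclass(frozen=True)
class Task:
    name: str
    required: FrozenSet[str]  # requirements from the task descriptor
    critical: bool            # mission criticality


def select_resource(task: Task, resources: List[Resource]) -> Optional[Resource]:
    """Pick a resource for the task, or None if no resource is suitable."""
    candidates = [r for r in resources
                  if task.required <= r.healthy_subresources]
    if not candidates:
        return None
    if task.critical:
        # Critical tasks get the resource with the fewest recorded faults.
        return min(candidates, key=lambda r: r.fault_count)
    # Assumed policy: noncritical tasks leave the most reliable resources free.
    return max(candidates, key=lambda r: r.fault_count)


# Hypothetical resource map: cpu1 has lost its FPU due to a permanent fault.
resources = [
    Resource("cpu0", frozenset({"fpu", "cache"}), fault_count=3),
    Resource("cpu1", frozenset({"cache"}), fault_count=0),
    Resource("cpu2", frozenset({"fpu", "cache"}), fault_count=1),
]
critical_fp = Task("control_loop", frozenset({"fpu"}), critical=True)
logger = Task("logger", frozenset({"cache"}), critical=False)
dsp_job = Task("filter", frozenset({"dsp"}), critical=False)
```

With this hypothetical resource map, `select_resource(critical_fp, resources)` returns `cpu2`, the most reliable core that still has a healthy FPU, and `select_resource(dsp_job, resources)` returns `None` because no resource offers the required subresource.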
Application layer
In order to efficiently assign available hardware resources for task execution, the scheduling algorithm can take the requirements of the tasks into account. For example, if it is known that a task requires a specific CPU core feature (e.g., FPU), then it should not be scheduled to a core that lacks this functionality (e.g., due to a permanent fault). We propose to store this information in a task descriptor file that contains the resource requirements for the task execution and is associated with the executable file of the task by having the same name.
Fault tolerance of FMI components
The operability of the fault management architecture is crucial for the system's correct reaction to the occurring fault events; therefore, the components of FMI itself should be well protected against possible faults. Due to the relative simplicity of the hardware part of FMI, it is possible to apply traditional and more expensive fault tolerance methods, such as TMR or hardening. Because the software part runs on a regular CPU core of the system, it also needs to be protected against faults, for example, by means of redundant execution (temporal or spatial). Recent work [12] covers this topic in more detail.
Fault classification
The ability to classify errors, malfunctions, and faults is an important basis for health map management, effective system recovery, and fault management. In [11], we proposed to classify the faults according to their severity levels and their contribution to the permanent malfunction of the system's components and modules. In particular, we consider the following properties: persistence, severity, criticality, diagnostic granularity, and location. This classification has a strong relation to fault management processes and the architecture of the health map. In this work, we use the following categories:
• Persistence: Transient, intermittent, and permanent.
• Severity: Faults can be different in their influence on the resource: from benign to severe (e.g., failed program counter in a CPU core).
• Criticality: Depending on the resource where the fault has occurred, its consequences for operability and stability of the system as a whole can span from none to total system failure.
• Diagnostic Granularity: Precision of fault location depends on how the fault was found: either by an instrument, during the diagnostic routine, or using high-level fault detection methods.
• Fault location: Localization of the fault occurrence as a result of fault detection or fault diagnosis procedure, for example, instrument position in IJTAG network.
The exact procedures for the classification of faults may differ and should depend on the policies of the designed system into which the fault management support is integrated. Therefore, this topic is not covered in this article.
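As an illustration of how the five categories could be held together in a health map entry, the following Python sketch defines a record type. The numeric scales for severity and criticality, the enumeration names, and the location string format are hypothetical, since the classification procedures and encodings are left to the policies of the target system.

```python
from dataclasses import dataclass
from enum import Enum


class Persistence(Enum):
    TRANSIENT = "transient"
    INTERMITTENT = "intermittent"
    PERMANENT = "permanent"


class Granularity(Enum):
    INSTRUMENT = "instrument"
    DIAGNOSTIC_ROUTINE = "diagnostic routine"
    HIGH_LEVEL = "high-level detection"


@dataclass(frozen=True)
class FaultRecord:
    persistence: Persistence
    severity: int        # hypothetical scale: 0 (benign) to 3 (severe)
    criticality: int     # hypothetical scale: 0 (none) to 3 (total failure)
    granularity: Granularity
    location: str        # e.g., instrument position in the IJTAG network


# Hypothetical entry: a permanent fault reported by an instrument.
example = FaultRecord(
    persistence=Persistence.PERMANENT,
    severity=3,
    criticality=1,
    granularity=Granularity.INSTRUMENT,
    location="sib1/instrument4",
)
```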
Fault handling method
When a fault occurs in a complex SoC working under the control of an OS, it is necessary that the latter becomes aware of the fault as quickly as possible. The OS must then take actions to isolate and mitigate the effects of the fault.
There are various ways available to detect a fault in a resource of the system: OS-controlled temporal and modular redundancy of task execution, Power-On Self-Test checks, and others. In this work, we concentrate on the faults that are detected online (during normal operation of the system) by built-in instruments.
For a generic case of a fault being detected by an instrument, the flowchart with four actors (Instrument, IM, FM, and OS) is shown in Figure 4. In the following subsections, we describe this process in more detail.
Fault detection
Whenever a fault is detected by an instrument, the information about this event is quickly passed to IM through AFPN. In response, IM sends an interrupt request that is served by FM inside OS kernel. Since the nature of the fault is not known at this stage yet, FM stops execution on all CPU cores in order to contain the possible error propagation.
Fault localization
Concurrently with sending the interrupt request, IM starts the instrument localization procedure. IM successively opens the hierarchical IJTAG network segments where the Fault flag is set. Once the instrument that raised the flag is reached, IM can report the location of the fault as the position in the IJTAG network.
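The localization procedure can be sketched as a descent through the network hierarchy that, on each level, follows the segment whose fault flag is set. The Python model below is a simplification under stated assumptions: the `Segment` class, the segment names, and the two-level example topology are hypothetical, and the shifting of scan data that the hardware IM performs to open each SIB is not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """A network segment or, if it has no children, an instrument."""
    name: str
    fault_flag: bool = False
    children: List["Segment"] = field(default_factory=list)

    def flagged(self) -> bool:
        """Aggregated fault flag as seen at this level of the hierarchy."""
        return self.fault_flag or any(c.flagged() for c in self.children)


def localize(root: Segment) -> Optional[List[str]]:
    """Return the path to the flagged instrument, or None if no flag is set."""
    if not root.flagged():
        return None
    path = [root.name]
    node = root
    while True:
        # Open the first child segment in which the fault flag is set.
        flagged_child = next((c for c in node.children if c.flagged()), None)
        if flagged_child is None:
            return path  # reached the instrument that raised the flag
        node = flagged_child
        path.append(node.name)


# Hypothetical two-level network with six instruments.
instruments = [Segment(f"instr{i}") for i in range(6)]
network = Segment("top", children=[
    Segment("sib0", children=instruments[:3]),
    Segment("sib1", children=instruments[3:]),
])
instruments[4].fault_flag = True
fault_path = localize(network)
```

In this model, the number of segments opened equals the depth of the hierarchy rather than the number of instruments, so only the branch that carries the flag is traversed.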
Coarse-grained fault classification
The information about the instrument location is used by FM to perform coarse classification of the fault. Since the detailed diagnostic information about the fault is not available at this time, the diagnostic granularity is limited to the instrument location.
System response
Based on the information derived in the previous step, the OS may need to take actions to mitigate the effects of the fault on the functional operation of the system. The fault can be ignored if it does not affect the operation or the task was not critical. Alternatively, the task can be rescheduled to another resource or re-executed on the same one. After the required actions are taken, CPU cores are released.
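The response options described above can be summarized in a short Python sketch. The decision rule is an assumption made for illustration: a fault is ignored if it does not affect the operation or the task is not critical, a task hit by a permanent fault is rescheduled to another resource, and in the remaining cases the task is re-executed on the same resource. The actual policy depends on the designed system.

```python
from enum import Enum


class Action(Enum):
    IGNORE = "ignore"
    RE_EXECUTE = "re-execute on the same resource"
    RESCHEDULE = "reschedule to another resource"


def choose_action(affects_operation: bool, task_critical: bool,
                  permanent: bool) -> Action:
    """Select a response to a coarsely classified fault (assumed policy)."""
    if not affects_operation or not task_critical:
        return Action.IGNORE
    if permanent:
        # The resource cannot be trusted again, so move the task elsewhere.
        return Action.RESCHEDULE
    # Assumption: a nonpermanent fault may not recur on the same resource.
    return Action.RE_EXECUTE
```

For example, `choose_action(True, True, permanent=True)` yields `Action.RESCHEDULE`, while a fault that does not affect the operation is ignored regardless of the other inputs.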
Diagnosis
Depending on the outcome of the coarse classification step, FM may decide to obtain more detailed diagnostic information, in which case the diagnostic procedure is launched. Information must then be sent to or received from the diagnostic instruments (such as BIST or other DfT hardware) connected to the IJTAG network.
Fine-grained fault classification
If some additional information is acquired as a result of the diagnostic procedure, FM can perform fine fault classification and update both HM and RM.
Case study: Augmenting Linux with fault management support
In order to validate our health management approach, we conducted a proof-of-concept case study on a real multicore processing system.
In general, the following modifications in system architecture are required to enable fault and health management features. The hardware part should implement the IM module, and the respective IJTAG network connecting a set of embedded instruments with IM should be inserted. The hardware platform should contain at least several CPU cores able to run an OS. Based on these requirements, we have selected the Xilinx Zynq-7000 platform, as it includes both programmable logic and industry-standard CPU cores. In this case study, the CPU cores and their submodules (CPU cache, FPU, and Digital Signal Processor) are considered as the objects of health management. The instruments responsible for monitoring the health of the cores were emulated and, together with the IJTAG instrumentation network, were inserted into the programmable logic part of the Zynq platform. The network consisted of two hierarchical layers with six instruments connected to it. The experimental results can also be extrapolated to larger networks. In [9], it has been shown that the fault localization latency has a logarithmic dependency on the network size.
The software part consists of an OS that is enhanced with fault management support. For our case study, the OS needs to meet the following requirements:
• multicore support and
• open and well-documented kernel.
Multicore support is essential for this case study, since only on a multicore platform is it possible to continue the operation if one of the processing cores fails. Health management support on the OS level requires modifications of the kernel sources/modules. Hence, the OS and its inner architecture should be open and well documented. Otherwise, it is difficult to gain the knowledge required for making the necessary modifications.
Linux is a very popular OS for all kinds of computing applications and hardware platforms. It has built-in support for symmetric multiprocessing, which targets homogeneous multicore architectures. For our purposes, an advantage of Linux is that its kernel provides an easy interface for controlling the task scheduler. For this case study, we used PetaLinux 2015.4.
Fault management kernel module
On the software/OS side, the primary tasks of fault management are carried out by specially developed FM software in the form of a loadable kernel module, which is the central part of the architecture. Its structure is shown in Figure 5. A list of managed processes is stored in the kernel module; it contains the process identifiers and the associated records of resources (modules/submodules) needed for the execution of each managed process. With the information from the process list and HM, the system is always aware of how to schedule the tasks to available resources.
In this case study, resource-aware scheduling is implemented with the unmodified Linux scheduler. This is possible using the sched_setaffinity() system call, which allows explicitly defining the subset of cores that may be used to execute a certain process. Therefore, in this setup, RM exists in the implicit form of the CPU affinity information that is provided by FM to the Linux scheduler.
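The affinity-based approach can be illustrated from user space with Python's `os.sched_setaffinity()`, which wraps the same Linux system call. This is only a sketch under stated assumptions: in the case study, the affinity is set by the FM kernel module rather than by a user-space script, and the resource map format, the function names, and the submodule names used here are hypothetical.

```python
import os
from typing import Dict, Set


def allowed_cores(required: Set[str],
                  resource_map: Dict[int, Set[str]]) -> Set[int]:
    """Return the cores whose healthy submodules cover the requirements."""
    return {core for core, healthy in resource_map.items()
            if required <= healthy}


def apply_affinity(pid: int, cores: Set[int]) -> None:
    """Restrict a process to the given cores (available on Linux only)."""
    if not cores:
        raise ValueError("no healthy core satisfies the requirements")
    os.sched_setaffinity(pid, cores)


# Hypothetical resource map of a dual-core system: the FPU of core 1 is faulty.
resource_map = {0: {"cache", "fpu"}, 1: {"cache"}}
fp_task_cores = allowed_cores({"fpu"}, resource_map)
any_task_cores = allowed_cores({"cache"}, resource_map)

# Example call, left commented out because it changes the affinity of the
# calling process (pid 0 refers to the caller):
# apply_affinity(0, fp_task_cores)
```

The design keeps the policy (which cores are acceptable) separate from the mechanism (the affinity call), so the scheduler itself does not need to be modified.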
Communication
In the event of an error detection, IM raises the high interrupt request (IRQ) line as soon as an unrecovered fault condition (flags F = 1 and C = 0) is signaled by an instrument. The request is served by an interrupt service routine (ISR) that puts the CPU cores on hold and waits until the low IRQ is raised, which happens when IM has finished the fault localization procedure. The kernel module then reads the status information from the shared register and updates HM.
ISR high
This ISR stops all other cores (using a mutex) to prevent tasks from running on a faulty core when an unrecovered fault has been detected by IM. To do so, the ISR first locks the mutex and then blocks all other cores by making them wait for the same mutex. After the location of the faulty resource becomes known, the healthy cores are released by the ISR low and can continue their operation.
ISR low
When IM has finished the localization procedure or wants to transfer other information to the FM application, it triggers the low IRQ and the corresponding ISR is invoked. Depending on the information received from IM, ISR may add an entry to HM and call the classification procedure. The affected processes are rescheduled if required.
Experiments
We measured the reaction times of the solution proposed for this case study to find out how fast the system can react to an unrecovered fault. We also compared our results with the experimental data presented in [3]. For the experiment, we emulated a fault in one of the CPUs in a controlled manner by rewiring the emergency signals (F, C) of the respective instrument to push-buttons on the board, and measured the time required for the different stages of the system's reaction to this fault on real hardware. The time when the flag is raised is denoted by t_F. Measurements were made for the following fault handling stages:
• hardware detection latency: the time required for IM to raise high IRQ (t_hwdet - t_F),
• hardware localization latency: the time required for IM to automatically locate the instrument that has raised the F flag and raise low IRQ (t_hwloc - t_F),
• OS interrupt latency: the time required for OS to react to high IRQ and stop the CPU cores (t_halt - t_hwdet),
• localization and classification latency: the time required for the FM kernel module to localize and coarsely classify the fault and resume the execution (t_resume - t_halt), and
• total time required to handle a fault (t_resume - t_F).
CPU cores were running at 667 MHz, while IM and the IJTAG network were clocked at 100 MHz. Since the interrupt latency of the standard Linux kernel is nondeterministic and can vary from run to run, the results of the third and fourth stages are averaged over 10 runs. The experimental results are presented in Table 1, which shows the fault detection latencies for every stage in clock cycles and microseconds.
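The relation between the stage latencies and the conversion between clock cycles and microseconds can be written down in a few lines of Python. The timestamps used in the example are made-up placeholder values chosen only to show the arithmetic; they are not the measurements of Table 1. The function names are likewise assumptions of this sketch.

```python
from typing import Dict


def cycles_to_us(cycles: int, clock_hz: float) -> float:
    """Convert a number of clock cycles to microseconds."""
    return cycles * 1e6 / clock_hz


def stage_latencies(t_f: float, t_hwdet: float, t_halt: float,
                    t_resume: float) -> Dict[str, float]:
    """Compute the stage latencies from timestamps in a common time unit."""
    return {
        "hardware detection": t_hwdet - t_f,
        "OS interrupt": t_halt - t_hwdet,
        "localization and classification": t_resume - t_halt,
        "total": t_resume - t_f,
    }


# 100 cycles of the 100-MHz IM clock correspond to 1 microsecond.
one_us = cycles_to_us(100, 100e6)

# Placeholder timestamps in microseconds (not measured values).
stages = stage_latencies(t_f=0.0, t_hwdet=2.0, t_halt=10.0, t_resume=15.0)
```

Since the three consecutive stages share their boundary timestamps, their sum equals the total latency t_resume - t_F. The hardware localization latency (t_hwloc - t_F) is measured from t_F as well and therefore overlaps with the other stages instead of adding to them.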
From the measurement results, it can be seen that the OS interrupt latency has the dominating contribution to the overall reaction latency. Therefore, methods for reducing this part need to be evaluated and applied in future work to further improve the performance of the fault management architecture. Compared with the results from [3], where the fault detection time (a + c) was reported to be 21 ms (versus 7.24 µs in our case) and the equivalent of (e) at hundreds of milliseconds (versus 13.07 µs in Table 1), our approach shows an improvement of orders of magnitude.
This paper describes an efficient health management approach for a self-aware system that combines monitoring facilities with an emergency signaling infrastructure enabling quick data collection for system health monitoring. The backbone of the proposed health management concept is the fault handling infrastructure built upon the IEEE Standard 1687, enabling cost-efficient reuse of embedded instrumentation in the field.
First, we describe the FMI in detail, including FM, IM, HM, and RM, and explain their roles in health management tasks. The main contribution of this paper is twofold. First, the implementation details of the proposed architecture are discussed. Second, the feasibility and operation of the health and fault management infrastructure are demonstrated in a case study based on an SoC with industry-standard hardware and software (Zynq and Linux OS).
Finally, to measure the actual reaction latency to the occurring unrecovered fault, we used this platform and measured the time required for different fault handling stages in both hardware and software. The measurements show that the proposed fault management architecture allows a health-aware system to quickly recover from a malicious fault and continue to stay operational. 
