Abstract. The paper addresses the reliability of complex embedded control systems in safety-critical environments. We propose a novel approach to designing a controller that (i) guarantees the safety of nonlinear physical systems, (ii) enables safe system restart during runtime, and (iii) allows the use of complex, unverified controllers (e.g., neural networks) that drive the physical systems towards complex specifications. We use an abstraction-based controller synthesis approach to design a formally verified controller that provides application-level and system-level fault tolerance along with a safety guarantee. Moreover, our approach is implementable using a commercial-off-the-shelf (COTS) processing unit. To demonstrate the efficacy of our solution and to verify the safety of the system under various types of faults injected into the applications and the underlying real-time operating system (RTOS), we implemented the proposed controller for an inverted pendulum and a three degree-of-freedom (3-DOF) helicopter.
Introduction
With the increased use of embedded systems in various safety-critical environments, these systems are expected to provide high performance together with reliability. Delivering high performance drives the need for more complex systems 1 . As a consequence, the possibilities for errors increase and formal verification becomes more challenging.
In a safety-critical environment, the use of complex embedded control systems, where one or more control applications run on top of a real-time operating system (RTOS) and share resources, may lead to safety violations 2 due to various software-level faults. There are two main root causes of such faults. First, the control application may issue a set of unsafe commands due to incorrect logic (bugs) or fail to generate any commands at all (referred to as application-level faults). Second, even with a bug-free control application, faults in underlying software layers such as the RTOS can disrupt the execution of the controller and jeopardize safety (usually referred to as system-level faults). Ideally, all the components of these systems, including the RTOS, must be formally verified to ensure that they are fault-free. However, due to the high complexity, formal verification of the entire platform is very difficult. Therefore, it is very important to design architectures that enable system designers to use components such as the RTOS and vendor drivers without having to prove their correctness in order to guarantee safety.
In this work, we propose a novel approach to designing a controller that provides a safety guarantee for the physical component in the presence of application-level and system-level faults. Moreover, the proposed solution provides fault-tolerance and liveness guarantees using only one commercial-off-the-shelf (COTS) computing platform. The proposed approach uses full system restart to recover from such application-level and system-level faults. However, restarting in a safety-critical environment is very challenging. In our previous work [ATR + 17], we provided a solution for designing controllers ensuring fault tolerance and safety for linear physical systems. Designing such controllers becomes very challenging if the physical components exhibit nonlinear dynamics (which is the case in most real applications). To address this, in this paper we provide a procedure for the synthesis of abstraction-based correct-by-construction controllers for nonlinear physical systems that enables the entire computing system to be safely restarted at runtime. This controller can keep the nonlinear control system inside a subset of the safety region only by updating the actuator input at least once after every system restart. In this paper, we refer to this controller as the Base Controller (BC).
Restarting a system is an effective approach for recovery from unknown faults at runtime with a very predictable outcome. As soon as a fault occurs that disrupts the execution of critical software components, a hardware watchdog timer (WD) restarts the system. During a restart, a fresh image of all the software (middleware, RTOS, and applications) is loaded from read-only storage, which recovers the system into an operational state. Prior to this work, restarting was proposed as a way to increase the availability of non-safety-critical systems [CF01, GA03, CKZ + 03, CKF + 04, VT05, GPTT95, HKKF95]. In addition, partial restarting of safety-critical systems using extra hardware was investigated in [AMB + 16, BCA + 09]. To the best of our knowledge, this is the first work that proposes safe restarting of the entire system in a safety-critical environment containing nonlinear physical components.
BC together with the WD mechanism, which enables restarting, allows the system to remain safe, tolerate faults, and recover from them. However, such a system does not make any progress towards its mission goal. To address this issue, BC is complemented with a Mission Controller (MC) (e.g., a neural network) and a Decision Module (DM). The MC is an unverified, high-performance, complex controller that drives the system towards the mission setpoints. It may contain unsafe logic or bugs that jeopardize safety. To maximize the progress towards the mission goals, DM checks the MC command in every control cycle. If it satisfies the safety requirements, DM allows it to be sent to the actuators. Otherwise, the BC command is applied to the system. By doing so, MC drives the system for as long as possible, and whenever that is not possible, BC takes control. The logical view of this design is depicted in Figure 1 .
In the proposed design, the only components that need to be verified for correct functionality are BC, DM, and the Flushing Task. Any fault in the system software (system-level or application-level) that results in a fail-silent failure (also known as fail-stop) of these components leads to WD triggering a system-wide restart and recovery. However, our design does not protect the system from faults that alter the logic of BC or DM at execution time. In summary, this design enables the system to provide formal safety guarantees by verifying only the correctness of BC, DM, and the Flushing Task instead of the entire MC, RTOS, and middleware.
The key contributions of our work are:
• Construction of formally verified base controllers for safety-critical applications with nonlinear physical components which guarantee safe full system restart for application and system level fault tolerance.
• Tolerating application-level faults as well as system-level faults using only one COTS processing unit.
• Empirical validation of both the practicality of our proposed design and the safety guarantees through fault-injection testing on a prototype controller for the nonlinear inverted pendulum system and a 3-DOF helicopter.
Related Work
The concept of utilizing an unverified, complex controller along with a simple, verified safety controller for fault tolerance was initially proposed as Simplex Architecture in [Sha98, Sha01, SRG96] . In earlier simplex designs, fault tolerance was achieved in one of two ways. In some of these designs such as [Sha98, SS99, Sha01, SRG96, CGR + 07], all three components (safety controller, complex controller, and decision unit) share the same computing hardware (processor) and software platform (OS, middleware). As a result, these designs only protect the safety against the faults in the application logic of the complex controller. And, there is no guarantee of the correct behavior in the presence of system-level faults. Our proposed design protects the system from both application-level and system-level faults.
Some Simplex-based designs such as System-Level Simplex [BCA + 09], Secure System Simplex Architecture (S3A) [MBB + 13] , and other variants [VGYK16] run the safety controller and the decision logic on an isolated, dedicated hardware unit. By doing so, the trusted components are protected from the faults in the complex subsystem. However, exercising System-Level Simplex design on most COTS multicore platforms/SoC (system on chip) is challenging. The majority of commercial multicore platforms is not designed to achieve strong intercore fault isolation due to the high-degree of hardware resource sharing. For instance, a fault occurring in a core with the highest privilege level may compromise power and clock configurations of the entire platform. To achieve full isolation and independence, one has to utilize two separate boards/systems. In contrast, the approach proposed in this paper needs only one processor and tolerates system-level faults.
Note that the control domain of the proposed BC depends on the system dynamics and the restart time of the platform. An increased restart time shrinks the domain of the BC. For a given system, the domain may be empty, meaning that the dynamics of the system do not allow a controller with such properties to exist. System-Level Simplex does not have this limitation because it uses dedicated hardware that is not impacted by faults (or restarts) in the complex controller unit. The proposed approach is especially suitable for Internet of Things (IoT) applications, which require increased robustness at low cost and without the extra hardware that System-Level Simplex requires.
The notion of restarting as a means of recovery from faults and improving system availability has previously been proposed in the literature. These approaches are generally divided into two categories: (i) revival, which reactively restarts a failed component, and (ii) rejuvenation, which prophylactically restarts functioning components to prevent state degradation [CCF04] . Our approach, as described in this paper, fits in the former category. However, with a slight modification, our design can incorporate periodic self-triggered restarts to prevent future unscheduled downtime. In this second form, this work also fits in the latter category.
Most of the previous works on restarting are proposed for traditional non-safety-critical computing systems such as servers and switches. The authors of [CF01] introduce recursively restartable systems as a design paradigm for highly available systems and use a combination of revival and rejuvenation techniques. Earlier literature [GA03, CKZ + 03, CKF + 04] illustrates the concept of microreboot, which consists of having fine-grained rebootable components and trying to restart them from the smallest component to the biggest one in the presence of faults. The works in [VT05, GPTT95, HKKF95] focus on failure and fault modeling and try to find an optimal rejuvenation strategy for various systems. In this context, our previous work on Reset-Based Recovery [AMB + 16] was an attempt to utilize restarting as a recovery method for computing systems in safety-critical environments. In [AMB + 16], we used the System-Level Simplex architecture and proposed to restart only the complex subsystem upon the occurrence of faults. This is feasible because the safety subsystem runs on a dedicated hardware unit and is not impacted by the restarts in the complex subsystem. The approach of the current paper is significantly different and uses only one hardware unit.
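3. Control System Description
3.1. Notations. The symbols $\mathbb{N}$, $\mathbb{N}_0$, $\mathbb{Z}$, $\mathbb{R}$, $\mathbb{R}^+$, and $\mathbb{R}_0^+$ denote the sets of natural, nonnegative integer, integer, real, positive real, and nonnegative real numbers, respectively. We use $\mathbb{R}^{n\times m}$ to denote the vector space of real matrices with $n$ rows and $m$ columns. The identity matrix in $\mathbb{R}^{n\times n}$ is denoted by $I_n$ and the zero matrix in $\mathbb{R}^{n\times m}$ by $0_{n\times m}$. We identify the relation $R \subseteq A \times B$ with the map $R : A \to 2^B$ defined by $b \in R(a)$ if and only if $(a, b) \in R$. The inverse relation of $R$ is $R^{-1} := \{(b, a) \in B \times A \mid (a, b) \in R\}$. $Q \circ R$ denotes the composition of maps $Q$ and $R$, $Q \circ R(x) = Q(R(x))$. The map $R$ is said to be strict when $R(a) \neq \emptyset$ for every $a \in A$.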
3.2. Nonlinear Control Systems.
Definition 3.1 (Nonlinear control systems). A nonlinear control system is a tuple $\Sigma = (\mathbb{R}^n, U, \mathcal{U}, f)$, where $\mathbb{R}^n$ is the state space; $U \subseteq \mathbb{R}^p$ is a bounded input set; $\mathcal{U}$ is a subset of the set of all functions of time from $\mathbb{R}_0^+$ to $U$; and $f$ is a locally Lipschitz continuous map from $\mathbb{R}^n \times U$ to $\mathbb{R}^n$.
The trajectory $\xi$ is said to be a solution of $\Sigma$ if there exists $\upsilon \in \mathcal{U}$ satisfying:
$$\dot{\xi}(t) = f(\xi(t), \upsilon(t)) \tag{1}$$
for any $t \in \mathbb{R}_0^+$. We emphasize that the local Lipschitz continuity assumption on $f$ ensures existence and uniqueness of the solution $\xi$ [Son13] . We use the notation $\xi_{x,\upsilon}(t)$ to denote the value of the solution at time $t$ under the input signal $\upsilon$ and starting from the initial state $x = \xi_{x,\upsilon}(0)$.
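3.3. Formulating Safety. In physical systems, maintaining all states and control inputs within safe limits is very important in order to avoid damage to the system itself or the environment around it. In this paper, we define the safety region $S$ as a subset of the state space. For example, one can define it as:
$$S = \{x \in \mathbb{R}^n \mid H_x\, x \le h_x\}, \tag{2}$$
where $H_x$ and $h_x$ are a matrix and a vector of appropriate dimensions.
In a similar way, the bounds on the operational ranges of the control inputs can be expressed as:
$$S_u = \{u \in \mathbb{R}^p \mid H_u\, u \le h_u\}. \tag{3}$$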
The nonlinear control system Σ is said to be safe if the states of the system remain inside S using only the control commands in S u .
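3.4. Reachable Set. Consider a nonlinear control system as in (1) and a set $X_0 \subset \mathbb{R}^n$. The set of states that can be reached starting from $X_0$ under the input signal $\upsilon$ at time $\tau$ is given by $\mathrm{Reach}_\tau(X_0, \upsilon) := \bigcup_{x \in X_0} \{\xi_{x,\upsilon}(\tau)\}$. We use the notation $\mathrm{Reach}_{[0,\tau]}(X_0, \upsilon)$ to denote the set of states that can be reached starting from $X_0$ under the input signal $\upsilon$ up to time $\tau$, which can be defined as
$$\mathrm{Reach}_{[0,\tau]}(X_0, \upsilon) := \bigcup_{t \in [0,\tau]} \mathrm{Reach}_t(X_0, \upsilon).$$
We use the notation $\overline{\mathrm{Reach}}_\tau(X_0, \upsilon)$ to denote an over-approximation of the set $\mathrm{Reach}_\tau(X_0, \upsilon)$.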
Design Approach
As depicted in Figure 1 , the proposed design consists of three main components; Base Controller (BC), Mission Controller (MC) and Decision Module (DM).
The BC is a verified, reliable controller that is only concerned with safety. It does not make progress towards the mission set points of the system (i.e., it does not provide liveness). The MC, on the other hand, is the main controller which is concerned with the mission-critical requirements. This controller may have complex logic, can be changed and upgraded while the system is running and may even contain unsafe logic and bugs.
As an example, MC may be a neural network resulting from machine learning techniques.
All the components of the system run on top of the RTOS. The length of one control cycle of the system is $\tau_c$. The $k$th control cycle refers to the period $[(k-1)\tau_c, k\tau_c]$, where $k \in \mathbb{N}$. The cycle count and the time origin are reset after every system restart. Therefore, $k = 1$ always refers to the first cycle after the latest system restart. Furthermore, we assume that the length of the restart time 3 , i.e., $\tau_r$, of the system is an integer multiple 4 of $\tau_c$ (i.e., $\tau_r = m\tau_c$, where $m \in \mathbb{N}$). While the system is running, sensor values are sampled at $t = k\tau_c - \epsilon$, where $\epsilon \ll \tau_c$, and actuator inputs are updated at $t = k\tau_c$.
In every control cycle, after MC runs and generates its output u mc , DM evaluates the safety requirements under u mc and decides whether u mc can be applied to the actuators. Then, DM writes its output, along with the corresponding MC command and a timestamp (cycle number) to a fixed memory address.
At the end of the control cycle, at time $t = k\tau_c - \epsilon$ (with $\epsilon \ll \tau_c$), after the sensors are sampled, BC runs and generates $u_{bc}$. Then a flushing task retrieves $u_{mc}$, $u_{bc}$, the decision of DM, and the corresponding timestamp from the memory. If the timestamp matches the current cycle number $k$, it updates the actuator commands with $u_{mc}$ or $u_{bc}$ based on the decision of DM and resets the watchdog timer (WD). Non-matching timestamps indicate that one or both of the DM and BC tasks did not execute or missed their deadlines. In such cases, the flushing task does not reset the WD. Consequently, the WD expires at $t = k\tau_c$ and triggers a restart. Note that as a result of this mechanism, restarts are only triggered at times $t = k\tau_c$ and do not occur in between control cycles. The steps are illustrated in the corresponding figure.
In the rest of this section, we discuss the assumptions and the fault model of the system. Then, we introduce the properties of the BC and how it is able to safely tolerate the restarts. Finally, we discuss the safe switching logic of the DM.
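The end-of-cycle flushing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used on the platform: the record layouts `DmRecord` and `BcRecord`, the `Platform` stand-in for the actuator register and the hardware watchdog, and the scalar commands are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DmRecord:
    """Hypothetical layout of what DM writes to the fixed memory address."""
    cycle: int    # timestamp (cycle number) written by DM
    use_mc: bool  # DM decision: True if u_mc passed the safety check
    u_mc: float   # command produced by the mission controller


@dataclass
class BcRecord:
    """Hypothetical layout of what BC writes at the end of the cycle."""
    cycle: int
    u_bc: float


class Platform:
    """Stand-in for the actuator register and the hardware watchdog."""

    def __init__(self) -> None:
        self.actuator: Optional[float] = None
        self.wd_resets = 0

    def write_actuator(self, u: float) -> None:
        self.actuator = u

    def reset_watchdog(self) -> None:
        self.wd_resets += 1


def flush(k: int, dm: Optional[DmRecord], bc: Optional[BcRecord],
          hw: Platform) -> bool:
    """Run the flushing step of cycle k; return True if the WD was reset."""
    # A missing or stale record means DM or BC did not run in this cycle.
    if dm is None or bc is None or dm.cycle != k or bc.cycle != k:
        return False  # WD is left alone, so it expires and forces a restart
    hw.write_actuator(dm.u_mc if dm.use_mc else bc.u_bc)
    hw.reset_watchdog()
    return True
```

In this sketch a stale timestamp leaves both the actuator and the watchdog untouched, so the actuator keeps holding the last command while the watchdog runs out.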
Assumptions and Fault Model.
In this section, we clarify several assumptions we make about faults and components of the system.
• In this work, we are not concerned about hardware faults and we assume that hardware is reliable.
3 It includes the time for reloading the bootloader, OS, and the applications from the read-only storage, initializing the necessary sensors and peripherals, booting the OS, and executing the control applications.
4 Restart time can be rounded up to match the closest $k\tau_c$.
• BC, DM, and flushing task are independently verified and fault-free. They might, however, fail silently (no output is generated) due to faults in the software layers they depend on or in other applications.
• System-level and application-level faults may cause BC, DM, and flushing task to fail silently but may not change their logic or alter their output.
• Once a command is sent to an actuator input, the actuator holds that value until the control system sends a new actuation command. Therefore, during a system-level restart, the actuators operate with the last command that was sent before the restart occurred 5 .
• We assume that the system-level faults do not happen within the first τ r seconds after the boot is complete so that the BC has the chance to execute correctly at least once. In other words, this assumption implies that the system is not completely dysfunctional. In Section 4.2, we demonstrate the necessity of this assumption.
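4.2. Properties of the Base Controller. In this section, we state the properties required of the BC as follows:
There exists a subset $I$ of the state space such that for all $x \in I$ at time $t_0 \in \mathbb{R}_0^+$, there exists a control command $u_{bc} \in S_u$ such that:
(i) $\mathrm{Reach}_{\tau_c}(\{x\}, u_{bc}) \subseteq I$,
(ii) $\mathrm{Reach}_{\tau_c + \tau_r}(\{x\}, u_{bc}) \subseteq I$,
(iii) $\mathrm{Reach}_{[0, \tau_c + \tau_r]}(\{x\}, u_{bc}) \subseteq S$.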
Note that, in the rest of the paper we assumed that the actuators hold the control input constant within the period of [t 0 , t 0 + τ c + τ r ].
Intuitively, above properties imply that if the current state of the system is inside I, BC is able to generate a control command that keeps the physical system safe. For the intuition, consider t 0 = kτ c . Property (i) implies that one control cycle after u bc is applied to the actuators, at the end of (k + 1)th cycle, state is inside I. Therefore, if the system is still running and no faults have occurred, BC is able to find another safe command at t = (k + 1)τ c . If a fault had occurred within the (k + 1)th cycle, a restart will be triggered at the end of the cycle and BC will not be available to update the actuator input. Property (ii) implies that in such a case, the system will be in I, after the restart completes. This guarantees that the system can be kept safe after the restart completes. Finally, property (iii) ensures that the system remains inside the safety region during (k + 1)th cycle and a possible consequent restart.
A BC with the above properties, without any other components, can keep the system safe, only if it updates the actuator commands at least once after every restart τ r . Therefore, it is necessary for the system to not have any system-level faults within the first τ r seconds after the restart.
Switching Logic of DM.
A system with only BC remains safe and tolerates restarts but it does not make any progress towards the mission goal. In order to maximize the progress towards the mission goal, it is desirable to use the MC command in every cycle whenever it is possible.
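In every cycle $k$, DM runs and evaluates the following conditions. If these conditions hold, $u_{mc}$ is safe to be applied to the actuator inputs at the end of the cycle (i.e., at time $t = k\tau_c$). Otherwise, DM chooses $u_{bc}$. The following conditions guarantee that the system remains safe and recoverable under $u_{mc}$ whether it restarts or not:
(i) $\overline{\mathrm{Reach}}_{\tau_c}(\hat{x}[k], u_{mc}) \subseteq I$,
(ii) $\overline{\mathrm{Reach}}_{\tau_c + \tau_r}(\hat{x}[k], u_{mc}) \subseteq I$,
(iii) $\overline{\mathrm{Reach}}_{[0, \tau_c + \tau_r]}(\hat{x}[k], u_{mc}) \subseteq S$.
Here, $\tau_r$ and $\tau_c$ are the lengths of the restart time and of the control cycle of the platform, and the overline denotes an over-approximation of the corresponding reachable set. The notation $\hat{x}[k]$ denotes the state of the system at the time when the actuator command is going to be applied (i.e., at the end of the cycle, at time $t = k\tau_c$).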
From properties of the BC, it is known that if the state is inside I, BC can find a control command that keeps the system in safe and restartable region. Condition (i) ensures that one control cycle after u mc is applied to the system the state will be inside I. If no faults occur within the control cycle, BC is guaranteed to be able to find a safe control for the system. However, if a fault occurs within the cycle, WD triggers a system restart at the end of the cycle. Condition (ii) ensures that state will be inside I when the restart completes (i.e., at τ c + τ r ). Furthermore, condition (iii) guarantees that during the control cycle and restart time (if it happens) state remains inside the safety region.
Note that, in a real implementation, calculating the reachable set, and therefore evaluating these conditions, requires time and does not happen instantaneously. Therefore, assuming $k$ is the current cycle, the above conditions have to be assessed before $t = k\tau_c$. At this time, however, $x[k] = x(k\tau_c)$ (the state of the system when the actuator command is going to be updated) is not available yet. To address this issue, the above conditions use $\hat{x}[k]$, which is the over-approximated prediction of $x[k]$ based on $x[k-1]$ (the sensor values sampled in the previous cycle). The prediction $\hat{x}[k]$ can be computed in the following way:
$$\hat{x}[k] = \overline{\mathrm{Reach}}_{\tau_c}(x[k-1], u_{k-1}),$$
where $\overline{\mathrm{Reach}}$ denotes an over-approximation of the reachable set and $x[k-1]$ is the sampled state at the previous cycle (the state of the system at the beginning of the current control cycle). The input $u_{k-1}$ is the control command sent to the actuators in the previous cycle. Since $u_{k-1}$ is not available in the first control cycle after a restart, the DM always chooses the BC in the first cycle. Various approaches for computing an over-approximation of the reachable set of nonlinear control systems are available in the literature; see, for example, [RWR17, Section VIII.c], [AK14] , and [ADG03].
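The switching logic above can be illustrated on a toy problem. The sketch below assumes a scalar plant $\dot{x} = Ax + u$ (not a system from this paper), for which the reachable sets of an interval are themselves intervals and can be computed exactly from the closed-form solution; for a real nonlinear plant, an over-approximation tool would take the place of `reach_at` and `reach_upto`. All names and numbers are illustrative assumptions.

```python
import math
from typing import Optional, Tuple

Interval = Tuple[float, float]

A = 1.0  # toy unstable scalar plant x' = A*x + u (assumption, not from the paper)


def flow(x0: float, u: float, t: float) -> float:
    """Exact solution of x' = A*x + u with constant input u."""
    e = math.exp(A * t)
    return e * x0 + (e - 1.0) / A * u


def reach_at(x: Interval, u: float, t: float) -> Interval:
    """Reachable interval at time t (the flow is monotone in x0)."""
    return flow(x[0], u, t), flow(x[1], u, t)


def reach_upto(x: Interval, u: float, t: float) -> Interval:
    """Hull of the reachable sets over [0, t]; each scalar trajectory is monotone in time."""
    lo_t, hi_t = reach_at(x, u, t)
    return min(x[0], lo_t), max(x[1], hi_t)


def inside(a: Interval, b: Interval) -> bool:
    return b[0] <= a[0] and a[1] <= b[1]


def decide(x_prev: float, u_prev: Optional[float], u_mc: float,
           inv: Interval, safe: Interval, tau_c: float, tau_r: float) -> str:
    """Return "mc" if u_mc passes conditions (i)-(iii), else "bc"."""
    if u_prev is None:  # first cycle after a restart: no previous command
        return "bc"
    x_hat = reach_at((x_prev, x_prev), u_prev, tau_c)  # prediction of x[k]
    ok = (inside(reach_at(x_hat, u_mc, tau_c), inv)                # (i)
          and inside(reach_at(x_hat, u_mc, tau_c + tau_r), inv)    # (ii)
          and inside(reach_upto(x_hat, u_mc, tau_c + tau_r), safe))  # (iii)
    return "mc" if ok else "bc"
```

With `inv = (-1, 1)`, `safe = (-2, 2)`, `tau_c = 0.05` and `tau_r = 0.25`, a command `u_mc = 3.0` from `x_prev = 0.5` passes condition (i) but violates condition (ii), which is the case where a restart would leave the state outside $I$.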
Base Controller Design
In this section, we provide a systematic approach to design base controllers ensuring properties mentioned in Subsection 4.2. To design BC, we use symbolic controller synthesis approach which uses the discrete abstractions of nonlinear physical systems [Tab09] . The advantage of using this approach is that it provides formally verified controllers for high-level specifications (usually expressed as linear temporal logic (LTL) formulae [BK08] ). One can readily see that the properties given in Subsection 4.2 are equivalent to invariance specification.
5.1. Transition Systems and Equivalence Relation. We recall the notion of transition system introduced in [Tab09] which will later be used as unified framework to represent nonlinear control systems and corresponding discrete abstractions.
Definition 5.1 (Transition system). A transition system is a tuple S = (X, X 0 , U, −→) where X is a set of states, X 0 ⊆ X is a set of initial states, U is a set of inputs, −→⊆ X × U × X is a transition relation.
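We denote by $x \xrightarrow{u} x'$ an alternative representation for a transition $(x, u, x') \in \longrightarrow$, where state $x'$ is called a $u$-successor (or simply successor) of state $x$, for some input $u \in U$. We denote by $\mathrm{Post}_u(x)$ the set of all $u$-successors of state $x$, and by $U(x)$ the set of all admissible inputs $u \in U$ for which $\mathrm{Post}_u(x)$ is non-empty. Now, we provide the notion of feedback refinement relation between two transition systems, introduced in [RWR17] , which is later used to construct discrete abstractions and base controllers for nonlinear control systems $\Sigma$.
Definition 5.2 (Feedback refinement relation). Consider two transition systems $S_1 = (X_1, X_{10}, U_1, \longrightarrow_1)$ and $S_2 = (X_2, X_{20}, U_2, \longrightarrow_2)$ with $U_2 \subseteq U_1$. A strict relation $Q \subseteq X_1 \times X_2$ is a feedback refinement relation from $S_1$ to $S_2$ if the following conditions hold for every pair $(x_1, x_2) \in Q$:
(i) $U_2(x_2) \subseteq U_1(x_1)$, and
(ii) $u \in U_2(x_2) \implies Q(\mathrm{Post}_u(x_1)) \subseteq \mathrm{Post}_u(x_2)$,
and the feedback refinement relation from $S_1$ to $S_2$ is denoted by $S_1 \preceq_Q S_2$.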
Intuitively, the above relation says that all admissible inputs of S 2 can be used in transition system S 1 such that all transitions in S 1 are associated with corresponding transitions in S 2 . As a result, one can easily refine controller synthesized for S 2 using feedback refinement relation Q to make it compatible for S 1 . Further details about feedback refinement relation and its role in the controller synthesis can be found in [RWR17] .
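For finite transition systems, the two conditions of a feedback refinement relation as given in [RWR17] (admissible inputs of $S_2$ are admissible in $S_1$, and the $Q$-image of every successor set in $S_1$ is contained in the matching successor set in $S_2$) can be checked by enumeration. The sketch below is illustrative only: the dictionary encoding of the transition relation and all function names are assumptions, and strictness of $Q$ is not checked.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

# (state, input) -> set of successor states; missing keys mean "no successor".
Trans = Dict[Tuple[Hashable, Hashable], Set[Hashable]]


def post(trans: Trans, x: Hashable, u: Hashable) -> Set[Hashable]:
    """Post_u(x): all u-successors of x."""
    return trans.get((x, u), set())


def admissible(trans: Trans, x: Hashable, inputs: Iterable[Hashable]) -> Set[Hashable]:
    """U(x): inputs for which Post_u(x) is non-empty."""
    return {u for u in inputs if post(trans, x, u)}


def is_feedback_refinement(rel: Set[Tuple[Hashable, Hashable]],
                           t1: Trans, t2: Trans,
                           inputs1: Iterable[Hashable],
                           inputs2: Iterable[Hashable]) -> bool:
    """Check both conditions for every related pair (x1, x2)."""
    inputs1, inputs2 = list(inputs1), list(inputs2)
    for x1, x2 in rel:
        u2 = admissible(t2, x2, inputs2)
        if not u2 <= admissible(t1, x1, inputs1):
            return False  # an abstract input is not usable in the concrete system
        for u in u2:
            image = {y2 for y1 in post(t1, x1, u)
                     for (z1, y2) in rel if z1 == y1}
            if not image <= post(t2, x2, u):
                return False  # a concrete transition has no abstract counterpart
    return True
```

A usage example: four concrete states mapped onto two abstract cells `"L"` and `"H"`, where the abstract system must allow the move from `"L"` to `"H"` because concrete state `1` steps to state `2`.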
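5.2. Sampled-Data Control System as a Transition System. As discussed in the previous sections, the sampling time can take any value in $h = \{\tau_c, \tau_r + \tau_c\}$ depending on the occurrence of a fault. We assume that the value of the control input is held for the respective sampling period. The transition system associated with the nonlinear control system $\Sigma$ with such a sampling behavior can be given by the tuple
$$S_h(\Sigma) = (X_h, X_{h0}, U_h, \longrightarrow_h),$$
where $X_h = \mathbb{R}^n$ is the set of states, $X_{h0} \subseteq X_h$ is the set of initial states, $U_h$ is the set of inputs (held constant over a sampling period), and $x \xrightarrow{u}_h x'$ if and only if $x' = \xi_{x,u}(\tau)$ for some $\tau \in h$.
For the transition system $S_h(\Sigma)$, the finite or infinite run generated from an initial state $x_0 \in X_{h0}$ is given by
$$x_0 \xrightarrow{u_0}_h x_1 \xrightarrow{u_1}_h x_2 \xrightarrow{u_2}_h \cdots$$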
By considering properties of BC mentioned in Subsection 4.2, one can view it as a safety controller synthesis problem for S h (Σ).
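Definition 5.3 (Safety controller). Consider a safe set $S \subseteq \mathbb{R}^n$ as given in Subsection 3.3. A safety controller for $S_h(\Sigma)$ is given by a map $C_h : X_h \to 2^{U_h}$ such that:
(i) $\mathrm{dom}(C_h) \subseteq S$,
(ii) for all $x \in \mathrm{dom}(C_h)$, $C_h(x) \subseteq U_h(x)$, and
(iii) for all $x \in \mathrm{dom}(C_h)$ and all $u \in C_h(x)$, $\mathrm{Post}_u(x) \subseteq \mathrm{dom}(C_h)$,
where $\mathrm{dom}(C_h) := \{x \in X_h \mid C_h(x) \neq \emptyset\}$.
Essentially, a safety controller generates infinite runs $x_0 \xrightarrow{u_0} x_1 \xrightarrow{u_1} x_2 \xrightarrow{u_2} \cdots$ such that $x_i \in S$ for all $i \in \mathbb{N}_0$. At the end of this section, we provide a systematic way to compute such a controller for linear control systems. However, finding such a control strategy for complex nonlinear control systems is quite difficult. This motivates the use of the abstraction-based synthesis methods described below.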
5.3. Discrete Abstraction. To design controllers for the concrete system $S_h(\Sigma)$ from its abstraction, the system and its abstraction must satisfy formal behavioural inclusions in terms of feedback refinement relations. Consider sampling times $\tau_c, \tau_r + \tau_c \in \mathbb{R}^+$ and quantization parameter $\eta \in (\mathbb{R}^+)^n$. The discrete abstraction of $S_h(\Sigma)$ is given by the tuple
$$S_q(\Sigma) = (X_q, X_{q0}, U_q, \longrightarrow_q),$$
where
• $X_q$ is a cover of $X_h$ and the elements of the cover $X_q$ are nonempty, closed hyper-intervals referred to as cells. For computation of the abstraction, we consider a subset $\bar{X}_q \subseteq X_q$ of congruent hyper-rectangles aligned on a uniform grid parameterized with quantization parameter $\eta \in (\mathbb{R}^+)^n$ and given by $\eta\mathbb{Z}^n = \{c \in \mathbb{R}^n \mid \exists_{k \in \mathbb{Z}^n} \forall_{i \in \{1,2,\ldots,n\}}\ c_i = k_i \eta_i\}$, i.e., $x_q \in \bar{X}_q$ implies that there exists $c \in \eta\mathbb{Z}^n$ with $x_q = c + [\![-\eta/2, \eta/2]\!]$. The remaining cells $X_q \setminus \bar{X}_q$ are considered as "overflow" symbols, see [Rei11, Sec
∅}. If $A \subseteq \bar{X}_q$, then $\mathrm{Post}_u(x_q) = A$, and otherwise $\mathrm{Post}_u(x_q) = \emptyset$. Moreover, $\mathrm{Post}_u(x_q) = \emptyset$ for all $x_q \in X_q \setminus \bar{X}_q$.
For the exact procedure to compute such discrete abstraction, we refer interested readers to [RZ16] .
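The uniform grid of cells $c + [\![-\eta/2, \eta/2]\!]$ with $c \in \eta\mathbb{Z}^n$ suggests a simple quantizer from a concrete state to a cell. The following is a minimal sketch, not the SCOTS implementation; the function names are hypothetical, and a state on a cell boundary (which belongs to two closed cells) is assigned to the upper one by convention.

```python
import math
from typing import Sequence, Tuple


def cell_index(x: Sequence[float], eta: Sequence[float]) -> Tuple[int, ...]:
    """Integer grid index k such that x lies in the cell centred at k * eta."""
    return tuple(math.floor(xi / ei + 0.5) for xi, ei in zip(x, eta))


def cell_center(idx: Sequence[int], eta: Sequence[float]) -> Tuple[float, ...]:
    """Centre c = k * eta (componentwise) of the cell with index k."""
    return tuple(k * e for k, e in zip(idx, eta))


# Example with a two-dimensional quantization parameter.
eta = (0.05, 0.1)
x = (0.12, -0.26)
idx = cell_index(x, eta)
center = cell_center(idx, eta)
```

Working with integer indices rather than floating-point centres keeps cell lookups exact, which matters when the cells are used as keys of a finite abstraction.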
Theorem 5.4. If $S_q(\Sigma)$ is a discrete abstraction of $S_h(\Sigma)$ with sampling times $\tau_c, \tau_r + \tau_c \in \mathbb{R}^+$ and quantization parameter $\eta \in (\mathbb{R}^+)^n$, then $S_h(\Sigma) \preceq_Q S_q(\Sigma)$.
Proof. The proof is similar to the proof of [RWR17, Theorem VIII.4].
The abstract safe set $\hat{S}$ for $S_q(\Sigma)$ is given by $\hat{S} := \{x_q \in X_q \mid Q^{-1}(x_q) \subseteq S\}$.
Controller Synthesis and Refinement.
In this section, we consider the problem of synthesizing a safety controller $C_h$ for $S_h(\Sigma)$ and safe set $S$. Because of the feedback refinement relation, we can solve the safety controller synthesis problem for the discrete abstraction $S_q(\Sigma)$ and abstract safe set $\hat{S}$. Let $C_q : X_q \to U_q$ be the maximal safety controller satisfying the conditions in Definition 5.3 for $S_q(\Sigma)$ and safe set $\hat{S}$. Since $S_q(\Sigma)$ has finitely many states and inputs, we can use the standard maximal fixed-point computation algorithm [Tab09] for the computation of $C_q$. One can easily refine this controller for $S_h(\Sigma)$ and safe set $S$ using the following theorem:
Theorem 5.5. If $S_h(\Sigma) \preceq_Q S_q(\Sigma)$ and $C_q$ is the safety controller for $S_q(\Sigma)$ and $\hat{S}$, then the refined controller $C_h := C_q \circ Q$ solves the safety problem for $S_h(\Sigma)$ and $S$.
Proof. The proof is similar to the proof of [RWR17, Theorem VI.3].
Intuitively, the refined controller C h for S h can naturally be obtained from the abstract controller C q by using the feedback refinement relation Q as a quantizer to map x h to x q ∈ Q(x h ).
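The maximal fixed-point computation on a finite abstraction can be sketched as follows. This is an illustrative version of the standard algorithm, not the SCOTS implementation: the dictionary encoding of the transition relation and the function name are assumptions. Starting from the abstract safe set, states are removed until every remaining state has at least one input whose successors all remain in the set.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

# (state, input) -> set of successor states; missing keys mean "no successor".
Trans = Dict[Tuple[Hashable, Hashable], Set[Hashable]]


def max_safety_controller(trans: Trans, inputs: Iterable[Hashable],
                          safe: Set[Hashable]) -> Dict[Hashable, Set[Hashable]]:
    """Return the maximal safety controller as a map state -> allowed inputs."""
    inputs = list(inputs)
    win = set(safe)
    while True:
        ctrl: Dict[Hashable, Set[Hashable]] = {}
        for x in win:
            # Keep inputs that are admissible and whose successors stay in win.
            ok = {u for u in inputs
                  if trans.get((x, u)) and trans[(x, u)] <= win}
            if ok:
                ctrl[x] = ok
        if set(ctrl) == win:  # fixed point: no state was removed
            return ctrl
        win = set(ctrl)


# Toy abstraction: input "a" moves right, "b" moves left (not available in state 3).
toy = {(1, "a"): {2}, (2, "a"): {3}, (3, "a"): {4},
       (1, "b"): {0}, (2, "b"): {1}}
controller = max_safety_controller(toy, ["a", "b"], {1, 2, 3})
```

In the toy example state 3 is removed in the first iteration because its only admissible input leaves the safe set, and input `"a"` is then disallowed in state 2. The domain of the resulting map plays the role of the invariant set, and composing it with a quantizer gives the refined controller in the sense of $C_q \circ Q$.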
Remark 5.6. The obtained controller C h solves the safety problem for the sampled system, i.e., the obtained base controller satisfies the first two properties mentioned in Subsection 4.2 with invariant set I = dom(C h ). However, one can ensure safety guarantee of inter-sampling trajectory (i.e., third property in Subsection 4.2) by shrinking the safe set by a magnitude computed using the global Lipschitz continuity property of map f .
Despite the applicability of the proposed approach to complex and nonlinear control systems, it suffers from the curse of dimensionality, i.e., the computational complexity increases exponentially with the state-space dimension of the concrete system. A few results are available that address this issue for some classes of nonlinear control systems [ZTA17, ZA17] . In the next subsection, we provide an alternative approach to compute the invariant set $I$ and the BC for linear control systems.
Base Controller for Linear Control Systems.
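In this subsection, we provide an algorithm to compute $I$ using discretized linear control systems. The continuous linear control system can be converted to a discrete control system with the sampling time of $\tau_c$ as:
$$x[k+1] = A_d\, x[k] + B_d\, u[k], \tag{4}$$
where $A_d = e^{A\tau_c}$ and $B_d = \int_0^{\tau_c} e^{As}\, ds\, B$ for continuous-time dynamics $\dot{x} = Ax + Bu$ with the input held constant over each sampling period.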
In this subsection, we show how to construct a BC with the following properties: for all $x[k] \in I$, there exists $u_0$, with $u[p] = u_0$ for $p \in \{k, k+1, \ldots, k+m\}$, such that (i) $x[k+1] \in I$ and (ii) $x[k+1+m] \in I$, where $m = \tau_r/\tau_c$ and $m \in \mathbb{N}$. However, these properties do not guarantee that the inter-sample behaviour is in the safe region. To address this issue, we need to readjust the safe region. For the systematic procedure to compute the readjusted safe region $S' \subseteq S$, we refer the interested reader to [ATR + 17, Section 5.1].
5.5.1. Finding the Invariant Subset I. To compute the set $I$, we closely follow the usual construction method based on backwards reachable sets to compute the largest invariant set for linear discrete-time systems (see, e.g., [BM08] ). We slightly modify this procedure and present it in Algorithm 1 to compute the subset $I \subseteq S'$ such that, for the discrete-time system in (4), $I$ satisfies the properties in Subsection 4.2.
ALGORITHM 1: Computing the invariant subset I.
STOP successfully.
Intuitively, this algorithm starts from the (readjusted) safe region as the initial region (line 2). In every iteration of this algorithm, this region is augmented in the extended state-control space $\mathbb{R}^{n+m}$ (line 5). This linear inequality is then projected back into the state space (line 6). The outcome of lines 5 and 6 is to calculate $I^{(p+1)}$, which is the subset of $I^{(p)}$ where a control value in $S_u$ exists such that the states one cycle and $m+1$ cycles later are inside $I^{(p)}$.
The algorithm proceeds until either I (p) ⊆ I (p+1) or I (p+1) = ∅. In the former case, procedure successfully ends (lines 7 to 8). The latter case indicates that the dynamics of the system does not allow such a region, for the given restart time. There are cases in which the procedure does not terminate in a finite number of steps unless a finite p max is fixed. This may happen if I (∞) has an empty interior, but it is not empty [BM08] .
If matrix A d and B d are controllable, we can use ideas from [RT17] to ensure convergence. However, in general, we cannot guarantee that the procedure in Algorithm 1 will converge to a non-empty I. In such cases, one may have to loosen the safety constraints of the system (i.e., S) or may have to switch to a hardware platform with a shorter restart time, to be able to apply this approach.
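The iteration described above can be illustrated for a scalar system, where every region is an interval and the projection step reduces to eliminating $u$ from a handful of inequalities. The sketch below is a simplified stand-in for Algorithm 1, not a transcription of it: it assumes a scalar system $x[k+1] = a\,x[k] + b\,u[k]$ with $|u| \le u_{max}$, uses Fourier-Motzkin elimination for the projection, and stops on a numerical tolerance because the exact inclusion test may never be met in floating point. All names and parameter values are hypothetical.

```python
from typing import List, Optional, Tuple

Ineq = Tuple[float, float, float]  # (alpha, beta, c) encodes alpha*x + beta*u <= c
Interval = Tuple[float, float]


def project_onto_x(cons: List[Ineq], tol: float = 1e-12) -> Optional[Interval]:
    """Eliminate u (Fourier-Motzkin); return the x-interval or None if empty."""
    pos = [c for c in cons if c[1] > tol]
    neg = [c for c in cons if c[1] < -tol]
    one_d = [(a, c) for a, b, c in cons if abs(b) <= tol]
    for ap, bp, cp in pos:
        for an, bn, cn in neg:
            # Scale both rows by positive factors so that the u terms cancel.
            one_d.append((ap * (-bn) + an * bp, cp * (-bn) + cn * bp))
    lo, hi = float("-inf"), float("inf")
    for a, c in one_d:
        if a > tol:
            hi = min(hi, c / a)
        elif a < -tol:
            lo = max(lo, c / a)
        elif c < -tol:
            return None  # the constraint 0 <= c is violated
    return (lo, hi) if lo <= hi else None


def invariant_interval(a: float, b: float, m: int, u_max: float,
                       safe: Interval, tol: float = 1e-9,
                       p_max: int = 500) -> Optional[Interval]:
    """Shrink the safe interval until it is (numerically) invariant."""
    a_m = a ** (m + 1)                           # state gain after m+1 cycles
    b_m = b * sum(a ** j for j in range(m + 1))  # input gain with u held constant
    lo, hi = safe
    for _ in range(p_max):
        cons = [
            (1.0, 0.0, hi), (-1.0, 0.0, -lo),          # x in I^(p)
            (0.0, 1.0, u_max), (0.0, -1.0, u_max),     # |u| <= u_max
            (a, b, hi), (-a, -b, -lo),                 # x[k+1] in I^(p)
            (a_m, b_m, hi), (-a_m, -b_m, -lo),         # x[k+m+1] in I^(p)
        ]
        nxt = project_onto_x(cons)
        if nxt is None:
            return None  # no region exists for this restart time
        if nxt[0] - lo <= tol and hi - nxt[1] <= tol:
            return nxt
        lo, hi = nxt
    return None  # p_max reached without convergence


result = invariant_interval(a=1.2, b=1.0, m=2, u_max=1.0, safe=(-10.0, 10.0))
```

For these example values the iteration contracts geometrically towards the interval $[-5, 5]$, the largest interval on whose boundary the saturated input can still bring both $x[k+1]$ and $x[k+3]$ back inside. The need for a tolerance and an iteration cap mirrors the remark that the exact procedure may not terminate in finitely many steps.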
Base Controller in Runtime.
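Using the invariant set $I$ computed as given in Subsection 5.5.1, one can compute a control input $u[k]$ at the $k$th sampling instance that satisfies the following conditions:
$$x[k+1] \in I, \qquad x[k+m+1] \in I, \qquad u[k] \in S_u.$$
By substituting $x[k+1]$ and $x[k+m+1]$, with the input held at $u[k]$ over the $m+1$ cycles, we get the following conditions, which are linear matrix inequalities in $u[k]$ when $I$ and $S_u$ are described by linear inequalities:
$$A_d\, x[k] + B_d\, u[k] \in I, \qquad A_d^{m+1} x[k] + \sum_{j=0}^{m} A_d^{j} B_d\, u[k] \in I, \qquad u[k] \in S_u.$$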
At runtime, BC receives the sensors values i.e., x[k], and calculates u[k] by solving these inequalities.
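For a scalar system the runtime step amounts to intersecting three intervals for $u[k]$. The sketch below assumes the scalar dynamics $x[k+1] = a\,x[k] + b\,u[k]$ with nonzero input gains, an interval invariant set, and a tie-breaking rule (the feasible input closest to zero) that is an illustrative choice rather than the one used by the authors.

```python
from typing import Optional, Tuple


def bc_command(x: float, a: float, b: float, m: int, u_max: float,
               inv: Tuple[float, float]) -> Optional[float]:
    """Return a constant input keeping x[k+1] and x[k+m+1] inside inv, or None."""
    lo_i, hi_i = inv
    a_m = a ** (m + 1)                           # state gain after m+1 cycles
    b_m = b * sum(a ** j for j in range(m + 1))  # input gain with u held constant
    lo, hi = -u_max, u_max
    for gain_x, gain_u in ((a, b), (a_m, b_m)):
        # Solve lo_i <= gain_x*x + gain_u*u <= hi_i for u (gain_u assumed nonzero).
        u1 = (lo_i - gain_x * x) / gain_u
        u2 = (hi_i - gain_x * x) / gain_u
        lo, hi = max(lo, min(u1, u2)), min(hi, max(u1, u2))
    if lo > hi:
        return None  # x is outside the region where a safe command exists
    return min(max(0.0, lo), hi)  # feasible input closest to zero


u = bc_command(4.0, a=1.2, b=1.0, m=2, u_max=1.0, inv=(-5.0, 5.0))
```

For `x = 4.0` the binding constraint is the one on $x[k+m+1]$, so the returned input places the state exactly on the upper boundary of the interval after a possible restart.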
Case study and Evaluation
To demonstrate the practicality of the proposed approach, we implemented a controller for two benchmark systems, (i) an inverted pendulum system and (ii) a 3-DOF helicopter [Inc18b], and empirically verified the fault-tolerance guarantees. We utilize one COTS platform to implement our controller. We inject faults in the control logic, the control application, and the operating system to demonstrate that the system remains safe despite the faults and recovers.
6.1. Experimental Setup. For the prototype of the proposed design, an i.MX7D application processor is used. This SoC provides two general-purpose ARM Cortex-A7 cores capable of running at a maximum frequency of 1 GHz and one real-time ARM Cortex-M4 core that runs at a maximum frequency of 200 MHz. The real-time core runs from tightly coupled memory to ensure the predictable behavior required for real-time applications and tasks. The real-time core of the considered platform runs FreeRTOS [fre18] , an operating system for real-time applications. Because our control tasks have real-time constraints, we implement our controller on the real-time core. Ideally, the general-purpose cores would have been completely disabled for the experiments. However, on the i.MX7D platform, only the Cortex-A7 cores have direct access to the flash memory, and only these two cores can load the binary images of the real-time core from flash into the real-time core's memory after each restart. Hence, instead of permanently disabling those cores, they are disabled only after the software of the real-time core is loaded from flash into the memory. Note that this mechanism is specific to this particular platform and does not impact the generality of our proposed technique. The manufacturer's boot procedure of the board is designed to boot the general-purpose cores and the real-time core at the same time. It includes extra initialization procedures that are necessary only for running the general-purpose cores' kernel and mounting its file system. It loads the real-time core code only after those procedures are completed.
To reduce the boot time of the real-time core, we made two modifications to the bootloader (u-boot) source code. (i) We embedded the binaries of the real-time core (including the BC, DM, and flushing task) as a static array in the u-boot source code, so that they become part of the u-boot binary after compilation.
(ii) In our modified boot process, at boot time the general-purpose processor copies the u-boot binary (which includes the FreeRTOS and application binaries) from the SD card into RAM. After initializing only the necessary peripherals and configuring the clock, u-boot loads the binaries of the real-time core into its tightly coupled memory and releases the core from reset. These modifications reduce the real-time core's boot time from seconds to less than 250 ms.
6.2. Example 1: Inverted Pendulum. For the first case study, we consider a nonlinear inverted pendulum system [Rei11] given by nonlinear differential equations as:
with parameters ω = 1 and γ = 0.0125. The states ξ_1 and ξ_2 are the angular position with respect to the downward vertical axis and the angular velocity of the pendulum, respectively. The control input υ(t) is restricted to [−4, 4]. We design the MC as υ_mc = 2(π − ξ_1 − ξ_2) to stabilize the pendulum at the upright position, that is, ξ = [π, 0]^T; the MC runs at a frequency of 20 Hz (i.e., τ_c = 50 ms) on the real-time core of the i.MX7D. To ensure the safety of the system (i.e., to prevent the pendulum from falling down), we consider a safety region for the states given by a polytope parameterized by
To ensure fault tolerance and safety during restart, we designed the BC using the abstraction-based approach discussed in Section 5. To synthesize the BC, we first constructed a discrete abstraction of the pendulum system in (7) using quantization parameter η = [0.05, 0.1]^T, sampling time τ_c = 0.050 s, and restart time τ_r = 0.250 s.
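Once a finite abstraction is available, safety synthesis amounts to finding the largest set of abstract states from which at least one input keeps every possible successor inside that set. The following is a minimal sketch of such a maximal fixed-point iteration on a toy transition system. It is not the SCOTS implementation and not the pendulum abstraction; the function name `max_safe_invariant` and the transition table are hypothetical and chosen only for illustration.

```python
def max_safe_invariant(transitions, safe):
    """Maximal fixed-point iteration for a safety specification.

    transitions: dict mapping (state, input) -> set of successor states
                 (a nondeterministic finite abstraction; hypothetical format).
    safe:        set of states that satisfy the safety specification.
    Returns (win, ctrl): the largest controlled-invariant subset of `safe`
    and, for each state in it, the set of inputs that keep the system inside.
    """
    win = set(safe)
    while True:
        ctrl = {}
        for (x, u), succs in transitions.items():
            # An input is admissible if all of its successors stay in `win`.
            if x in win and succs and succs <= win:
                ctrl.setdefault(x, set()).add(u)
        new_win = set(ctrl)
        if new_win == win:  # fixed point reached
            return win, ctrl
        win = new_win


# Toy abstraction (assumed for illustration): states 0..4, safe set {1, 2, 3}.
toy_transitions = {
    (1, "a"): {1, 2},
    (1, "b"): {0},
    (2, "a"): {3},
    (2, "b"): {1},
    (3, "a"): {4},
    (3, "b"): {3, 4},
}
win, ctrl = max_safe_invariant(toy_transitions, {1, 2, 3})
print(win)   # {1, 2}
print(ctrl)  # {1: {'a'}, 2: {'b'}}
```

In this toy example, state 3 is dropped in the first iteration because every input may leave the safe set, which in turn removes input "a" at state 2 in the second iteration.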
Further, we synthesized a safety controller using a maximal fixed-point computation algorithm. For the controller synthesis, we used the toolbox SCOTS [RZ16], with some modifications to accommodate the abstraction construction given in Subsection 5.3. The invariant states computed using the proposed approach are shown in Figure 3. To verify the efficacy of the designed controller, we implemented it on our experimental setup (i.MX7D) and tested it in closed loop with the inverted pendulum dynamics simulated on a computer under the test scenarios discussed in Subsection 6.4.
the i.MX7D through the serial port. At the end of every control cycle, a flushing task on the real-time core communicates with the PC to receive the sensor readings (elevation, pitch, and travel angles) and send the motor voltages. It also updates the hardware WD of the platform after sending the motor voltages. The PC uses a custom Linux driver to send the voltages to the 3-DOF helicopter motors and read the sensor values.
6.3.2. Testing the Base Controller. To verify that the constructed base controller has the desired properties, we simulated the system with this controller starting from every vertex of the region I and observed that the system's state at τ_c and at τ_c + τ_r time units after actuation was inside I.
platform to validate our design approach under different fault scenarios given in the next section.
6.4. Fault Injection. Table 1 lists the faults that were tested on the implementations. We also compare the results with Application-Level Simplex and System-Level Simplex. For the application-level faults, we verified that the mission controller was able to actuate the system as long as it did not jeopardize safety; when the system state approached states where the safety conditions would be violated, the BC took over and ensured safety. For the system-level faults, we observed that the WD restarted the system and that, after the restart, the system continued its operation.
Some of these faults are elaborated in the rest of this section.
6.4.1. Maximum Control Input in Wrong Way. The system should not leave the safe region even if the MC outputs a control input that would normally result in a crash. We consider an extreme case of this scenario in which the MC generates a control input that forces the system towards the unsafe region. The DM detected the unsafe MC commands (they did not satisfy the system safety conditions) and switched control to the BC until the system was back in the safety region, at which point control was handed back to the MC.
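The switching behavior described above can be illustrated with a small sketch. It assumes a simplified one-step check: the MC command is accepted only if the predicted next state stays inside an invariant set. The actual safety conditions (which also account for the restart time) are those of Section 5 and are not reproduced here. The names `decide`, `predict`, and `in_invariant`, as well as the toy one-dimensional dynamics, are hypothetical.

```python
def decide(state, u_mc, u_bc, predict, in_invariant):
    """Pick the command to apply in this control cycle.

    u_mc may be None if the mission controller produced no output.
    predict(state, u) returns the (assumed) state after one control period.
    in_invariant(state) tells whether a state lies in the invariant set.
    """
    if u_mc is not None and in_invariant(predict(state, u_mc)):
        return u_mc, "MC"
    return u_bc, "BC"


# Toy example (assumed): x' = u, control period 0.05 s, invariant set |x| <= 1.
TAU_C = 0.05
predict = lambda x, u: x + TAU_C * u
in_invariant = lambda x: abs(x) <= 1.0

print(decide(0.0, 4.0, 0.0, predict, in_invariant))    # (4.0, 'MC')
print(decide(0.9, 4.0, -0.9, predict, in_invariant))   # (-0.9, 'BC')
print(decide(0.0, None, 0.0, predict, in_invariant))   # (0.0, 'BC')
```

The second call corresponds to the scenario in this subsection: the MC pushes at the maximum input towards the boundary, the predicted state leaves the invariant set, and the BC command is applied instead.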
6.4.2. Timing Faults (CPU and Resource). The proposed solution also protects the system from timing faults. A faulty task may behave differently at runtime from its expected or reported behavior. For instance, it may lock a resource used by other critical tasks for longer than intended, or it may run for longer than its reported worst-case execution time (WCET), which was used for the schedulability test of the system. Timing faults may also originate from RTOS or driver misbehavior. If the fault delays or stops the execution of the DM or the BC, the WD triggers a system-wide restart. This recovers the system from the fault and keeps the physical system safe. We performed two experiments to test the fault tolerance against timing faults.
In the first experiment, we run an additional task on the system that uses the serial port in parallel to the flushing task to communicate with the PC. We inject a fault into this task so that in random execution cycles, it holds the lock on the serial port for more than its intended period. This prevents the flushing task from updating the actuator (which needs the serial port) before the end of the control cycle. As a result, WD expires and restarts the system. We verified that the system recovers from the fault and remains safe during the restart.
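The watchdog behavior in this experiment can be modeled in a few lines. The sketch below assumes that the WD is armed at time 0 and re-armed on every update by the flushing task; the 60 ms timeout is a hypothetical value chosen for illustration and is not taken from the implementation. The function name `first_watchdog_expiry` is likewise hypothetical.

```python
def first_watchdog_expiry(kick_times_ms, timeout_ms):
    """Return the time (ms) at which the watchdog first expires, or None.

    kick_times_ms: times at which the flushing task updates the WD.
    timeout_ms:    WD timeout (assumed value; not from the paper).
    The WD is assumed to be armed at t = 0.
    """
    deadline = timeout_ms
    for t in sorted(kick_times_ms):
        if t > deadline:
            return deadline      # the kick arrived too late: restart
        deadline = t + timeout_ms
    return None                  # every kick arrived in time


# Nominal run: one update at the end of each 50 ms control cycle.
print(first_watchdog_expiry([50, 100, 150, 200], 60))  # None

# Faulty run: the serial-port lock delays the fourth update to t = 260 ms.
print(first_watchdog_expiry([50, 100, 150, 260], 60))  # 210
```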
In the second experiment, we introduce a task that runs at the same priority as the BC and the DM. We inject a fault into this task such that, in some cycles, its execution time exceeds its reported WCET. FreeRTOS schedules tasks of equal priority in round-robin fashion, with a context switch every 1 ms. Therefore, the faulty task delays the response time of the DM and the BC. If the interference is too long, the output of the BC may not be ready by the time the flushing task needs to update the actuators. When this happens, the WD restarts the system.
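To make the interference concrete, the following sketch simulates round-robin scheduling with a 1 ms time slice among equal-priority tasks that are all released at time 0. It ignores context-switch overhead and higher-priority activity, and the execution demands (5 ms, 10 ms, 2 ms, 20 ms) are hypothetical numbers, not measurements from the experiments.

```python
from collections import deque


def round_robin_finish_times(exec_ms, slice_ms=1):
    """Simulate round-robin among equal-priority tasks released at t = 0.

    exec_ms: dict task name -> execution demand in ms (queue order follows
             the dict's insertion order).
    Returns a dict task name -> completion time in ms.
    """
    queue = deque(exec_ms.items())
    t = 0
    finish = {}
    while queue:
        name, remaining = queue.popleft()
        run = min(slice_ms, remaining)
        t += run
        remaining -= run
        if remaining > 0:
            queue.append((name, remaining))  # back of the queue
        else:
            finish[name] = t
    return finish


# The extra task respects its (assumed) reported WCET of 2 ms.
nominal = round_robin_finish_times({"DM": 5, "BC": 10, "extra": 2})
# The extra task overruns to 20 ms in a faulty cycle.
faulty = round_robin_finish_times({"DM": 5, "BC": 10, "extra": 20})

print(nominal["BC"], faulty["BC"])  # 17 24
```

In this toy setting the overrun pushes the BC's completion from 17 ms to 24 ms; whether such a delay causes a WD restart depends on how much slack remains before the flushing task runs at the end of the control cycle.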
Discussion
Software Faults: The proposed approach does not handle software faults that modify the program logic or the output of the BC and the DM at execution time. Using frameworks such as ARM TrustZone [Inc18a] and limiting access to these critical components can mitigate this issue.
Restart Time: As the restart time of the platform increases, the domain of the BC shrinks. Therefore, the proposed solution in its current form, even though useful for many platforms, may not suit some platforms with a longer restart time.
We are actively working on an alternative multi-stage booting solution for multicore platforms to mitigate this problem. The main idea is to boot one core with the bare minimum required to execute the BC in the shortest possible time. The BC can then keep the system safe while the real-time or general-purpose OS boots on the other cores. Once the boot process is complete, control switches to the controllers running on the OS. As a future extension, we are working on implementing this solution on the i.MX7D platform: we first boot the real-time core with FreeRTOS and run the BC on top of it, and then boot an embedded Linux on the general-purpose core.
Conclusion
Restarting is considered a reliable way to recover traditional computing systems from software faults. However, restarting safety-critical CPS is challenging. In this work, we proposed a novel approach that guarantees the safety and liveness of nonlinear physical systems in the presence of application-level and system-level software faults, using only one COTS processor and relying on complete system-level restarts.
