Towards “Ultra-Reliable” CPS: Reliability Analysis of Distributed Real-Time Systems

Abstract

In the avionics domain, “ultra-reliability” refers to the practice of ensuring quantifiably negligible residual failure rates in the presence of transient and permanent hardware faults. If autonomous Cyber-Physical Systems (CPS) in other domains, e.g., autonomous vehicles, drones, and industrial automation systems, are to permeate our everyday life in the not-so-distant future, then they also need to become ultra-reliable. However, the rigorous reliability engineering and analysis practices used in the avionics domain are expensive and time-consuming, and cannot be transferred to most other CPS domains. The increasing adoption of faster and cheaper, but less reliable, Commercial Off-The-Shelf (COTS) hardware is also an impediment in this regard. Motivated by the goal of ultra-reliable CPS, this dissertation shows how to soundly analyze the reliability of COTS-based implementations of actively replicated Networked Control Systems (NCSs), which are key building blocks of modern CPS, in the presence of transient hardware faults. When an NCS is deployed over field buses such as the Controller Area Network (CAN), transient faults are known to cause host crashes, network retransmissions, and incorrect computations. In addition, when an NCS is deployed over point-to-point networks such as Ethernet, even Byzantine errors (i.e., inconsistent broadcast transmissions) are possible. The analyses proposed in this dissertation account for NCS failures due to each of these error categories, and consider NCS failures in both the time and value domains. The analyses are also provably free of reliability anomalies. Such anomalies are problematic because they can result in unsound failure rate estimates, which might lead us to believe that a system is safer than it actually is. Specifically, this dissertation makes four main contributions.
(1) To reduce the failure rate of NCSs in the presence of Byzantine errors, we present a hard real-time design of a Byzantine Fault Tolerance (BFT) protocol for Ethernet-based systems. (2) We then propose a quantitative reliability analysis of the presented design in the presence of transient faults. (3) Next, we propose a similar analysis to upper-bound the failure probability of an actively replicated CAN-based NCS. (4) Finally, to upper-bound the long-term failure rate of the NCS more accurately, we propose analyses that take into account the temporal robustness properties of an NCS expressed as weakly-hard constraints. By design, our analyses can be applied in the context of full-system analyses. For instance, to certify a system consisting of multiple actively replicated NCSs deployed over a BFT atomic broadcast layer, the upper bounds on the failure rates of each NCS and of the atomic broadcast layer can be composed using the sum-of-failure-rates model.
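As a rough illustration of the composition step mentioned above, the following minimal sketch sums per-component upper bounds on failure rates to obtain a system-level upper bound. The function name, the component names, and all numeric values are hypothetical and chosen only for illustration; they are not results from the dissertation, and the sketch assumes every input is already a sound upper bound expressed in the same unit (e.g., failures per hour).

```python
import math


def compose_failure_rate_bounds(component_bounds):
    """Return the sum of per-component failure-rate upper bounds.

    Assumption: each value is a sound upper bound in the same unit.
    Summing upper bounds yields a conservative bound on the rate at
    which at least one component fails.
    """
    for name, bound in component_bounds.items():
        if bound < 0 or math.isnan(bound):
            raise ValueError(f"invalid failure-rate bound for {name}: {bound}")
    return sum(component_bounds.values())


# Hypothetical bounds (failures per hour) for two replicated NCSs
# and the BFT atomic broadcast layer they are deployed over.
bounds = {
    "ncs_1": 1e-10,
    "ncs_2": 2e-10,
    "bft_atomic_broadcast": 5e-11,
}

system_bound = compose_failure_rate_bounds(bounds)
print(f"system failure-rate upper bound: {system_bound:.2e} per hour")
```

Because only upper bounds are added, the result stays conservative: tightening the bound of any single component can only lower the system-level bound.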
