Increased reliability on Intel GPUs via software diverse redundancy

Abstract

In the past decade, Artificial Intelligence has transformed several industries, including the automotive, avionics, and healthcare sectors. Advanced Driver Assistance Systems (ADAS) are already deployed in production vehicles, with fully self-driving cars (SDCs) as the goal for the near future. ADAS and Autonomous Driving (AD) systems must process vast amounts of data at high frequency using complex Deep Learning (DL) algorithms while meeting tight Real-Time (RT) constraints. Traditional computing has become a bottleneck, since CPUs cannot process this data efficiently. High-performance GPUs have partially met these timing constraints, driving continuous innovation in device performance and efficiency. For example, NVIDIA introduced the Jetson AGX Xavier SoC in 2017, designed for machine learning applications in the automotive sector. However, AD and ADAS also impose safety constraints, such as functional safety. Redundancy is necessary to identify and correct erroneous outcomes, and the highest safety levels require diverse redundancy to avoid common cause faults (CCF). High-performance hardware for AD must be verified and validated (V&V) against its safety goals, but these processes are costly. The automotive industry therefore seeks to avoid non-recurring costs by using commercial off-the-shelf (COTS) products. COTS devices have drawbacks, however, including limited redundancy support and undisclosed implementation details. To overcome these limitations, researchers are developing software-only diverse redundancy solutions on top of COTS devices. The two main challenges are ensuring redundant computation, so that errors can be detected, and guaranteeing that the redundancy is diverse, so that errors are detected even when they affect all replicas. Current solutions are limited and mostly target NVIDIA GPUs. This thesis presents a software-only solution for diverse redundancy on Intel GPUs, providing strong diversity guarantees for the first time.
Built on OpenCL, a hardware-agnostic programming framework, the technique relies on intrinsics (special functions optimized by integrators). The intrinsics make it possible to identify the hardware threads of the GPU and to tailor the workload geometry and its allocation to specific computing elements. As a result, redundant threads run on physically diverse execution units, meeting diverse redundancy requirements with affordable performance overheads. Several scenarios are developed to measure the impact of each modification on a standard OpenCL kernel execution: first, allocating only half of the available GPU resources; then, overriding the scheduler to use half of the resources; next, duplicating the work to mimic the execution of two kernels; and finally, executing both kernels in independent parts of the GPU.
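
The idea of running redundant work on physically separate execution units can be illustrated with a small host-side model. This is only a sketch under stated assumptions, not the OpenCL implementation of the thesis: it is plain Python, the execution-unit count `NUM_EUS`, the round-robin placement in `assign`, the stand-in `kernel`, and the bit-flip fault model are all hypothetical choices made for illustration. It shows that when the two replicas are pinned to disjoint halves of the device, a fault in a single execution unit corrupts only one replica, so comparing the outputs exposes it.

```python
# Toy host-side model (not OpenCL): two replicas of a kernel are pinned to
# disjoint halves of a hypothetical set of execution units (EUs), and their
# outputs are compared to detect errors.

NUM_EUS = 8  # hypothetical EU count, chosen only for illustration


def eu_partition(num_eus):
    """Split EU ids into two disjoint halves, one per replica."""
    half = num_eus // 2
    return list(range(half)), list(range(half, num_eus))


def assign(work_items, eus):
    """Map each work-item id to an EU from the given subset, round-robin."""
    return {wi: eus[wi % len(eus)] for wi in work_items}


def kernel(x):
    """Stand-in for the real computation."""
    return x * x + 1


def run_replica(data, placement, faulty_eu=None):
    """Run the kernel per work-item; corrupt results produced on faulty_eu."""
    out = []
    for wi, x in enumerate(data):
        y = kernel(x)
        if placement[wi] == faulty_eu:
            y ^= 1  # inject a single bit flip to model a hardware fault
        out.append(y)
    return out


def redundant_run(data, faulty_eu=None):
    """Execute both replicas on disjoint EUs; return (result, mismatches)."""
    eus_a, eus_b = eu_partition(NUM_EUS)
    items = range(len(data))
    place_a, place_b = assign(items, eus_a), assign(items, eus_b)
    out_a = run_replica(data, place_a, faulty_eu)
    out_b = run_replica(data, place_b, faulty_eu)
    mismatches = [i for i, (a, b) in enumerate(zip(out_a, out_b)) if a != b]
    return out_a, mismatches
```

Calling `redundant_run(list(range(16)))` reports no mismatches, while `redundant_run(list(range(16)), faulty_eu=2)` flags exactly the work-items that the first replica placed on the faulty unit. If both replicas were allowed to share that unit, the same fault could corrupt both outputs identically and go undetected, which is the common cause fault that diverse placement is meant to rule out.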
