Search CORE

26 research outputs found

IMPROVING THE PERFORMANCE AND TIME-PREDICTABILITY OF GPUs

Author: Huangfu Yijie
Publication venue: VCU Scholars Compass
Publication date: 01/01/2017
Field of study

Graphic Processing Units (GPUs) are originally mainly designed to accelerate graphic applications. Now the capability of GPUs to accelerate applications that can be parallelized into a massive number of threads makes GPUs the ideal accelerator for boosting the performance of such kind of general-purpose applications. Meanwhile it is also very promising to apply GPUs to embedded and real-time applications as well, where high throughput and intensive computation are also needed. However, due to the different architecture and programming model of GPUs, how to fully utilize the advanced architectural features of GPUs to boost the performance and how to analyze the worst-case execution time (WCET) of GPU applications are the problems that need to be addressed before exploiting GPUs further in embedded and real-time applications. We propose to apply both architectural modification and static analysis methods to address these problems. First, we propose to study the GPU cache behavior and use bypassing to reduce unnecessary memory traffic and to improve the performance. The results show that the proposed bypassing method can reduce the global memory traffic by about 22% and improve the performance by about 13% on average. Second, we propose a cache access reordering framework based on both architectural extension and static analysis to improve the predictability of GPU L1 data caches. The evaluation results show that the proposed method can provide good predictability in GPU L1 data caches, while allowing the dynamic warp scheduling for good performance. Third, based on the analysis of the architecture and dynamic behavior of GPUs, we propose a WCET timing model based on a predictable warp scheduling policy to enable the WCET estimation on GPUs. The experimental results show that the proposed WCET analyzer can effectively provide WCET estimations for both soft and hard real-time application purposes. Last, we propose to analyze the shared Last Level Cache (LLC) in integrated CPU-GPU architectures and to integrate the analysis of the shared LLC into the WCET analysis of the GPU kernels in such systems. The results show that the proposed shared data LLC analysis method can improve the accuracy of the shared LLC miss rate estimations, which can further improve the WCET estimations of the GPU kernels

VCU Scholars Compass

Analysis and modeling of the timing behavior of GPU architectures

Author: Voudouris P.
Publication venue
Publication date: 01/01/2014
Field of study

Repository TU/e

Pure OAI Repository

Dynamic Memory Bandwidth Allocation for Real-Time GPU-Based SoC Platforms

Author: Aghilinasab Homa
Publication venue: 'University of Waterloo'
Publication date: 14/05/2020
Field of study

Heterogeneous SoC platforms, comprising both general purpose CPUs and accelerators such as a GPU, are becoming increasingly attractive for real-time and mixed-criticality systems to cope with the computational demand of data parallel applications. However, contention for access to shared main memory can lead to significant performance degradation on both CPU and GPU. Existing work has shown that memory bandwidth throttling is effective in protecting real-time applications from memory-intensive, best-effort ones; however, due to the inherent pessimism involved in worst-case execution time estimation, such approaches can unduly restrict the bandwidth available to best-effort applications. In this work, we propose a novel memory bandwidth allocation scheme where we dynamically monitor the progress of a real-time application and increase the bandwidth share of best-effort ones whenever it is safe to do so. Specifically, we demonstrate our approach by protecting a real-time GPU kernel from best-effort CPU tasks. Based on profiling information, we first build a worst case execution time estimation model for the GPU kernel. Using such model, we then show how to dynamically recompute on-line the maximum memory budget that can be allocated to best-effort tasks without exceeding the kernel’s assigned execution budget. We implement our proposed technique on NVIDIA embedded SoC and demonstrate its effectiveness on a variety of GPU and CPU benchmarks

University of Waterloo's Institutional Repository

A Perspective on Safety and Real-Time Issues for GPU Accelerated ADAS

Author: Nicola Capodieci
Roberto Cavicchioli
Sanudo Olmedo Ignacio
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2018
Field of study

The current trend in designing Advanced Driving Assistance System (ADAS) is to enhance their computing power by using modern multi/many core accelerators. For many critical applications such as pedestrian detection, line following, and path planning the Graphic Processing Unit (GPU) is the most popular choice for obtaining orders of magnitude increases in performance at modest power consumption. This is made possible by exploiting the general purpose nature of today's GPUs, as such devices are known to express unprecedented performance per watt on generic embarrassingly parallel workloads (as opposed of just graphical rendering, as GPUs where only designed to sustain in previous generations). In this work, we explore novel challenges that system engineers have to face in terms of real-time constraints and functional safety when the GPU is the chosen accelerator. More specifically, we investigate how much of the adopted safety standards currently applied for traditional platforms can be translated to a GPU accelerated platform used in critical scenarios

Crossref

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Timing Analysis of General Purpose Graphics Processing Units for Real-Time Systems: Models and Analyses

Author: Kostiantyn Berezovskyi
Publication venue
Publication date: 20/04/2016
Field of study

Repositório Aberto da Universidade do Porto

Recommended from our members

A SIMD architecture for hard real-time systems

Author: Spliet Roy
Publication venue: University of Cambridge
Publication date: 31/03/2020
Field of study

Emerging safety-critical systems require high-performance data-parallel architectures and, problematically, ones that can guarantee tight and safe worst-case execution times. Given the complexity of existing architectures like GPUs, it is unlikely that sufficiently accurate models and algorithms for timing analysis will emerge in the foreseeable future. This motivates a clean-slate approach to designing a real-time data-parallel architecture. In this work I present Sim-D: a wide-SIMD architecture for hard real-time systems. Similar to GPUs, Sim-D performs hardware strip-mining to schedule the work for a compute kernel in entities called work-groups. Sim-D schedules the work for each work-group as a sequence of uninterruptible access- and execute program phases, interleaving the phases of two work-groups. By providing performance isolation between the memory- and compute resources, the execution time of each phase can be tightly bound through static analysis. I present a predictable closed-page DRAM controller that processes requests for large 1D- and 2D blocks of data, as well as indirect indexed transfers. These large transfers coalesce the data requests of a whole work-group. For a linear 4KiB transfer over a 64-bit data bus, the utilisation provably exceeds 78% for DDR4-3200AA DRAM. For 2D blocks, a well-chosen tiling configuration can achieve near-similar efficiency. I show that bounds on the execution time of indexed transfers are pessimistic by nature, but propose a novel snoopy indexed transfer mechanism that permits more reasonable bounds when the buffer size is limited. Finally, I present a worst-case execution time calculation algorithm for Sim-D. This algorithm is paired with two hardware work-group scheduling policies that deterministically reduce run-time variance. The worst-case execution time analysis algorithm combines static control flow analysis with a simulation-based cost model for execution and DRAM transfers. Its key novelty is the addition of a stage that considers work-group scheduling effects. I show that the work-group scheduling policies degrade performance on average by 8.9%, but permit the calculation of worst-case execution time bounds that are tight within 14.3% on average for benchmarks that avoid inefficient indexed transfers

Apollo (Cambridge)

Towards Parallel Programming Models for Predictability

Author
Publication venue: OASIcs - OpenAccess Series in Informatics. 12th International Workshop on Worst-Case Execution Time Analysis
Publication date: 01/01/2012
Field of study

Future embedded systems for performance-demanding applications will be massively parallel. High performance tasks will be parallel programs, running on several cores, rather than single threads running on single cores. For hard real-time applications, WCETs for such tasks must be bounded. Low-level parallel programming models, based on concurrent threads, are notoriously hard to use due to their inherent nondeterminism. Therefore the parallel processing community has long considered high-level parallel programming models, which restrict the low-level models to regain determinism. In this position paper we argue that such parallel programming models are beneficial also for WCET analysis of parallel programs. We review some proposed models, and discuss their influence on timing predictability. In particular we identify data parallel programming as a suitable paradigm as it is deterministic and allows current methods for WCET analysis to be extended to parallel code. GPUs are increasingly used for high performance applications: we discuss a current GPU architecture, and we argue that it offers a parallel platform for compute-intensive applications for which it seems possible to construct precise timing models. Thus, a promising route for the future is to develop WCET analyses for data-parallel software running on GPUs

Dagstuhl Research Online Publication Server

GPU devices for safety-critical systems: a survey

Author: Abella Ferrer Jaume
Calderón Torres Alejandro Josué
Cazorla Almeida Francisco Javier
Flores Barroso José Luis
Kosmidis Leonidas
Pérez Cerrolaza Jon
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/07/2023
Field of study

Graphics Processing Unit (GPU) devices and their associated software programming languages and frameworks can deliver the computing performance required to facilitate the development of next-generation high-performance safety-critical systems such as autonomous driving systems. However, the integration of complex, parallel, and computationally demanding software functions with different safety-criticality levels on GPU devices with shared hardware resources contributes to several safety certification challenges. This survey categorizes and provides an overview of research contributions that address GPU devices’ random hardware failures, systematic failures, and independence of execution.This work has been partially supported by the European Research Council with Horizon 2020 (grant agreements No. 772773 and 871465), the Spanish Ministry of Science and Innovation under grant PID2019-107255GB, the HiPEAC Network of Excellence and the Basque Government under grant KK-2019-00035. The Spanish Ministry of Economy and Competitiveness has also partially supported Leonidas Kosmidis with a Juan de la Cierva Incorporación postdoctoral fellowship (FJCI-2020- 045931-I).Peer ReviewedPostprint (author's final draft

UPCommons. Portal del coneixement obert de la UPC

GPU Wavefront Splitting for Safety-Critical Systems

Author: Klashtorny Artem
Publication venue: 'University of Waterloo'
Publication date: 04/10/2022
Field of study

Graphics processing units (GPUs) are compute platforms that are ideal for highly parallel workloads due to their high degree of hardware parallelism. Parallelism offered by GPUs lends itself well to machine learning and computer vision applications, including in safety-critical systems. Safety-critical systems require a guarantee of timing predictability. Guaranteeing timing predictability means being able to statically analyze the worst-case execution time (WCET) of the GPU program. Unfortunately, existing GPUs are designed for average-case performance and are thus not designed for timing predictability. Consequently, there is potential for research effort to provide these guarantees. Prior research works have proposed several new techniques to improve performance. One such technique is wavefront splitting, which reduces the number of idle threads on the GPU and increase utilization. However, no prior work addresses the WCET of this technique. The purpose of this thesis is to develop a GPU implementation for safety-critical systems that leverages wavefront splitting and to enable analysis of the WCET in such an implementation

University of Waterloo's Institutional Repository

Vector extensions in COTS processors to increase guaranteed performance in real-time systems

Author: Abella Ferrer Jaume
Cazorla Almeida Francisco Javier
Jorba Jorba Josep
Kosmidis Leonidas
Mezzetti Enrico
Pujol Torramorell Roger
Tabani Hamid
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 31/08/2022
Field of study

The need for increased application performance in high-integrity systems like those in avionics is on the rise as software continues to implement more complex functionalities. The prevalent computing solution for future high-integrity embedded products are multi-processors systems-on-chip (MPSoC) processors. MPSoCs include CPU multicores that enable improving performance via thread-level parallelism. MPSoCs also include generic accelerators (GPUs) and application-specific accelerators. However, the data processing approach (DPA) required to exploit each of these underlying parallel hardware blocks carries several open challenges to enable the safe deployment in high-integrity domains. The main challenges include the qualification of its associated runtime system and the difficulties in analyzing programs deploying the DPA with out-of-the-box timing analysis and code coverage tools. In this work, we perform a thorough analysis of vector extensions (VExt) in current COTS processors for high-integrity systems. We show that VExt prevent many of the challenges arising with parallel programming models and GPUs. Unlike other DPAs, VExt require no runtime support, prevent by design race conditions that might arise with parallel programming models, and have minimum impact on the software ecosystem enabling the use of existing code coverage and timing analysis tools. We develop vectorized versions of neural network kernels and show that the NVIDIA Xavier VExt provide a reasonable increase in guaranteed application performance of up to 2.7x. Our analysis contends that VExt are the DPA approach with arguably the fastest path for adoption in high-integrity systems.This work has received funding from the the European Research Council (ERC) grant agreement No. 772773 (SuPerCom) and the Spanish Ministry of Science and Innovation (AEI/10.13039/501100011033) under grants PID2019-107255GB-C21 and IJC2020-045931-I.Peer ReviewedPostprint (author's final draft

UPCommons. Portal del coneixement obert de la UPC