5 research outputs found

    VComputeBench: A Vulkan Benchmark Suite for GPGPU on Mobile and Embedded GPUs

    GPUs have become immensely important computational units on embedded and mobile devices. However, GPGPU developers are often unable to exploit the compute power of GPUs on these devices, mainly because traditional programming models such as CUDA and OpenCL are poorly supported there. The recently introduced Vulkan API provides a new programming model that could be explored for GPGPU computing on these devices, as it supports compute and promises to be portable across architectures. In this paper we propose VComputeBench, a set of benchmarks that help developers understand the performance and portability characteristics of Vulkan. We also evaluate the suitability of Vulkan as an emerging cross-platform GPGPU framework through a thorough analysis of its performance compared to CUDA and OpenCL on both mobile and desktop platforms. Our experiments show that Vulkan provides better platform support on mobile devices and can be regarded as a good cross-platform GPGPU framework. It offers comparable performance, and with some low-level optimizations it achieves average speedups of 1.53x and 1.66x over CUDA and OpenCL, respectively, on desktop platforms, and a 1.59x average speedup over OpenCL on mobile platforms. However, while Vulkan's low-level control can enhance performance, it requires significantly higher programming effort.
    Funding: EC/H2020/688759/EU/Low-Power Parallel Computing on GPUs 2/LPGPU
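
    To make the reported averages concrete, the sketch below shows one plausible way to aggregate per-benchmark timings into a single average speedup. The geometric mean and the timing values are assumptions for illustration only; the abstract does not describe the paper's aggregation procedure, so this is not VComputeBench's actual harness.

        import math

        def average_speedup(baseline_times, vulkan_times):
            """Aggregate per-benchmark speedups into one figure.

            Uses the geometric mean, the usual choice for averaging
            ratios (an assumption, not taken from the paper).
            """
            ratios = [b / v for b, v in zip(baseline_times, vulkan_times)]
            return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

        # Hypothetical wall-clock times (seconds) for three benchmarks:
        cuda_times = [1.20, 0.80, 2.50]
        vulkan_times = [0.75, 0.60, 1.70]
        print(f"avg speedup vs CUDA: {average_speedup(cuda_times, vulkan_times):.2f}x")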

    A domain decomposition strategy for a very high-order finite volumes scheme applied to cardiac electrophysiology

    In this paper, a domain decomposition technique for a very high-order finite volume scheme is proposed. The objective is to obtain an efficient way to perform numerical simulations in cardiac electrophysiology. The aim is to extend a previously designed very high-order numerical scheme in which large stencils are used for polynomial reconstructions; particular attention must therefore be paid to maintaining parallel scalability. Here, we propose to constrain the stencils to the subdomains and their first layer of neighbors. The method is shown to remain accurate and to scale well up to the point where the subdomains no longer contain enough cells. These high-order schemes thus prove to be efficient tools for realistic simulations in cardiac electrophysiology.
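
    A minimal Python sketch of the stencil restriction described above: given a cell's candidate reconstruction stencil and a mesh partition, keep only cells that lie in the cell's own subdomain or in the first layer of ghost cells just outside it. The dict-based mesh representation is a hypothetical simplification, and "first layer of neighbors" is read here as a one-cell halo; the paper's data structures may differ.

        def restrict_stencil(cell, candidate_stencil, partition, cell_neighbors):
            """Drop stencil members outside the cell's subdomain, except
            cells in the first halo layer (those touching the subdomain).

            partition: cell id -> subdomain id
            cell_neighbors: cell id -> list of adjacent cell ids
            """
            home = partition[cell]

            def allowed(c):
                if partition[c] == home:
                    return True
                # Outside cell: keep only if it touches the home subdomain.
                return any(partition[n] == home for n in cell_neighbors[c])

            return [c for c in candidate_stencil if allowed(c)]

        # Hypothetical 1D mesh: cells 0-5, subdomain 0 = {0,1,2}, 1 = {3,4,5}.
        partition = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
        adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
        print(restrict_stencil(2, [0, 1, 2, 3, 4, 5], partition, adjacency))
        # -> [0, 1, 2, 3]: cell 3 is kept as the halo layer; 4 and 5 are dropped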

    Doctor of Philosophy

    Partial differential equations (PDEs) are widely used in science and engineering to model phenomena such as sound, heat, and electrostatics. In many practical applications, solving PDEs requires tessellating the computational domain into an unstructured mesh and entails computationally expensive, time-consuming processing. Efficient and fast PDE solvers on unstructured meshes are therefore important in these applications. Relative to CPUs, SIMD streaming processors such as GPUs have faster-growing speed and greater power efficiency, which has given them an increasingly important role in high-performance computing. By combining suitable parallel algorithms with these streaming processors, we can develop very efficient numerical PDE solvers. The contributions of this dissertation are twofold: two general strategies for designing efficient PDE solvers on GPUs, and specific applications of these strategies to different types of PDEs. The dissertation consists of four parts. First, we describe the general strategies: the domain decomposition strategy and the hybrid gathering strategy. Next, we introduce a parallel algorithm for efficiently solving the eikonal equation on fully unstructured meshes. Third, we present the algorithms and data structures necessary to move the entire FEM pipeline to the GPU. Fourth, we propose a parallel algorithm for solving the level-set equation on fully unstructured 2D or 3D meshes or manifolds; this algorithm combines a narrow-band scheme with domain decomposition for efficient level-set solving.
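
    To make the eikonal setting concrete, here is a minimal Python sketch of an iterative relaxation on an unstructured mesh reduced to a graph of vertices and edge lengths. It uses the crude upwind approximation T[v] = min over neighbors u of T[u] + dist(u, v)/speed, iterated to convergence. The dissertation's GPU algorithm uses a proper local solver on mesh elements and a parallel active-list scheme, so this serial graph-distance version is a simplified stand-in, not the actual method.

        import math

        def solve_eikonal(neighbors, lengths, sources, speed=1.0, tol=1e-9):
            """Iteratively relax arrival times T for |grad T| = 1/speed.

            neighbors: vertex -> list of adjacent vertices
            lengths: sorted (u, v) tuple -> edge length
            sources: set of vertices with T = 0
            Returns vertex -> arrival time (graph-distance approximation).
            """
            T = {v: (0.0 if v in sources else math.inf) for v in neighbors}
            changed = True
            while changed:
                changed = False
                for v in neighbors:
                    if v in sources:
                        continue
                    best = min(T[u] + lengths[tuple(sorted((u, v)))] / speed
                               for u in neighbors[v])
                    if best < T[v] - tol:
                        T[v] = best
                        changed = True
            return T

        # Hypothetical 4-vertex mesh skeleton with a single source at vertex 0:
        nbrs = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
        lens = {(0, 1): 1.0, (0, 2): 1.0, (1, 3): 1.0, (2, 3): 2.0}
        print(solve_eikonal(nbrs, lens, sources={0}))
        # -> {0: 0.0, 1: 1.0, 2: 1.0, 3: 2.0}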

    Performance engineering of data-intensive applications

    Data-intensive programs deal with big chunks of data and often have compute-intensive characteristics. Among HPC application domains, big-data analytics, machine learning, and the more recent deep-learning models are well-known data-intensive applications. Designing such applications efficiently demands extensive knowledge of the target hardware and software, particularly the memory/cache hierarchy and the data communication among threads and processes. This requirement makes code development an arduous task, as inappropriate data structures and algorithm design may result in superfluous runtime, let alone hardware incompatibilities when porting the code to other platforms. In this dissertation, we introduce a set of tools and methods for the performance engineering of parallel data-intensive programs. We start with performance profiling to gain insight into thread communication and relevant code optimizations. Then, narrowing our scope to deep-learning applications, we introduce tools for enhancing the performance portability and scalability of convolutional neural networks (ConvNets) in the inference and training phases.

    Our first contribution is a novel performance-profiling method that unveils potential communication bottlenecks caused by data-access patterns and thread interactions. Our findings show that data shared between a pair of threads should be reused within reasonably short intervals to preserve data locality, yet existing profilers neglect this and mainly report communication volume. We propose new hardware-independent metrics to characterize thread communication and provide suggestions for applying appropriate optimizations to a specific code region. Our experiments show that applying the relevant optimizations improves performance in the Rodinia benchmarks by up to 56%.

    For the next contribution, we developed a framework for the automatic generation of efficient and performance-portable convolution kernels, including Winograd convolutions, for various GPU platforms, employing a synergy of meta-programming, symbolic execution, and auto-tuning. The results demonstrate that kernels generated through this automated optimization pipeline achieve runtimes close to those of vendor deep-learning libraries, and the minimal programming effort required confirms the performance portability of our approach. Furthermore, our symbolic execution method exploits repetitive patterns in Winograd convolutions, enabling us to reduce the number of arithmetic operations by up to 62% without compromising numerical stability.

    Lastly, we investigate methods to scale the performance of ConvNets in the training and inference phases. Our specialized training platform, equipped with a novel topology-aware network pruning algorithm, enables rapid training, neural architecture search, and network compression; model training can thus be easily scaled to a multitude of compute nodes, leading to faster model design at lower operating cost. Furthermore, the network compression component scales a ConvNet model down by removing redundant layers, preparing the model for more efficient deployment. Altogether, this work demonstrates the necessity and the benefit of performance engineering and parallel programming methods in accelerating emerging data-intensive workloads. With the help of the proposed tools and techniques, we pinpoint data communication bottlenecks and achieve performance portability and scalability in data-intensive applications.
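
    As a concrete illustration of how Winograd convolutions trade multiplications for additions (the effect the kernel generator above exploits), here is a textbook 1D Winograd F(2,3) sketch in Python, computing two outputs of a 3-tap filter with 4 element-wise multiplications instead of the 6 required by direct convolution. These are the standard F(2,3) transform matrices, not the dissertation's generated GPU kernels.

        import numpy as np

        # Standard F(2,3) transform matrices: 2 outputs, 3-tap filter.
        B_T = np.array([[1,  0, -1,  0],
                        [0,  1,  1,  0],
                        [0, -1,  1,  0],
                        [0,  1,  0, -1]], dtype=float)   # input transform
        G = np.array([[1.0,  0.0, 0.0],
                      [0.5,  0.5, 0.5],
                      [0.5, -0.5, 0.5],
                      [0.0,  0.0, 1.0]])                 # filter transform
        A_T = np.array([[1, 1,  1,  0],
                        [0, 1, -1, -1]], dtype=float)    # output transform

        def winograd_f23(d, g):
            """Two outputs of the valid correlation of a 4-sample input d
            with a 3-tap filter g, using only 4 element-wise multiplies."""
            m = (G @ g) * (B_T @ d)   # the only multiplications
            return A_T @ m

        d = np.array([1.0, 2.0, 3.0, 4.0])
        g = np.array([0.5, 1.0, -1.0])
        print(winograd_f23(d, g))                     # -> [-0.5  0. ]
        print(np.convolve(d, g[::-1], mode="valid"))  # direct check, same result

    The direct method needs 6 multiplications for these two outputs, F(2,3) needs 4, and larger tiles such as F(2x2, 3x3) push the reduction further at the cost of extra additions in the transforms.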

    Proceedings of the International Workshop "Innovation Information Technologies: Theory and Practice": Dresden, Germany, September 6-10, 2010

    This international workshop is a high-quality seminar that provides a forum for the exchange of scientific achievements between the research communities of different universities and research institutes in the area of innovative information technologies. It continues the series of Russian-German workshops previously organized by the universities of Dresden, Karlsruhe, and Ufa. The workshop was arranged in 9 sessions covering the major topics: Modern Trends in Information Technology, Knowledge-Based Systems and Semantic Modelling, Software Technology and High-Performance Computing, Geo-Information Systems and Virtual Reality, System and Process Engineering, Process Control and Management, and Corporate Information Systems.