7 research outputs found
CoreTSAR: Task Scheduling for Accelerator-aware Runtimes
Heterogeneous supercomputers that incorporate computational accelerators such as GPUs are increasingly popular due to their high peak performance, energy efficiency, and comparatively low cost. Unfortunately, the programming models and frameworks designed to extract performance from all computational units still lack the flexibility of their CPU-only counterparts. Accelerated OpenMP improves this situation by supporting natural migration of OpenMP code from CPUs to a GPU. However, these implementations currently lose one of OpenMP's best features, its flexibility: typical OpenMP applications can run on any number of CPUs, but GPU implementations do not transparently employ multiple GPUs on a node or a mix of GPUs and CPUs. To address these shortcomings, we present CoreTSAR, our runtime library for dynamically scheduling tasks across heterogeneous resources, and propose straightforward extensions that incorporate this functionality into Accelerated OpenMP. We show that our approach can provide nearly linear speedup to four GPUs over using only CPUs or one GPU, while increasing the overall flexibility of Accelerated OpenMP.
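The kind of adaptive co-scheduling the abstract describes can be sketched as follows. This is a minimal, hypothetical illustration of proportional work partitioning across heterogeneous devices, not CoreTSAR's actual algorithm; the device names and throughput numbers are invented for the example.

```python
# Hypothetical sketch of adaptive work partitioning in the spirit of
# runtimes like CoreTSAR: split loop iterations across devices in
# proportion to the throughput each device demonstrated on a previous
# pass. Device names and timing numbers are illustrative only.

def adaptive_split(total_iters, throughputs):
    """Divide iterations proportionally to measured device throughput."""
    total_rate = sum(throughputs.values())
    shares = {}
    assigned = 0
    devices = sorted(throughputs)
    for dev in devices[:-1]:
        n = int(total_iters * throughputs[dev] / total_rate)
        shares[dev] = n
        assigned += n
    # Give any rounding remainder to the last device so no work is lost.
    shares[devices[-1]] = total_iters - assigned
    return shares

# Suppose a first, evenly split pass measured these rates (iters/second):
measured = {"cpu": 2_000.0, "gpu0": 15_000.0, "gpu1": 14_500.0}
shares = adaptive_split(1_000_000, measured)
print(shares)  # → {'cpu': 63492, 'gpu0': 476190, 'gpu1': 460318}
```

Repeating the measure-then-repartition step each pass lets the split track run-time behavior, which is what makes such a scheme usable on any mix of CPUs and GPUs.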
Effective kernel mapping for OpenCL applications in heterogeneous platforms
Many-core accelerators are being deployed in many systems to improve processing capabilities. In such systems, application mapping needs to be enhanced to maximize the utilization of the underlying architecture. Mapping becomes especially critical for multi-kernel applications on GPUs, as kernels may exhibit different characteristics. While some kernels run faster on the GPU, others may be better left on the CPU due to high data-transfer overhead. Thus, heterogeneous execution may yield improved performance compared to executing the application only on the CPU or only on the GPU. In this paper, we propose a novel profiling-based kernel mapping algorithm that assigns each kernel of an application to the proper device to improve the overall performance of the application. We use profiling information for the kernels on the different devices and generate a map that identifies where each kernel should run. Initial experiments show that our approach can effectively map kernels to the CPU and GPU, and outperforms both CPU-only and GPU-only execution. © 2012 IEEE
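The core of such a profiling-based mapping can be illustrated with a small sketch: for each kernel, compare the profiled CPU time against the profiled GPU time plus its data-transfer overhead, and place the kernel on the cheaper device. The kernel names, timings, and dictionary layout below are hypothetical, not taken from the paper.

```python
# Illustrative sketch of profiling-based kernel-to-device mapping,
# assuming per-kernel profiles of compute time and host-device transfer
# overhead. All names and numbers here are invented for the example.

def map_kernels(profiles):
    """For each kernel, pick the device with the lowest total cost;
    the GPU's cost includes the data-transfer overhead."""
    mapping = {}
    for kernel, p in profiles.items():
        cpu_cost = p["cpu_ms"]
        gpu_cost = p["gpu_ms"] + p["transfer_ms"]
        mapping[kernel] = "cpu" if cpu_cost <= gpu_cost else "gpu"
    return mapping

profiles = {
    # Compute-bound kernel: GPU wins even after paying for transfers.
    "matmul": {"cpu_ms": 120.0, "gpu_ms": 8.0, "transfer_ms": 15.0},
    # Cheap kernel: transfer overhead dominates, so it stays on the CPU.
    "reduce": {"cpu_ms": 5.0, "gpu_ms": 1.0, "transfer_ms": 12.0},
}
mapping = map_kernels(profiles)
print(mapping)  # → {'matmul': 'gpu', 'reduce': 'cpu'}
```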
Utilizing the Graphics Processor in Scientific Computing
This thesis examines the graphics processor and its use in scientific computing. The aim is to consider why the graphics processor is a suitable tool for various scientific applications, how it can be applied, and which fields of science can benefit from it. The thesis first reviews the technical and functional characteristics of the graphics processor and outlines the differences between the graphics processor and the central processor, as well as their opportunities for cooperation. It then covers parallel programming, through which the computational power of the graphics processor is harnessed, and finally reviews the use of the graphics processor in scientific applications.
Architecture-Aware Mapping and Optimization on a 1600-Core GPU
The graphics processing unit (GPU) continues to make inroads as a computational accelerator for high-performance computing (HPC). However, despite its increasing popularity, mapping and optimizing GPU code remains a difficult task; it is a multi-dimensional problem that requires deep technical knowledge of GPU architecture. Although substantial literature exists on how to map and optimize GPU performance on the more mature NVIDIA CUDA architecture, far less exists for OpenCL on an AMD GPU, such as the 1600-core AMD Radeon HD 5870. Consequently, we present and evaluate architecture-aware mapping and optimizations for the AMD GPU, the most prominent of which include (i) explicit use of registers, (ii) use of vector types, (iii) removal of branches, and (iv) use of image memory for global data. We demonstrate the efficacy of our AMD GPU mapping and optimizations by applying each in isolation as well as in concert to a large-scale molecular modeling application called GEM. Via these AMD-specific GPU optimizations, our optimized OpenCL implementation on an AMD Radeon HD 5870 delivers more than a fourfold improvement in performance over the basic OpenCL implementation. In addition, it outperforms our optimized CUDA version on an NVIDIA GTX280 by 12%. Overall, we achieve a speedup of 371-fold over a serial but hand-tuned SSE version of our molecular modeling application, and in turn, a 46-fold speedup over ideal scaling on an 8-core CPU. Keywords: GPU; AMD; OpenCL; NVIDIA; CUDA; performance evaluation
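Of the optimizations listed, branch removal is the easiest to show in miniature: a data-dependent if/else is replaced by an arithmetic select, as one would with `select()` or min/max built-ins in OpenCL C, so that all work-items in a wavefront execute the same instructions. The pure-Python stand-in below only illustrates the equivalence of the two forms; it is not the paper's code.

```python
# Sketch of branch removal (predication): rewrite a branchy clamp as a
# branch-free arithmetic select. On GPUs, min/max map to select-style
# instructions and avoid wavefront divergence. Illustrative only.

def clamp_branchy(x, lo, hi):
    # Divergent form: control flow depends on the data.
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def clamp_branchless(x, lo, hi):
    # Branch-free form: same result, uniform control flow.
    return max(lo, min(x, hi))

# The two forms agree on values below, inside, and above the range.
for x in [-3.0, 0.5, 7.0]:
    assert clamp_branchy(x, 0.0, 1.0) == clamp_branchless(x, 0.0, 1.0)
```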
Scalable Parallel Optimization of Digital Signal Processing in the Fourier Domain
The aim of the research presented in this thesis is to study different approaches to the parallel optimization of digital signal processing algorithms and optical coherence tomography methods. The parallel approaches are based on multithreading for multi-core and many-core architectures. The thesis follows the process of designing and implementing the parallel algorithms and programs and their integration into optical coherence tomography systems. Evaluations of the performance and the scalability of the proposed parallel solutions are presented. The digital signal processing considered in this thesis is divided into two groups. The first includes generally employed algorithms operating on digital signals in the Fourier domain, such as the forward and inverse Fourier transforms, cross-correlation, and convolution. The second group involves optical coherence tomography methods, which incorporate the aforementioned algorithms. These methods are used to generate cross-sectional, en-face, and confocal images. Identifying the optimal parallel approaches to these methods allows improvements in the generated imagery in terms of performance and content. The proposed parallel accelerations lead to the generation of comprehensive imagery in real time. Providing detailed visual information in real time improves the utilization of optical coherence tomography systems, especially in areas such as ophthalmology.
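The Fourier-domain algorithms the abstract groups together rest on the convolution theorem: circular convolution in the time domain is an element-wise product of spectra. A compact, self-contained sketch (a textbook radix-2 FFT for power-of-two lengths, not the thesis's implementation) verifies this against a direct O(n²) computation:

```python
# Sketch of the Fourier-domain trick underlying FFT-based convolution
# and cross-correlation: convolve by multiplying spectra. Recursive
# radix-2 FFT for power-of-two lengths; illustrative, not thesis code.
import cmath

def fft(a, inverse=False):
    n = len(a)
    if n == 1:
        return a[:]
    sign = 1 if inverse else -1
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def circular_convolve(x, y):
    """Circular convolution via the convolution theorem."""
    n = len(x)
    prod = [a * b for a, b in zip(fft(x), fft(y))]
    return [v / n for v in fft(prod, inverse=True)]  # scale inverse by 1/n

x = [1.0 + 0j, 2.0 + 0j, 3.0 + 0j, 4.0 + 0j]
y = [1.0 + 0j, 0j, 0j, 1.0 + 0j]
# Direct O(n^2) circular convolution for comparison.
direct = [sum(x[m] * y[(k - m) % len(x)] for m in range(len(x)))
          for k in range(len(x))]
fast = circular_convolve(x, y)
assert all(abs(f - d) < 1e-9 for f, d in zip(fast, direct))
```

The same structure (transform, element-wise work, inverse transform) also parallelizes naturally, which is why it suits the multi-core and many-core acceleration the thesis studies.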
New Techniques for On-line Testing and Fault Mitigation in GPUs
The abstract is provided in the attachment.