7 research outputs found

    CoreTSAR: Task Scheduling for Accelerator-aware Runtimes

    Heterogeneous supercomputers that incorporate computational accelerators such as GPUs are increasingly popular due to their high peak performance, energy efficiency and comparatively low cost. Unfortunately, the programming models and frameworks designed to extract performance from all computational units still lack the flexibility of their CPU-only counterparts. Accelerated OpenMP improves this situation by supporting natural migration of OpenMP code from CPUs to a GPU. However, these implementations currently lose one of OpenMP's best features, its flexibility: typical OpenMP applications can run on any number of CPUs, whereas GPU implementations do not transparently employ multiple GPUs on a node or a mix of GPUs and CPUs. To address these shortcomings, we present CoreTSAR, our runtime library for dynamically scheduling tasks across heterogeneous resources, and propose straightforward extensions that incorporate this functionality into Accelerated OpenMP. We show that our approach can provide nearly linear speedup to four GPUs over only using CPUs or one GPU while increasing the overall flexibility of Accelerated OpenMP.

    Effective kernel mapping for OpenCL applications in heterogeneous platforms

    Many-core accelerators are being deployed in many systems to improve processing capabilities. In such systems, application mapping needs to be enhanced to maximize the utilization of the underlying architecture. Mapping becomes especially critical on GPUs for multi-kernel applications, as kernels may exhibit different characteristics. While some kernels run faster on the GPU, others may prefer to stay on the CPU due to the high data transfer overhead. Thus, heterogeneous execution may yield improved performance compared to executing the application only on the CPU or only on the GPU. In this paper, we propose a novel profiling-based kernel mapping algorithm that assigns each kernel of an application to the proper device to improve the application's overall performance. We use profiling information of kernels on different devices and generate a map that identifies which kernel should run where. Initial experiments show that our approach can effectively map kernels on CPU and GPU, and outperforms CPU-only and GPU-only approaches. © 2012 IEEE
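
    The idea of a profiling-based kernel map can be illustrated with a minimal sketch. This is not the algorithm from the paper, whose details the abstract does not give; it assumes a simple greedy rule that sends each kernel to the device with the lowest profiled cost, where the cost already includes data transfer time. The function name `map_kernels`, the kernel names and all timings are hypothetical.

```python
from typing import Dict


def map_kernels(profile: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Assign each kernel to the device with the lowest profiled cost.

    `profile` maps kernel name -> {device name -> measured time in ms}.
    The measured time is assumed to include data transfer overhead.
    """
    mapping = {}
    for kernel, costs in profile.items():
        # Pick the device whose profiled time is smallest for this kernel.
        mapping[kernel] = min(costs, key=costs.get)
    return mapping


# Hypothetical profiling results (ms), invented for illustration only.
profile = {
    "fft": {"cpu": 12.0, "gpu": 3.5},
    "reduce": {"cpu": 1.2, "gpu": 4.0},   # transfer cost dominates on the GPU
    "stencil": {"cpu": 20.0, "gpu": 6.0},
}

mapping = map_kernels(profile)
for kernel, device in mapping.items():
    print(f"{kernel} -> {device}")
```

    A per-kernel greedy choice ignores dependencies between kernels, for example data that must move between devices when consecutive kernels are mapped differently, so a real mapper would likely need to account for that.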

    Utilizing the Graphics Processing Unit in Scientific Computing (Grafiikkaprosessorin hyödyntäminen tieteellisessä laskennassa)

    This thesis examines the graphics processing unit (GPU) and its use in scientific computing. The aim is to examine why the GPU is a suitable tool for various scientific applications, how it can be applied, and which applications in different fields of science benefit from it. The thesis first reviews the technical and functional characteristics of the GPU and outlines the differences between the GPU and the central processing unit (CPU) and how the two can work together. It then covers parallel programming, through which the computing power of the GPU is harnessed. Finally, it covers the use of the GPU in scientific applications.

    Architecture-Aware Mapping and Optimization on a 1600-Core GPU

    The graphics processing unit (GPU) continues to make inroads as a computational accelerator for high-performance computing (HPC). However, despite its increasing popularity, mapping and optimizing GPU code remains a difficult task; it is a multi-dimensional problem that requires deep technical knowledge of GPU architecture. Although substantial literature exists on how to map and optimize GPU performance on the more mature NVIDIA CUDA architecture, the converse is true for OpenCL on an AMD GPU, such as the 1600-core AMD Radeon HD 5870 GPU. Consequently, we present and evaluate architecture-aware mapping and optimizations for the AMD GPU. The most prominent include (i) explicit use of registers, (ii) use of vector types, (iii) removal of branches, and (iv) use of image memory for global data. We demonstrate the efficacy of our AMD GPU mapping and optimizations by applying each in isolation as well as in concert to a large-scale molecular modeling application called GEM. Via these AMD-specific GPU optimizations, our optimized OpenCL implementation on an AMD Radeon HD 5870 delivers more than a fourfold improvement in performance over the basic OpenCL implementation. In addition, it outperforms our optimized CUDA version on an NVIDIA GTX280 by 12%. Overall, we achieve a speedup of 371-fold over a serial but hand-tuned SSE version of our molecular modeling application, and in turn, a 46-fold speedup over an ideal scaling on an 8-core CPU. Keywords: GPU; AMD; OpenCL; NVIDIA; CUDA; performance evaluation
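
    The two overall speedup figures in this abstract are consistent with each other if "ideal scaling" is read as perfect linear scaling of the serial version across all 8 cores; that reading is an assumption, and the variable names below are illustrative. A short check of the arithmetic:

```python
# Figures quoted in the abstract.
speedup_vs_serial = 371   # GPU vs. serial hand-tuned SSE version
cpu_cores = 8             # cores in the CPU used for the ideal-scaling baseline

# Assumption: ideal scaling means the 8-core CPU is exactly 8x the serial version.
speedup_vs_ideal_cpu = speedup_vs_serial / cpu_cores

print(round(speedup_vs_ideal_cpu, 1))  # 46.4, in line with the reported 46-fold
```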

    Scalable Parallel Optimization of Digital Signal Processing in the Fourier Domain

    The aim of the research presented in this thesis is to study different approaches to the parallel optimization of digital signal processing algorithms and optical coherence tomography methods. The parallel approaches are based on multithreading for multi-core and many-core architectures. The thesis follows the process of designing and implementing the parallel algorithms and programs and their integration into optical coherence tomography systems. Evaluations of the performance and the scalability of the proposed parallel solutions are presented. The digital signal processing considered in this thesis is divided into two groups. The first includes commonly used algorithms that operate on digital signals in the Fourier domain, such as the forward and inverse Fourier transforms, cross-correlation and convolution. The second group involves optical coherence tomography methods, which incorporate the aforementioned algorithms. These methods are used to generate cross-sectional, en-face and confocal images. Identifying the optimal parallel approaches to these methods allows improvements in the generated imagery in terms of performance and content. The proposed parallel accelerations lead to the generation of comprehensive imagery in real time. Providing detailed visual information in real time improves the utilization of optical coherence tomography systems, especially in areas such as ophthalmology.
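
    As an illustration of the first group of algorithms, convolution in the Fourier domain follows from the convolution theorem: transform both signals, multiply them pointwise, and transform back. The sketch below is serial, pure-Python code written for this listing and is not taken from the thesis; the function names `fft` and `fft_convolve` are illustrative, and the recursive radix-2 transform assumes a power-of-two input length.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two.

    The inverse transform is unnormalized (the caller divides by n).
    """
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft_convolve(a, b):
    """Linear convolution of two real sequences via the convolution theorem."""
    size = len(a) + len(b) - 1
    n = 1
    while n < size:          # zero-pad to the next power of two
        n *= 2
    fa = fft([complex(v) for v in a] + [0j] * (n - len(a)))
    fb = fft([complex(v) for v in b] + [0j] * (n - len(b)))
    product = [p * q for p, q in zip(fa, fb)]
    inv = fft(product, inverse=True)
    return [v.real / n for v in inv[:size]]


result = fft_convolve([1, 2, 3], [0, 1, 0.5])
print([round(v, 6) for v in result])
```

    Cross-correlation can be computed the same way by conjugating one of the two spectra before the pointwise product. The independent butterflies and pointwise products are the parts that lend themselves to the multithreaded execution the thesis studies.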

    New Techniques for On-line Testing and Fault Mitigation in GPUs

    The abstract is in the attachment.