15 research outputs found

An Evaluation of Emerging Many-Core Parallel Programming Models

Author: Boulton Michael
Gaudin Wayne
Martineau Matt J
McIntosh-Smith Simon N
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 12/03/2016

Crossref

Explore Bristol Research

A Fast Runtime Visualization of a GPU-Based 3D-FDTD Electromagnetic Simulation

Author: Aoki Kota
Dohi Keisuke
Fujimoto Takafumi
Oguri Kiyoshi
Shibata Yuichiro
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2013

In this paper, we present the design and implementation of a fast runtime visualizer for a GPU-based 3D-FDTD electromagnetic simulation. We focus on improving the productivity of simulator development without compromising simulation performance. To preserve portability, we implemented the visualizer with the MVC model, in which the simulation kernels and the visualization process are completely separated. For high-speed visualization, an interoperability mechanism between OpenGL and CUDA was used in addition to efficient utilization of programmable shaders. We also propose asynchronous multi-threaded execution with a triple-buffering technique so that developers can concentrate on developing their simulation kernels. Empirical visualization experiments on electromagnetic simulations for practical antenna design showed that our implementation achieved a rendering throughput of 90 FPS for a viewport of 512 x 512 pixels, which corresponds to a 12.9 times speedup over the case where the OpenGL-CUDA interoperability mechanism was not used. When a standard visualization throughput of 60 FPS was selected, the performance overhead imposed by the visualization process was 15.8%, which was reasonably low compared to the speedup of the simulation kernel gained by GPU acceleration.

Crossref

Nagasaki University's Academic Output SITE: NAOSITE

Institutional Repositories DataBase (IRDB)

Achieving performance portability for a heat conduction solver mini-application on modern multi-core systems

Author: Jarvis Stephen A.
Kirk Richard O.
Martineau Matt J.
Mudalige Gihan R.
Reguly Istvan Z.
Wright Steven A.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017

Modernizing production-grade, often legacy applications to take advantage of modern multi-core and many-core architectures can be a difficult and costly undertaking. This is especially true currently, as it is unclear which architectures will dominate future systems. The complexity of these codes can mean that parallelisation for a given architecture requires significant re-engineering. One way to assess the benefit of such an exercise would be to use mini-applications that are representative of the legacy programs. In this paper, we investigate different implementations of TeaLeaf, a mini-application from the Mantevo suite that solves the linear heat conduction equation. TeaLeaf has been ported to use many parallel programming models, including OpenMP, CUDA and MPI among others. It has also been re-engineered to use the OPS embedded DSL and the template libraries Kokkos and RAJA. We use these different implementations to assess the performance portability of each technique on modern multi-core systems. While manually parallelising the application, targeting and optimizing for each platform, gives the best performance, this has the obvious disadvantage that it requires the creation of different versions for each and every platform of interest. Frameworks such as OPS, Kokkos and RAJA can automatically produce executables of the program that achieve comparable portability. Based on a recently developed performance portability metric, our results show that OPS and RAJA achieve an application performance portability score of 71% and 77% respectively for this application.

Crossref

University of Birmingham Research Portal

Warwick Research Archives Portal Repository

Repository of the Academy's Library

White Rose Research Online

Towards Chapel-based Exascale Tree Search Algorithms: dealing with multiple GPU accelerators

Author: Carneiro Tiago
Hayashi Akihiro
Melab Nouredine
Sarkar Vivek
Publication venue: HAL CCSD
Publication date: 22/03/2021

Tree-based search algorithms applied to combinatorial optimization problems are highly irregular and time consuming when it comes to solving big instances. Solving such instances efficiently requires the use of massively parallel distributed-memory supercomputers. According to recent Top 500 trends, the degree of parallelism in these supercomputers continues to increase in size and complexity, with millions of heterogeneous (mainly CPU-GPU) cores. Harnessing this scale of computing resources raises at least three challenging issues, which are described and addressed in this paper. As a step towards exascale computing, we revisit the design and implementation of tree search algorithms dealing with multiple GPUs, in addition to scalability and productivity-awareness using Chapel. The proposed algorithm exploits Chapel's distributed iterators by combining a partial search strategy with pre-compiled CUDA kernels for more efficient exploitation of the intra-node parallelism. Extensive experimentation on big N-Queens problem instances using 24 GPUs shows that up to 90% of the linear speedup can be achieved.

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Multi-GPU support on the marrow algorithmic skeleton framework

Author: Alexandre Fernando Jorge Marques
Publication venue: Faculdade de Ciências e Tecnologia
Publication date: 01/01/2013

Dissertation submitted for the degree of Master in Computer Engineering. With the proliferation of general purpose GPUs, workload parallelization and data transfer optimization have become an increasing concern. The natural evolution from using a single GPU is to multiply the number of available processors, which presents new challenges, such as tuning the workload decompositions and load balancing when dealing with heterogeneous systems. Higher-level programming is a very important asset in a multi-GPU environment, due to the complexity inherent to the currently used GPGPU APIs (OpenCL and CUDA), which stems from their low-level nature and code overhead. This can be obtained by introducing an abstraction layer, which has the advantage of enabling implicit optimizations and orchestrations, such as a transparent load balancing mechanism, and of reducing explicit code overhead. Algorithmic Skeletons, previously used in cluster environments, have recently been adapted to the GPGPU context. Skeletons abstract most sources of code overhead by defining computation patterns of commonly used algorithms. The Marrow algorithmic skeleton library is one of these, taking advantage of the abstractions to automate the orchestration needed for an efficient GPU execution. This thesis proposes the extension of Marrow to leverage the use of algorithmic skeletons in the modular and efficient programming of multiple heterogeneous GPUs within a single machine. We were able to achieve a good balance between simplicity of the programming model and performance, obtaining good scalability when using multiple GPUs, with an efficient load distribution, although at the price of some overhead when using a single GPU. Projects PTDC/EIA-EIA/102579/2008 and PTDC/EIA-EIA/111518/200

Repositório da Universidade Nova de Lisboa

On the Porting and Optimisation of Physics Simulations for Heterogeneous Parallel Processors

Author: Martineau Matt J
Publication date: 25/06/2019

Explore Bristol Research

Data-centric Performance Measurement and Mapping for Highly Parallel Programming Models

Author: Zhang Hui
Publication date: 01/01/2018

Modern supercomputers have complex features: many hardware threads, deep memory hierarchies, and many co-processors/accelerators. Productively and effectively designing programs to utilize those hardware features is crucial to gaining the best performance. There are several highly parallel programming models in active development that allow programmers to write efficient code on those architectures. Performance profiling is a very important technique during development for achieving the best performance. In this dissertation, I propose a new performance measurement and mapping technique that can associate performance data with program variables instead of code blocks. To validate the applicability of my data-centric profiling idea, I designed and implemented profilers for PGAS and CUDA. For PGAS, I developed ChplBlamer, for both single-node and multi-node Chapel programs. My tool also provides new features such as data-centric inter-node load imbalance identification. For CUDA, I developed CUDABlamer for GPU-accelerated applications. CUDABlamer also attributes performance data to program variables, which is a feature that was not found in any previous CUDA profilers. Directed by the insights from the tools, I optimized several widely-studied benchmarks and significantly improved program performance by a factor of up to 4x for Chapel and 47x for CUDA kernels.

Digital Repository at the University of Maryland

Enhancing Monte Carlo Particle Transport for Modern Many-Core Architectures

Author: Bleile Ryan
Publication venue: University of Oregon
Publication date: 29/04/2021

Since near the very beginning of electronic computing, Monte Carlo particle transport has been a fundamental approach for solving computational physics problems. Due to the high computational demands and inherently parallel nature of these applications, Monte Carlo transport applications are often run in a supercomputing environment. That said, supercomputers are changing: parallelism has dramatically increased within each supercomputer node, including the regular inclusion of many-core devices. Monte Carlo transport applications, like all applications that run on supercomputers, will be forced to make significant changes to their designs in order to utilize these new architectures effectively. This dissertation presents solutions for central challenges that face Monte Carlo particle transport in this changing environment, specifically in the areas of threading models, tracking algorithms, tally data collection, and heterogeneous load balancing. The dissertation culminates with a study that combines all of the presented techniques in a production application at scale on Lawrence Livermore National Laboratory's RZAnsel supercomputer.

University of Oregon Scholars' Bank