A framework for efficient execution of data parallel irregular applications on heterogeneous systems
Exploiting the computing power of the diverse resources available on heterogeneous
systems is mandatory, but it is a very challenging task. The diversity of architectures, execution
models and programming tools, together with disjoint address spaces and different
computing capabilities, raises a number of challenges that severely impact application
performance and programming productivity. This problem is further compounded in the
presence of data parallel irregular applications.
This paper presents a framework that addresses the development and execution of data
parallel irregular applications on heterogeneous systems. A unified task-based programming
and execution model is proposed, together with inter- and intra-device scheduling,
which, coupled with a data management system, aim to achieve performance scalability
across multiple devices while maintaining high programming productivity. Intra-device
scheduling on wide SIMD/SIMT architectures resorts to consumer-producer kernels,
which, by allowing dynamic generation and rescheduling of new work units, enable
balancing of irregular workloads and increase resource utilization.
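The consumer-producer idea can be illustrated with a small CPU-side sketch, with plain Python threads standing in for SIMT work-groups. All names here are illustrative, not the framework's actual API: each worker consumes basic work units from a shared queue and re-enqueues any units it dynamically generates, so irregular expansion of the workload stays balanced across workers.

```python
# Hedged sketch of consumer-producer scheduling with dynamic work generation.
import queue
import threading

def process(unit):
    # Hypothetical irregular task: a unit may spawn child units until depth 0.
    value, depth = unit
    children = [(value + 1, depth - 1), (value + 2, depth - 1)] if depth > 0 else []
    return value, children

def run(seed_units, n_workers=4):
    work = queue.Queue()
    results = []
    lock = threading.Lock()
    for u in seed_units:
        work.put(u)

    def worker():
        while True:
            unit = work.get()
            value, children = process(unit)
            with lock:
                results.append(value)
            for child in children:
                work.put(child)  # dynamically generated work is rescheduled
            work.task_done()

    for _ in range(n_workers):
        threading.Thread(target=worker, daemon=True).start()
    work.join()  # returns once all units, including generated ones, are processed
    return results

print(len(run([(0, 3)])))  # a depth-3 seed expands into 15 processed units
```

`queue.Queue.join()` handles the termination problem that dynamic generation creates: the pool only stops once every enqueued unit, original or generated, has been marked done.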
Results show that regular and irregular applications scale well with the number of
devices, while requiring minimal programming effort. Consumer-producer kernels are
able to sustain significant performance gains as long as the workload per basic work
unit is enough to compensate for the overheads associated with intra-device scheduling. When
this is not the case, consumer kernels can still be used for the irregular application.
Comparisons with an alternative framework, StarPU, which targets regular workloads,
consistently demonstrate significant speedups. This is, to the best of our knowledge, the
first published integrated approach that successfully handles irregular workloads over
heterogeneous systems.

This work is funded by National Funds through the FCT - Fundação para a Ciência
e a Tecnologia (Portuguese Foundation for Science and Technology) and by ERDF -
European Regional Development Fund through the COMPETE Programme (operational
programme for competitiveness) within projects PEst-OE/EEI/UI0752/2014
and FCOMP-01-0124-FEDER-010067. It is also funded by the School of Engineering, Universidade
do Minho, within project P2SHOCS - Performance Portability on Scalable
Heterogeneous Computing Systems.
Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips
The trend in industry is towards heterogeneous multicore processors (HMCs),
including chips with CPUs and massively-threaded throughput-oriented processors
(MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the
cores with cache-coherent shared virtual memory (CCSVM), this is not the
communication paradigm used by any current HMC. In this paper, we present a
CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads
programming model, called xthreads, for programming this HMC. Our goal is to
evaluate the potential performance benefits of tightly coupling heterogeneous
cores with CCSVM.
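What CCSVM buys the programmer can be sketched by analogy. In Python, threads (unlike processes) already share one address space, so a latency-oriented worker and a throughput-oriented worker can communicate through the same buffer with no explicit transfer. This is only an analogy for the paradigm described above, not the xthreads API:

```python
# Analogy for shared virtual memory: all workers see one address space.
import threading

data = [0] * 8            # one shared buffer, visible to every "core"
lock = threading.Lock()

def cpu_like(i):
    # A latency-oriented worker updates the buffer in place...
    with lock:
        data[i] += 1

def mttop_like():
    # ...and a throughput-oriented worker reads the same addresses,
    # with no copy or message passing in between.
    with lock:
        return sum(data)

threads = [threading.Thread(target=cpu_like, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(mttop_like())  # 8: all updates visible without any explicit transfer
```

Current discrete GPUs, by contrast, would require an explicit copy between the `cpu_like` and `mttop_like` steps, which is exactly the coupling cost the paper's CCSVM design removes.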
A fast GPU Monte Carlo Radiative Heat Transfer Implementation for Coupling with Direct Numerical Simulation
We implemented a fast Reciprocal Monte Carlo algorithm to accurately solve
radiative heat transfer in turbulent flows of non-grey participating media that
can be coupled to fully resolved turbulent flows, namely to Direct Numerical
Simulation (DNS). The spectrally varying absorption coefficient is treated in a
narrow-band fashion with a correlated-k distribution. The implementation is
verified with analytical solutions and validated with results from literature
and line-by-line Monte Carlo computations. The method is implemented on the GPU
with thorough attention to memory transfers and computational efficiency. The
bottlenecks that dominate the computational expense are addressed and several
techniques are proposed to optimize the GPU execution. By implementing the
proposed algorithmic accelerations, a speed-up of up to three orders of magnitude
can be achieved, while maintaining the same accuracy.
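The core Monte Carlo step can be illustrated with a deliberately simplified toy: a grey, homogeneous slab, where photon free paths are sampled from the exponential attenuation law and the estimated transmissivity should converge to the Beer-Lambert value exp(-kappa*L). This is only a sketch of the sampling principle, not the paper's reciprocal, narrow-band correlated-k algorithm:

```python
# Toy Monte Carlo estimate of slab transmissivity in a grey medium.
import math
import random

def mc_transmissivity(kappa, L, n_samples=100_000, seed=1):
    rng = random.Random(seed)
    transmitted = 0
    for _ in range(n_samples):
        # Sample a free path from the exponential law; 1 - u keeps the
        # argument of log strictly positive.
        s = -math.log(1.0 - rng.random()) / kappa
        if s > L:  # the photon bundle crosses the slab unabsorbed
            transmitted += 1
    return transmitted / n_samples

est = mc_transmissivity(kappa=2.0, L=0.5)
print(est, math.exp(-2.0 * 0.5))  # estimate vs. analytical exp(-kappa*L)
```

Verifying against such analytical limits is the same kind of check the authors describe before moving to validation against line-by-line Monte Carlo results.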
Field-Configurable GPU
This dissertation aims to develop a dedicated processing architecture for the acceleration of specific applications, inspired by the structure of GPU-style processing units. The processing unit must be programmable and configurable to the requirements of specific applications, adapted to the types and amount of logic resources available on a selected FPGA device. The accelerator is intended to take maximum advantage of the resources available on a given FPGA device (memory, arithmetic units, logic resources) in order to maximize the performance of selected applications. Target applications in the domains of image processing and machine learning will be considered. Once a base architecture is selected, specialization for an application (or class of applications) will be based on the configuration of three fundamental components: the organization of the distributed memory system (built with the FPGA's internal RAM blocks), the organization of the arithmetic processing units (which may be heterogeneous), and the width of the datapaths. The system is to be designed at the RTL level, in Verilog, and must include an automated process to customize the accelerator from a set of specifications defined according to the characteristics of the target application. This customization process may be based on the definition of Verilog parameters, or may rely on dedicated applications, to be developed, that generate Verilog code directly. An elementary set of support tools must also be developed, namely for generating the code to be executed by the processor. As a final validation, the accelerator is to be integrated and demonstrated in a real-time image processing system.
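The second customization route mentioned above, a dedicated application that emits Verilog directly, could look like the following sketch. The module name, parameter names, and the spec fields are all hypothetical, chosen only to mirror the three configurable components (memory banks, processing units, datapath width):

```python
# Hypothetical generator: emit a parameterized Verilog skeleton from a spec.
def emit_accelerator(spec):
    # spec maps the three configuration axes to concrete values.
    return """module accel #(
    parameter DATA_WIDTH = {data_width},
    parameter NUM_PES    = {num_pes},
    parameter MEM_BANKS  = {mem_banks}
) (
    input  wire clk,
    input  wire rst
    // application-specific ports would be generated here
);
endmodule
""".format(**spec)

print(emit_accelerator({"data_width": 16, "num_pes": 8, "mem_banks": 4}))
```

The pure-parameter route would instead fix these values at instantiation time in hand-written Verilog; a generator becomes preferable once the spec also changes the structure (ports, number of units) rather than only constants.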
Strong scaling of general-purpose molecular dynamics simulations on GPUs
We describe a highly optimized implementation of MPI domain decomposition in
a GPU-enabled, general-purpose molecular dynamics code, HOOMD-blue (Anderson
and Glotzer, arXiv:1308.5587). Our approach is inspired by a traditional
CPU-based code, LAMMPS (Plimpton, J. Comp. Phys. 117, 1995), but is implemented
within a code that was designed for execution on GPUs from the start (Anderson
et al., J. Comp. Phys. 227, 2008). The software supports short-ranged pair
force and bond force fields and achieves optimal GPU performance using an
autotuning algorithm. We are able to demonstrate equivalent or superior scaling
on up to 3,375 GPUs in Lennard-Jones and dissipative particle dynamics (DPD)
simulations of up to 108 million particles. GPUDirect RDMA capabilities in
recent GPU generations provide better performance in full double precision
calculations. For a representative polymer physics application, HOOMD-blue 1.0
provides an effective GPU vs. CPU node speed-up of 12.5x.
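The assignment at the heart of spatial domain decomposition can be sketched as follows: particles in a periodic box map to ranks laid out on a regular 3D process grid, so each rank owns one spatial subdomain. This is an illustrative sketch of the general technique, not HOOMD-blue's implementation:

```python
# Map a particle position to the MPI rank owning its subdomain.
def owning_rank(pos, box, grid):
    # Index of the subdomain along each axis; the modulo applies the
    # periodic boundary so positions on the box edge wrap around.
    ix = [int(pos[d] / box[d] * grid[d]) % grid[d] for d in range(3)]
    # Row-major rank numbering over the (nx, ny, nz) process grid.
    return (ix[0] * grid[1] + ix[1]) * grid[2] + ix[2]

box = (10.0, 10.0, 10.0)
grid = (2, 2, 2)  # a 2x2x2 decomposition across 8 ranks
print(owning_rank((1.0, 6.0, 9.0), box, grid))  # 3
```

After such an assignment, each rank integrates only its own particles and exchanges boundary ("ghost") particles with the neighboring ranks each step, which is where GPUDirect RDMA reduces communication cost.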