A Massive Data Parallel Computational Framework for Petascale/Exascale Hybrid Computer Systems
Heterogeneous systems are becoming more common on High Performance Computing
(HPC) systems. Even using tools like CUDA and OpenCL it is a non-trivial task
to obtain optimal performance on the GPU. Approaches to simplifying this task
include Merge (a library based framework for heterogeneous multi-core systems),
Zippy (a framework for parallel execution of codes on multiple GPUs), BSGP (a
new programming language for general purpose computation on the GPU) and
CUDA-lite (an enhancement to CUDA that transforms code based on annotations).
In addition, efforts are underway to improve compiler tools for automatic
parallelization and optimization of affine loop nests for GPUs and for
automatic translation of OpenMP parallelized codes to CUDA.
In this paper we present an alternative approach: a new computational
framework for the development of massively data parallel scientific
applications suitable for use on such petascale/exascale hybrid systems, built
upon the highly scalable Cactus framework. As the first non-trivial
demonstration of its usefulness, we successfully developed a new 3D CFD code
that achieves improved performance.
Comment: Parallel Computing 2011 (ParCo2011), 30 August -- 2 September 2011, Ghent, Belgium
Parallelization of the ADI method exploring vector computing in GPUs
Master's dissertation (Mestrado Integrado em Engenharia Informática)
The 2D convection-diffusion equation is a well-known problem in scientific simulation that often uses
a direct method to solve a system of N linear equations, which requires N³ operations.
This problem can be solved using a more efficient computational method, known as the
alternating direction implicit (ADI) method. It solves 2N systems of N linear equations each,
implemented in two sweeps: one solving row by row, the other column by column. The N systems
in each sweep are fully independent, which opens an opportunity for an embarrassingly parallel
solution. By splitting each iteration in two, the method also exploits the way matrices are
stored in computer memory, either row-major or column-major.
The major bottleneck of this method is solving these systems of linear equations. They can be
described as tridiagonal matrices, since the nonzero elements always lie on the three central
diagonals. Algorithms tailored for tridiagonal matrices can significantly improve performance.
These can be sequential (e.g. the Thomas algorithm) or parallel (e.g. cyclic reduction (CR) and
parallel cyclic reduction (PCR)).
Current vector extensions in conventional scalar processing units, such as x86-64 and
ARM devices, require the vector elements to be in contiguous memory locations to avoid
performance penalties. To overcome these limitations in dot products, several approaches
are proposed and evaluated in this work, both on general-purpose processing units and on
specific accelerators, namely NVIDIA GPUs.
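The contiguity issue can be illustrated with a flat row-major array (a generic sketch, not the dissertation's code): row elements sit at stride 1, while column elements sit N apart and must be packed into adjacent locations before a vector unit can load them without penalty:

```python
# A 4x4 matrix stored flat in row-major order, as in C.
N = 4
flat = list(range(N * N))

def row(i):
    # Row elements are adjacent in memory: stride 1 (vector-friendly).
    return flat[i * N:(i + 1) * N]

def col(j):
    # Column elements are N apart in memory: stride N (strided access).
    return flat[j::N]

# A vector unit that needs contiguous data must first "pack" the
# strided column into adjacent locations: an extra copy.
col_packed = list(col(0))

print(row(0))   # [0, 1, 2, 3]
print(col(0))   # [0, 4, 8, 12]
```

This is why ADI's row sweep and column sweep cannot both enjoy unit-stride access under a single storage order, and why the method splits each iteration in two.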
Profiling the code execution on a server based on x86-64 devices showed that the ADI
method needs a combination of CPU compute power and memory transfer speed. This is
best shown on a server based on Intel's many-core KNL device, where the algorithm
scales until the memory bandwidth is no longer enough to feed all 64 computing cores. A
dual-socket server based on 16-core Xeon Skylake CPUs, with AVX-512 vector support,
proved to be a better choice: the algorithm executes in less time and scales better.
The introduction of GPU computing to further improve execution performance (together
with other optimisation techniques, namely a different thread scheme and the use of
shared memory) showed better results for larger grid sizes (above 32Ki x 32Ki). The CUDA
development environment also performed better than OpenCL in most cases. The largest
difference was with the hybrid CR-PCR, where the OpenCL code showed a major performance
improvement over CUDA. Even with this speedup, the best average time for the ADI method
across all tested configurations on an NVIDIA GPU was obtained using CUDA on the most
recent GPU available (with a Pascal architecture) and
the CR as the auxiliary method.
An MPI-CUDA Implementation for Massively Parallel Incompressible Flow Computations on Multi-GPU Clusters
Modern graphics processing units (GPUs) with many-core architectures have emerged as general-purpose parallel computing platforms that can accelerate simulation science applications tremendously. While multi-GPU workstations with several TeraFLOPS of peak computing power are available to accelerate computational problems, larger problems require even more resources. Conventional clusters of central processing units (CPU) are now being augmented with multiple GPUs in each compute-node to tackle large problems. The heterogeneous architecture of a multi-GPU cluster with a deep memory hierarchy creates unique challenges in developing scalable and efficient simulation codes. In this study, we pursue mixed MPI-CUDA implementations and investigate three strategies to probe the efficiency and scalability of incompressible flow computations on the Lincoln Tesla cluster at the National Center for Supercomputing Applications (NCSA). We exploit some of the advanced features of MPI and CUDA programming to overlap both GPU data transfer and MPI communications with computations on the GPU. We sustain approximately 2.4 TeraFLOPS on the 64 nodes of the NCSA Lincoln Tesla cluster using 128 GPUs with a total of 30,720 processing elements. Our results demonstrate that multi-GPU clusters can substantially accelerate computational fluid dynamics (CFD) simulations.
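The overlap strategy this abstract describes can be sketched generically. The functions below are pure-Python stand-ins (assumptions, not the paper's code) for what would be CUDA kernel launches and MPI halo exchanges: the boundary exchange runs concurrently while the interior of the domain is computed:

```python
import threading

# Stand-ins for the real operations: in an MPI-CUDA code these would be a
# GPU kernel on interior cells and an MPI neighbour exchange of halo cells.
def compute_interior(grid):
    return [2 * v for v in grid[1:-1]]   # "kernel" on interior cells

def exchange_halos(grid):
    return grid[0] + 1, grid[-1] + 1     # "MPI" neighbour exchange

def step_overlapped(grid):
    result = {}
    # Launch the halo exchange on a separate thread...
    t = threading.Thread(
        target=lambda: result.update(halo=exchange_halos(grid)))
    t.start()
    # ...while the interior computation proceeds concurrently.
    interior = compute_interior(grid)
    t.join()
    left, right = result['halo']
    return [left] + interior + [right]

print(step_overlapped([0, 1, 2, 3]))   # [1, 2, 4, 4]
```

The benefit is that communication time is hidden behind computation: as long as the interior update takes longer than the exchange, the exchange costs nothing extra per step.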
General‐purpose computation on GPUs for high performance cloud computing
This is the peer reviewed version of the following article: Expósito, R. R., Taboada, G. L., Ramos, S., Touriño, J., & Doallo, R. (2013). General‐purpose computation on GPUs for high performance cloud computing. Concurrency and Computation: Practice and Experience, 25(12), 1628-1642., which has been published in final form at https://doi.org/10.1002/cpe.2845. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions.[Abstract] Cloud computing is offering new approaches for High Performance Computing (HPC) as it provides dynamically scalable resources as a service over the Internet. In addition, General‐Purpose computation on Graphical Processing Units (GPGPU) has gained much attention from scientific computing in multiple domains, thus becoming an important programming model in HPC. Compute Unified Device Architecture (CUDA) has been established as a popular programming model for GPGPUs, removing the need for using the graphics APIs for computing applications. Open Computing Language (OpenCL) is an emerging alternative not only for GPGPU but also for any parallel architecture. GPU clusters, usually programmed with a hybrid parallel paradigm mixing Message Passing Interface (MPI) with CUDA/OpenCL, are currently gaining high popularity. Therefore, cloud providers are deploying clusters with multiple GPUs per node and high‐speed network interconnects in order to make them a feasible option for HPC as a Service (HPCaaS). This paper evaluates GPGPU for high performance cloud computing on a public cloud computing infrastructure, Amazon EC2 Cluster GPU Instances (CGI), equipped with NVIDIA Tesla GPUs and a 10 Gigabit Ethernet network. 
The analysis of the results, obtained using up to 64 GPUs and 256 processor cores, has shown that GPGPU is a viable option for high performance cloud computing despite the significant impact that virtualized environments still have on network overhead, which hampers the adoption of communication‐intensive GPGPU applications.
Ministerio de Ciencia e Innovación; TIN2010-1673
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS.
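The paper's transformations target C/C++ and OpenCL HLS code; as a language-neutral sketch (rendered in Python purely for illustration, not taken from the paper) of one classic transformation of this kind, accumulation interleaving: a single-accumulator reduction carries a loop dependency that limits pipelining, and splitting it across K partial accumulators breaks that chain so the hardware can start a new addition every cycle:

```python
K = 4  # number of partial accumulators; in hardware this hides adder latency

def sum_naive(xs):
    # One accumulator: each iteration depends on the previous one, so a
    # pipelined datapath cannot issue a new addition until the last finishes.
    acc = 0.0
    for x in xs:
        acc += x
    return acc

def sum_interleaved(xs):
    # K independent partial sums: consecutive iterations update different
    # accumulators, removing the loop-carried dependency.
    part = [0.0] * K
    for i, x in enumerate(xs):
        part[i % K] += x
    return sum(part)

data = [float(i) for i in range(1024)]
print(sum_naive(data) == sum_interleaved(data))  # True
```

Note that the transformed version reassociates floating-point additions, which an HLS compiler will not do on its own; that is precisely why such rewrites are applied by the programmer (or a source-to-source tool) rather than the toolchain.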