
    Using the High Productivity Language Chapel to Target GPGPU Architectures

    It has been widely shown that GPGPU architectures offer large performance gains compared to their traditional CPU counterparts for many applications. The downside to these architectures is that the current programming models present numerous challenges to the programmer: lower-level languages, explicit data movement, loss of portability, and challenges in performance optimization. In this paper, we present novel methods and compiler transformations that increase productivity by enabling users to easily program GPGPU architectures using the high productivity programming language Chapel. Rather than resorting to different parallel libraries or annotations for a given parallel platform, we leverage a language that has been designed from first principles to address the challenge of programming for parallelism and locality. This also has the advantage of being portable across distinct classes of parallel architectures, including desktop multicores, distributed memory clusters, large-scale shared memory, and now CPU-GPU hybrids. We present experimental results from the Parboil benchmark suite which demonstrate that codes written in Chapel achieve performance comparable to the original versions implemented in CUDA.
    Funding: NSF CCF 0702260; Cray Inc. Cray-SRA-2010-01696; 2010-2011 NVIDIA Research Fellowship. Unpublished; not peer reviewed.
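    To make the productivity gap concrete, the sketch below shows the kind of explicit CUDA host and device code (device allocation, data copies, kernel launch) that a single data-parallel Chapel forall statement can replace. It is an illustrative example written for this summary, not code from the paper; the triad kernel, array size, and all identifiers are our assumptions.

        #include <cuda_runtime.h>
        #include <cstdio>
        #include <cstdlib>

        // Explicit CUDA kernel: each thread updates one element.
        __global__ void triad(float *a, const float *b, const float *c,
                              float alpha, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) a[i] = b[i] + alpha * c[i];
        }

        int main() {
            const int n = 1 << 20;
            size_t bytes = n * sizeof(float);
            float *ha = (float *)malloc(bytes);
            float *hb = (float *)malloc(bytes);
            float *hc = (float *)malloc(bytes);
            for (int i = 0; i < n; ++i) { hb[i] = 1.0f; hc[i] = 2.0f; }

            // Explicit device allocation and host-to-device movement,
            // steps the Chapel compiler transformations generate implicitly.
            float *da, *db, *dc;
            cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
            cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);
            cudaMemcpy(dc, hc, bytes, cudaMemcpyHostToDevice);

            // Manual launch geometry, another detail hidden by the language.
            int block = 256;
            int grid = (n + block - 1) / block;
            triad<<<grid, block>>>(da, db, dc, 3.0f, n);

            cudaMemcpy(ha, da, bytes, cudaMemcpyDeviceToHost);
            printf("a[0] = %f\n", ha[0]);  // expect 7.0

            cudaFree(da); cudaFree(db); cudaFree(dc);
            free(ha); free(hb); free(hc);
            return 0;
        }

    In Chapel, the same computation is a single whole-array or forall statement, and the paper's compiler transformations produce the GPU code automatically.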

    Linpack evaluation on a supercomputer with heterogeneous accelerators

    Abstract: We report Linpack benchmark results on the TSUBAME supercomputer, a large-scale heterogeneous system equipped with NVIDIA Tesla GPUs and ClearSpeed SIMD accelerators. Using all of its 10,480 Opteron cores, 640 Xeon cores, 648 ClearSpeed accelerators, and 624 NVIDIA Tesla GPUs, we achieved 87.01 TFlops, the third-best result achieved by a heterogeneous system worldwide. This paper describes the careful tuning and load-balancing methods required to achieve this performance. On the other hand, since the peak speed is 163 TFlops, the efficiency is 53%, which is lower than that of other systems. This paper also analyzes this gap from the perspective of the system architecture.
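    As a toy illustration of one ingredient of such load balancing, the host-side sketch below splits the columns of a Linpack-style matrix among device classes in proportion to their assumed throughput. The classes and GFlop/s figures are hypothetical, not TSUBAME's measured values, and the paper's actual scheme involves far more careful tuning.

        #include <cstdio>

        int main() {
            const int total_cols = 10000;
            const char *cls[3] = { "GPU", "SIMD accelerator", "CPU" };
            double gflops[3]   = { 350.0, 80.0, 50.0 };  // assumed speeds
            double sum = gflops[0] + gflops[1] + gflops[2];

            int assigned = 0;
            for (int d = 0; d < 3; ++d) {
                // Give the remainder to the last class so rounding
                // never loses a column.
                int cols = (d == 2) ? total_cols - assigned
                                    : (int)(total_cols * gflops[d] / sum);
                printf("%-16s gets %5d columns\n", cls[d], cols);
                assigned += cols;
            }
            return 0;
        }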

    Easing parallel programming on heterogeneous systems

    The most common way to solve HPC (High Performance Computing) applications in reasonable execution times and in a scalable fashion is to use parallel computing systems. The current trend in HPC systems is to include several computing devices of different types and architectures in the same execution machine. However, their use imposes specific challenges on the programmer. A programmer must be an expert in the existing tools and abstractions for distributed memory, the programming models for shared-memory systems, and the programming models specific to each type of coprocessor in order to create hybrid programs that can efficiently exploit all the capabilities of the machine. Currently, all these problems must be solved by the programmer, which makes programming a heterogeneous machine a real challenge. This thesis addresses several of the main problems related to parallel programming of highly heterogeneous and distributed systems. It makes proposals that solve problems ranging from the creation of code that is portable across different types of devices, accelerators, and architectures, while achieving maximum efficiency, to the problems that appear in distributed-memory systems in relation to communication and the partitioning of data structures.
    Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos). Doctorado en Informática.

    Automatic Performance Optimization on Heterogeneous Computer Systems using Manycore Coprocessors

    Emerging computer architectures and advanced computing technologies, such as Intel's Many Integrated Core (MIC) architecture and graphics processing units (GPUs), provide a promising way to employ parallelism for achieving high performance, scalability, and low power consumption. As a result, accelerators have become a crucial part of supercomputer design. Accelerators are usually equipped with different types of cores and memory, which compels application developers to pursue challenging performance goals. The added complexity has led to the development of task-based runtime systems, which allow complex computations to be expressed as task graphs and rely on scheduling algorithms to perform load balancing across all resources of the platform. Developing good scheduling algorithms, even on a single node, and analyzing them can thus have a very high impact on the performance of current HPC systems. Load-balancing strategies, at different levels, are critical to making effective use of heterogeneous hardware and to reducing the impact of communication on energy and performance. Implementing efficient load-balancing algorithms that can manage heterogeneous hardware is a challenging task, especially when combined with a parallel programming model for distributed-memory architectures. In this paper, we present several novel runtime approaches to determining the optimal data and task partition on heterogeneous platforms, targeting Intel Xeon Phi accelerated heterogeneous systems.
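    As a point of reference for what such a scheduler does, the following is a minimal greedy earliest-finish-time sketch: each task, assumed independent, goes to the resource that would complete it soonest, given per-worker cost estimates. The workers, costs, and names are hypothetical; the paper's runtime approaches are more sophisticated than this baseline.

        #include <cstdio>

        int main() {
            const int n_tasks = 8, n_workers = 3;
            // Hypothetical cost (seconds) of each task on each worker,
            // e.g. one CPU socket and two Xeon Phi coprocessors.
            double cost[n_tasks][n_workers] = {
                {4, 1, 1}, {4, 1, 1}, {4, 1, 1}, {4, 1, 1},
                {2, 3, 3}, {2, 3, 3}, {2, 3, 3}, {2, 3, 3},
            };
            double ready[n_workers] = {0, 0, 0};  // when each worker frees up

            for (int t = 0; t < n_tasks; ++t) {
                // Pick the worker that would finish this task earliest.
                int best = 0;
                double best_finish = ready[0] + cost[t][0];
                for (int w = 1; w < n_workers; ++w) {
                    double f = ready[w] + cost[t][w];
                    if (f < best_finish) { best_finish = f; best = w; }
                }
                ready[best] = best_finish;
                printf("task %d -> worker %d (finishes at %.1f)\n",
                       t, best, best_finish);
            }
            return 0;
        }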

    GPU Communication Performance Engineering for the Lattice Boltzmann Method

    Advisor: Prof. Dr. Daniel Weingaertner. Master's dissertation, Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defended: Curitiba, 10/08/2016. Includes references: f. 59-62. Area of concentration: computer science.
    Abstract: The increasing importance of GPUs for general-purpose computation on supercomputers makes good GPU support by High-Performance Computing (HPC) software frameworks such as waLBerla a valuable feature. waLBerla is a massively parallel software framework that supports a wide range of physical phenomena. Although it performs well on CPUs, tests have shown that its available GPU communication solutions perform poorly. In this work, we present solutions for improving waLBerla's performance, memory usage efficiency, and usability on GPU-based supercomputers. The proposed communication infrastructure for CUDA-enabled NVIDIA GPUs executed 25 times faster than the GPU communication mechanism previously available in waLBerla. Our solution for improving GPU memory usage efficiency requires only 55% of the memory needed by a naive approach, which makes it possible to run simulations with larger domains or to use fewer GPUs for a given domain size. In addition, as CUDA kernel performance proved to be very sensitive to the way data is accessed in GPU memory and to kernel implementation details, we propose a flexible domain indexing mechanism that allows thread block sizes to be configured. Finally, a Lattice Boltzmann Method (LBM) application was developed with highly optimized CUDA kernels in order to carry out all experiments and test all proposed solutions for waLBerla. Keywords: HPC, GPU, CUDA, Communication, Memory, Lattice Boltzmann Method, waLBerla.
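    To illustrate the kind of mechanism described, the sketch below packs one boundary face of a 3D field into a contiguous buffer, the usual first step of a GPU halo exchange, with the thread block shape left configurable at launch because kernel throughput depends on how threads map onto the memory layout. This is not waLBerla's actual API; the kernel names and the linearized layout are our assumptions.

        #include <cuda_runtime.h>

        // Pack the x = nx-1 boundary face of a 3D scalar field into a
        // contiguous send buffer. Assumed linearized layout:
        //   idx = (z * ny + y) * nx + x.
        __global__ void packFaceX(const float *field, float *buf,
                                  int nx, int ny, int nz) {
            int y = blockIdx.x * blockDim.x + threadIdx.x;
            int z = blockIdx.y * blockDim.y + threadIdx.y;
            if (y < ny && z < nz)
                buf[z * ny + y] = field[(z * ny + y) * nx + (nx - 1)];
        }

        // The block shape is a parameter, so different configurations can be
        // benchmarked against the same kernel and memory layout.
        void launchPack(const float *field, float *buf,
                        int nx, int ny, int nz, dim3 block) {
            dim3 grid((ny + block.x - 1) / block.x,
                      (nz + block.y - 1) / block.y);
            packFaceX<<<grid, block>>>(field, buf, nx, ny, nz);
        }

    A caller would then compare, say, launchPack(f, b, nx, ny, nz, dim3(32, 4)) against dim3(16, 16) and keep whichever block shape performs best for the given domain.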