137 research outputs found

    NVIDIA Tensor Core Programmability, Performance & Precision

    Full text link
    The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called "Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to program NVIDIA Tensor Cores, their performances and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API, CUTLASS, a templated library based on WMMA, and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While precision loss due to matrix multiplication with half precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using of NVIDIA Tensor Cores.Comment: This paper has been accepted by the Eighth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 201

    TensorFlow Doing HPC

    Full text link
    TensorFlow is a popular emerging open-source programming framework supporting the execution of distributed applications on heterogeneous hardware. While TensorFlow has been initially designed for developing Machine Learning (ML) applications, in fact TensorFlow aims at supporting the development of a much broader range of application kinds that are outside the ML domain and can possibly include HPC applications. However, very few experiments have been conducted to evaluate TensorFlow performance when running HPC workloads on supercomputers. This work addresses this lack by designing four traditional HPC benchmark applications: STREAM, matrix-matrix multiply, Conjugate Gradient (CG) solver and Fast Fourier Transform (FFT). We analyze their performance on two supercomputers with accelerators and evaluate the potential of TensorFlow for developing HPC applications. Our tests show that TensorFlow can fully take advantage of high performance networks and accelerators on supercomputers. Running our TensorFlow STREAM benchmark, we obtain over 50% of theoretical communication bandwidth on our testing platform. We find an approximately 2x, 1.7x and 1.8x performance improvement when increasing the number of GPUs from two to four in the matrix-matrix multiply, CG and FFT applications respectively. All our performance results demonstrate that TensorFlow has high potential of emerging also as HPC programming framework for heterogeneous supercomputers.Comment: Accepted for publication at The Ninth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES'19

    Towards enhancing coding productivity for GPU programming using static graphs

    Get PDF
    The main contribution of this work is to increase the coding productivity of GPU programming by using the concept of Static Graphs. GPU capabilities have been increasing significantly in terms of performance and memory capacity. However, there are still some problems in terms of scalability and limitations to the amount of work that a GPU can perform at a time. To minimize the overhead associated with the launch of GPU kernels, as well as to maximize the use of GPU capacity, we have combined the new CUDA Graph API with the CUDA programming model (including CUDA math libraries) and the OpenACC programming model. We use as test cases two different, well-known and widely used problems in HPC and AI: the Conjugate Gradient method and the Particle Swarm Optimization. In the first test case (Conjugate Gradient) we focus on the integration of Static Graphs with CUDA. In this case, we are able to significantly outperform the NVIDIA reference code, reaching an acceleration of up to 11× thanks to a better implementation, which can benefit from the new CUDA Graph capabilities. In the second test case (Particle Swarm Optimization), we complement the OpenACC functionality with the use of CUDA Graph, achieving again accelerations of up to one order of magnitude, with average speedups ranging from 2× to 4×, and performance very close to a reference and optimized CUDA code. Our main target is to achieve a higher coding productivity model for GPU programming by using Static Graphs, which provides, in a very transparent way, a better exploitation of the GPU capacity. The combination of using Static Graphs with two of the current most important GPU programming models (CUDA and OpenACC) is able to reduce considerably the execution time w.r.t. the use of CUDA and OpenACC only, achieving accelerations of up to more than one order of magnitude. Finally, we propose an interface to incorporate the concept of Static Graphs into the OpenACC Specifications.his research was funded by EPEEC project from the European Union’s Horizon 2020 Research and Innovation program under grant agreement No. 801051. This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan, accessed on 13 April 2022).Peer ReviewedPostprint (published version

    Evaluation of OpenAI Codex for HPC Parallel Programming Models Kernel Generation

    Full text link
    We evaluate AI-assisted generative capabilities on fundamental numerical kernels in high-performance computing (HPC), including AXPY, GEMV, GEMM, SpMV, Jacobi Stencil, and CG. We test the generated kernel codes for a variety of language-supported programming models, including (1) C++ (e.g., OpenMP [including offload], OpenACC, Kokkos, SyCL, CUDA, and HIP), (2) Fortran (e.g., OpenMP [including offload] and OpenACC), (3) Python (e.g., numba, Numba, cuPy, and pyCUDA), and (4) Julia (e.g., Threads, CUDA.jl, AMDGPU.jl, and KernelAbstractions.jl). We use the GitHub Copilot capabilities powered by OpenAI Codex available in Visual Studio Code as of April 2023 to generate a vast amount of implementations given simple + + prompt variants. To quantify and compare the results, we propose a proficiency metric around the initial 10 suggestions given for each prompt. Results suggest that the OpenAI Codex outputs for C++ correlate with the adoption and maturity of programming models. For example, OpenMP and CUDA score really high, whereas HIP is still lacking. We found that prompts from either a targeted language such as Fortran or the more general-purpose Python can benefit from adding code keywords, while Julia prompts perform acceptably well for its mature programming models (e.g., Threads and CUDA.jl). We expect for these benchmarks to provide a point of reference for each programming model's community. Overall, understanding the convergence of large language models, AI, and HPC is crucial due to its rapidly evolving nature and how it is redefining human-computer interactions.Comment: Accepted at the Sixteenth International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), 2023 to be held in conjunction with ICPP 2023: The 52nd International Conference on Parallel Processing. 10 pages, 6 figures, 5 table

    Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation

    Full text link
    We evaluate the use of the open-source Llama-2 model for generating well-known, high-performance computing kernels (e.g., AXPY, GEMV, GEMM) on different parallel programming models and languages (e.g., C++: OpenMP, OpenMP Offload, OpenACC, CUDA, HIP; Fortran: OpenMP, OpenMP Offload, OpenACC; Python: numpy, Numba, pyCUDA, cuPy; and Julia: Threads, CUDA.jl, AMDGPU.jl). We built upon our previous work that is based on the OpenAI Codex, which is a descendant of GPT-3, to generate similar kernels with simple prompts via GitHub Copilot. Our goal is to compare the accuracy of Llama-2 and our original GPT-3 baseline by using a similar metric. Llama-2 has a simplified model that shows competitive or even superior accuracy. We also report on the differences between these foundational large language models as generative AI continues to redefine human-computer interactions. Overall, Copilot generates codes that are more reliable but less optimized, whereas codes generated by Llama-2 are less reliable but more optimized when correct.Comment: Accepted at LCPC 2023, The 36th International Workshop on Languages and Compilers for Parallel Computing http://www.lcpcworkshop.org/LCPC23/ . 13 pages, 5 figures, 1 tabl

    Julia as a unifying end-to-end workflow language on the Frontier exascale system

    Full text link
    We evaluate Julia as a single language and ecosystem paradigm powered by LLVM to develop workflow components for high-performance computing. We run a Gray-Scott, 2-variable diffusion-reaction application using a memory-bound, 7-point stencil kernel on Frontier, the US Department of Energy's first exascale supercomputer. We evaluate the performance, scaling, and trade-offs of (i) the computational kernel on AMD's MI250x GPUs, (ii) weak scaling up to 4,096 MPI processes/GPUs or 512 nodes, (iii) parallel I/O writes using the ADIOS2 library bindings, and (iv) Jupyter Notebooks for interactive analysis. Results suggest that although Julia generates a reasonable LLVM-IR, a nearly 50% performance difference exists vs. native AMD HIP stencil codes when running on the GPUs. As expected, we observed near-zero overhead when using MPI and parallel I/O bindings for system-wide installed implementations. Consequently, Julia emerges as a compelling high-performance and high-productivity workflow composition language, as measured on the fastest supercomputer in the world.Comment: 11 pages, 8 figures, accepted at the 18th Workshop on Workflows in Support of Large-Scale Science (WORKS23), IEEE/ACM The International Conference for High Performance Computing, Networking, Storage, and Analysis, SC2

    Impact of ultraviolet radiation on marine crustacean zooplankton and ichthyoplankton: a synthesis of results from the estuary and Gulf of St. Lawrence, Canada

    Get PDF
    The objectives of the research program reported upon here were (1) to measure ambient levels of UV radiation and determine whichvariables most strongly affected its attenuation in the waters of the estuary and Gulf of St. Lawrence, Canada; and (2) to investigate the potential direct impacts of W radiation on species of crustacean zooplankton and fish whose early life stages are planktonic. In this geographic region, productivity-determining biophysical interactions occur in the upper 0 to 30 m of the water column. Measurements of the diffuse attenuation coefficients for ultraviolet-B radiation (W-B, 280 to 320 nm) at various locations in this region indicated maximum 10% depths (the depth to which 10% of the surface energy penetrates at a given wavelength) of 3 to 4 m at a wavelength of 310 nm. Organisms residing in this layer-including the eggs and larvae of Calanus finmarchicus and Atlantic cod Gadus morhua-are exposed to biologically damaging levels of W radiation. As a result of these physical and biological characteristics, this system offered a relevant opportunity to assess the impacts of UV on subarctic marine ecosystems. Eggs of C. finmarchicus were incubated under the sun, with and without the W-B and/or UV-A (320 to 400 nm) wavebands. W-exposed eggs exhibited low percent hatchmg compared to those protected from W : W radiation had a strong negative impact on C. finmarchicus eggs. Further, percent hatching in W-B-exposed eggs was not significantly lower than that in eggs exposed to UV-A only: under natural sunlight, UV-A radiation appeared to be more detrimental to C. finmarchicus embryos than was UV-B. In analogous experiments with Atlantic cod eggs, exposure to UV-B produced a significant negative effect. However, UV-A had no negative effect on cod eggs. Additional experiments using a solar simulator (SS) revealed high wavelength-dependent mortality in both C. finmarchicus and cod embryos exposed to UV. The strongest effects occurred under exposures to wavelengths below 312 nm. At the shorter wavelengths (<305 nm) UV-B-induced mortality was strongly dose-dependent, but (for both C. finmarchicus and cod) not significantly influenced by dose-rate. Thus, at least within the limits of the exposures under which the biological weighting functions (BWFs) were generated, reciprocity held. The BWFs derived for UV-B-induced mortality in C. finmarchicus and cod eggs were similar in shape to the action spectrum for UV-B effects on naked DNA. Further, the wavelengthdependence of DNA damage was similar to that for the mortality effect. These observations suggest that W-induced mortality in C. finmarchicus and cod eggs is a direct result of DNA damage. There was no evidence of a detrimental effect of UV-A radiation in these SS-derived results. A mathematical model that includes the BWFs, vertical mixing of eggs, meteorological and hydrographic conditions, and ozone depletion, indicates that W-induced mortality in the C. finmarchicus egg population could be as high as 32.5 %, while the impact on the cod egg population was no more than 1.2%. Variability in cloud cover, water transparency (and the variables that affect it), and vertical distribution and displacement of planktonic organisms within the mixed layer can all have a greater effect on the flux of UV-B radiation to which they are exposed than will ozone layer depletion at these latitudes. Our observations indicate that C, finmarchicus and cod eggs present in the first meter of the water column (likely only a small percentage of the total egg populations) are susceptible to W radiation. However, although exposure to UV can negatively impact crustacean zooplankton and ichthyoplankton populations, these direct effects are likely minimal within the context of all the other environmental factors that produce the very high levels of mortality typically observed in their planktonic early life stages. The impact of indnect effects-which may well be of much greater import-has yet to be evaluated
    • …
    corecore