137 research outputs found
NVIDIA Tensor Core Programmability, Performance & Precision
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called
"Tensor Core", that performs one matrix-multiply-and-accumulate on 4x4 matrices
per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta
microarchitecture, provides 640 Tensor Cores with a theoretical peak
performance of 125 Tflops/s in mixed precision. In this paper, we investigate
current approaches to programming NVIDIA Tensor Cores, their performance, and the
precision loss due to computation in mixed precision.
Currently, NVIDIA provides three different ways of programming
matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply
Accumulate (WMMA) API, CUTLASS, a templated library based on WMMA, and cuBLAS
GEMM. After experimenting with different approaches, we found that NVIDIA
Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100
GPU, seven and three times the performance in single and half precision
respectively. A WMMA implementation of batched GEMM reaches a performance of 4
Tflops/s. While precision loss due to matrix multiplication with half precision
input might be critical in many HPC applications, it can be considerably
reduced at the cost of increased computation. Our results indicate that HPC
applications using matrix multiplications can strongly benefit from using
NVIDIA Tensor Cores.
Comment: This paper has been accepted by the Eighth International Workshop on
Accelerators and Hybrid Exascale Systems (AsHES) 201
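The trade-off described in the abstract above (precision lost to half-precision inputs, recovered at the cost of extra computation) can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, Python floats (double precision) stand in for the wider accumulator, and the residual-correction scheme and all function names (`to_half`, `residual`, and so on) are assumptions chosen for illustration.

```python
import random
import struct


def to_half(x):
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def half_matrix(m):
    """Element-wise half-precision copy of matrix m."""
    return [[to_half(v) for v in row] for row in m]


def residual(m, m_half):
    """Half-precision copy of the rounding error m - m_half."""
    return [[to_half(v - vh) for v, vh in zip(row, row_h)]
            for row, row_h in zip(m, m_half)]


def matmul(a, b):
    """Plain matrix product; Python floats act as the wide accumulator."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]


def add(*ms):
    """Element-wise sum of equally shaped matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*ms)]


def max_abs_error(c, ref):
    """Largest absolute element-wise difference between c and ref."""
    return max(abs(x - y) for rc, rr in zip(c, ref) for x, y in zip(rc, rr))


random.seed(0)
n = 16  # illustrative size, not taken from the paper
a = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
b = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
reference = matmul(a, b)

# Half-precision inputs, wide accumulation: one product.
a_h, b_h = half_matrix(a), half_matrix(b)
plain = matmul(a_h, b_h)

# Assumed refinement: add products with the rounded-off residuals,
# i.e. three products instead of one.
r_a, r_b = residual(a, a_h), residual(b, b_h)
refined = add(plain, matmul(r_a, b_h), matmul(a_h, r_b))

err_plain = max_abs_error(plain, reference)
err_refined = max_abs_error(refined, reference)
print("max error, half inputs:", err_plain)
print("max error, refined:    ", err_refined)
```

In this sketch the refined result costs three matrix products instead of one, which is the kind of extra computation the abstract refers to; the exact scheme and costs in the paper may differ.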
TensorFlow Doing HPC
TensorFlow is a popular emerging open-source programming framework supporting
the execution of distributed applications on heterogeneous hardware. While
TensorFlow was initially designed for developing Machine Learning (ML)
applications, it aims to support the development of a much
broader range of applications that are outside the ML domain and can
possibly include HPC applications. However, very few experiments have been
conducted to evaluate TensorFlow performance when running HPC workloads on
supercomputers. This work addresses this gap by designing four traditional HPC
benchmark applications: STREAM, matrix-matrix multiply, Conjugate Gradient (CG)
solver and Fast Fourier Transform (FFT). We analyze their performance on two
supercomputers with accelerators and evaluate the potential of TensorFlow for
developing HPC applications. Our tests show that TensorFlow can fully take
advantage of high performance networks and accelerators on supercomputers.
Running our TensorFlow STREAM benchmark, we obtain over 50% of theoretical
communication bandwidth on our testing platform. We find an approximately 2x,
1.7x and 1.8x performance improvement when increasing the number of GPUs from
two to four in the matrix-matrix multiply, CG and FFT applications
respectively. All our performance results demonstrate that TensorFlow has high
potential to emerge as an HPC programming framework for heterogeneous
supercomputers.
Comment: Accepted for publication at The Ninth International Workshop on
Accelerators and Hybrid Exascale Systems (AsHES'19
Towards enhancing coding productivity for GPU programming using static graphs
The main contribution of this work is to increase the coding productivity of GPU programming by using the concept of Static Graphs. GPU capabilities have been increasing significantly in terms of performance and memory capacity. However, there are still some problems in terms of scalability and limitations to the amount of work that a GPU can perform at a time. To minimize the overhead associated with the launch of GPU kernels, as well as to maximize the use of GPU capacity, we have combined the new CUDA Graph API with the CUDA programming model (including CUDA math libraries) and the OpenACC programming model. We use as test cases two different, well-known and widely used problems in HPC and AI: the Conjugate Gradient method and the Particle Swarm Optimization. In the first test case (Conjugate Gradient) we focus on the integration of Static Graphs with CUDA. In this case, we are able to significantly outperform the NVIDIA reference code, reaching an acceleration of up to 11× thanks to a better implementation, which can benefit from the new CUDA Graph capabilities. In the second test case (Particle Swarm Optimization), we complement the OpenACC functionality with the use of CUDA Graph, achieving again accelerations of up to one order of magnitude, with average speedups ranging from 2× to 4×, and performance very close to a reference and optimized CUDA code. Our main target is to achieve a higher coding productivity model for GPU programming by using Static Graphs, which provides, in a very transparent way, a better exploitation of the GPU capacity. The combination of using Static Graphs with two of the current most important GPU programming models (CUDA and OpenACC) is able to reduce considerably the execution time w.r.t. the use of CUDA and OpenACC only, achieving accelerations of up to more than one order of magnitude. 
Finally, we propose an interface to incorporate the concept of Static Graphs into the OpenACC Specifications.
This research was funded by the EPEEC project from the European Union’s Horizon 2020 Research and Innovation program under grant agreement No. 801051. This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan, accessed on 13 April 2022).
Peer Reviewed
Postprint (published version
Evaluation of OpenAI Codex for HPC Parallel Programming Models Kernel Generation
We evaluate AI-assisted generative capabilities on fundamental numerical
kernels in high-performance computing (HPC), including AXPY, GEMV, GEMM, SpMV,
Jacobi Stencil, and CG. We test the generated kernel codes for a variety of
language-supported programming models, including (1) C++ (e.g., OpenMP
[including offload], OpenACC, Kokkos, SyCL, CUDA, and HIP), (2) Fortran (e.g.,
OpenMP [including offload] and OpenACC), (3) Python (e.g., numpy, Numba, cuPy,
and pyCUDA), and (4) Julia (e.g., Threads, CUDA.jl, AMDGPU.jl, and
KernelAbstractions.jl). We use the GitHub Copilot capabilities powered by
OpenAI Codex available in Visual Studio Code as of April 2023 to generate a
vast amount of implementations given simple <kernel> + <programming model> +
<optional hints> prompt variants. To quantify and compare the results, we
propose a proficiency metric around the initial 10 suggestions given for each
prompt. Results suggest that the OpenAI Codex outputs for C++ correlate with
the adoption and maturity of programming models. For example, OpenMP and CUDA
score really high, whereas HIP is still lacking. We found that prompts from
either a targeted language such as Fortran or the more general-purpose Python
can benefit from adding code keywords, while Julia prompts perform acceptably
well for its mature programming models (e.g., Threads and CUDA.jl). We expect
these benchmarks to provide a point of reference for each programming
model's community. Overall, understanding the convergence of large language
models, AI, and HPC is crucial due to its rapidly evolving nature and how it is
redefining human-computer interactions.
Comment: Accepted at the Sixteenth International Workshop on Parallel
Programming Models and Systems Software for High-End Computing (P2S2), 2023
to be held in conjunction with ICPP 2023: The 52nd International Conference
on Parallel Processing. 10 pages, 6 figures, 5 table
Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation
We evaluate the use of the open-source Llama-2 model for generating
well-known, high-performance computing kernels (e.g., AXPY, GEMV, GEMM) on
different parallel programming models and languages (e.g., C++: OpenMP, OpenMP
Offload, OpenACC, CUDA, HIP; Fortran: OpenMP, OpenMP Offload, OpenACC; Python:
numpy, Numba, pyCUDA, cuPy; and Julia: Threads, CUDA.jl, AMDGPU.jl). We built
upon our previous work that is based on the OpenAI Codex, which is a descendant
of GPT-3, to generate similar kernels with simple prompts via GitHub Copilot.
Our goal is to compare the accuracy of Llama-2 and our original GPT-3 baseline
by using a similar metric. Llama-2 has a simplified model that shows
competitive or even superior accuracy. We also report on the differences
between these foundational large language models as generative AI continues to
redefine human-computer interactions. Overall, Copilot generates codes that are
more reliable but less optimized, whereas codes generated by Llama-2 are less
reliable but more optimized when correct.
Comment: Accepted at LCPC 2023, The 36th International Workshop on Languages
and Compilers for Parallel Computing http://www.lcpcworkshop.org/LCPC23/ . 13
pages, 5 figures, 1 tabl
Julia as a unifying end-to-end workflow language on the Frontier exascale system
We evaluate Julia as a single language and ecosystem paradigm powered by LLVM
to develop workflow components for high-performance computing. We run a
Gray-Scott, 2-variable diffusion-reaction application using a memory-bound,
7-point stencil kernel on Frontier, the US Department of Energy's first
exascale supercomputer. We evaluate the performance, scaling, and trade-offs of
(i) the computational kernel on AMD's MI250x GPUs, (ii) weak scaling up to
4,096 MPI processes/GPUs or 512 nodes, (iii) parallel I/O writes using the
ADIOS2 library bindings, and (iv) Jupyter Notebooks for interactive analysis.
Results suggest that although Julia generates a reasonable LLVM-IR, a nearly
50% performance difference exists vs. native AMD HIP stencil codes when running
on the GPUs. As expected, we observed near-zero overhead when using MPI and
parallel I/O bindings for system-wide installed implementations. Consequently,
Julia emerges as a compelling high-performance and high-productivity workflow
composition language, as measured on the fastest supercomputer in the world.
Comment: 11 pages, 8 figures, accepted at the 18th Workshop on Workflows in
Support of Large-Scale Science (WORKS23), IEEE/ACM The International
Conference for High Performance Computing, Networking, Storage, and Analysis,
SC2
Impact of ultraviolet radiation on marine crustacean zooplankton and ichthyoplankton: a synthesis of results from the estuary and Gulf of St. Lawrence, Canada
The objectives of the research program reported upon here were (1) to measure ambient levels of UV radiation and
determine which variables most strongly affected its attenuation in the waters of the estuary and Gulf of St. Lawrence, Canada; and
(2) to investigate the potential direct impacts of UV radiation on species of crustacean zooplankton and fish whose early life stages
are planktonic. In this geographic region, productivity-determining biophysical interactions occur in the upper 0 to 30 m of the
water column. Measurements of the diffuse attenuation coefficients for ultraviolet-B radiation (UV-B, 280 to 320 nm) at various
locations in this region indicated maximum 10% depths (the depth to which 10% of the surface energy penetrates at a given wavelength)
of 3 to 4 m at a wavelength of 310 nm. Organisms residing in this layer, including the eggs and larvae of Calanus finmarchicus
and Atlantic cod Gadus morhua, are exposed to biologically damaging levels of UV radiation. As a result of these physical
and biological characteristics, this system offered a relevant opportunity to assess the impacts of UV on subarctic marine
ecosystems. Eggs of C. finmarchicus were incubated under the sun, with and without the UV-B and/or UV-A (320 to 400 nm) wavebands.
UV-exposed eggs exhibited low percent hatching compared to those protected from UV: UV radiation had a strong negative
impact on C. finmarchicus eggs. Further, percent hatching in UV-B-exposed eggs was not significantly lower than that in eggs
exposed to UV-A only: under natural sunlight, UV-A radiation appeared to be more detrimental to C. finmarchicus embryos than
was UV-B. In analogous experiments with Atlantic cod eggs, exposure to UV-B produced a significant negative effect. However,
UV-A had no negative effect on cod eggs. Additional experiments using a solar simulator (SS) revealed high wavelength-dependent
mortality in both C. finmarchicus and cod embryos exposed to UV. The strongest effects occurred under exposures to wavelengths
below 312 nm. At the shorter wavelengths (<305 nm) UV-B-induced mortality was strongly dose-dependent, but (for both
C. finmarchicus and cod) not significantly influenced by dose-rate. Thus, at least within the limits of the exposures under which the
biological weighting functions (BWFs) were generated, reciprocity held. The BWFs derived for UV-B-induced mortality in C. finmarchicus
and cod eggs were similar in shape to the action spectrum for UV-B effects on naked DNA. Further, the wavelength dependence
of DNA damage was similar to that for the mortality effect. These observations suggest that UV-induced mortality in
C. finmarchicus and cod eggs is a direct result of DNA damage. There was no evidence of a detrimental effect of UV-A radiation in
these SS-derived results. A mathematical model that includes the BWFs, vertical mixing of eggs, meteorological and hydrographic
conditions, and ozone depletion, indicates that UV-induced mortality in the C. finmarchicus egg population could be as high as
32.5%, while the impact on the cod egg population was no more than 1.2%. Variability in cloud cover, water transparency (and the
variables that affect it), and vertical distribution and displacement of planktonic organisms within the mixed layer can all have a
greater effect on the flux of UV-B radiation to which they are exposed than will ozone layer depletion at these latitudes. Our observations
indicate that C. finmarchicus and cod eggs present in the first meter of the water column (likely only a small percentage of
the total egg populations) are susceptible to UV radiation. However, although exposure to UV can negatively impact crustacean
zooplankton and ichthyoplankton populations, these direct effects are likely minimal within the context of all the other environmental
factors that produce the very high levels of mortality typically observed in their planktonic early life stages. The impact of
indirect effects (which may well be of much greater import) has yet to be evaluated.
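The "10% depth" used in the abstract above relates directly to the diffuse attenuation coefficient if irradiance is assumed to decay as a single exponential with depth, E(z) = E(0) exp(-K_d z). The sketch below works through that arithmetic. The exponential model with a depth-independent K_d is an assumption not stated in the abstract, and the function names are hypothetical.

```python
import math


def fraction_at_depth(k_d, depth_m):
    """Fraction of surface irradiance remaining at depth_m (metres),
    assuming E(z) = E(0) * exp(-k_d * z) with constant k_d (1/m)."""
    return math.exp(-k_d * depth_m)


def depth_for_fraction(k_d, fraction=0.10):
    """Depth (m) at which the given fraction of surface irradiance remains."""
    if k_d <= 0 or not 0 < fraction < 1:
        raise ValueError("need k_d > 0 and 0 < fraction < 1")
    return -math.log(fraction) / k_d


def attenuation_from_depth(depth_m, fraction=0.10):
    """Attenuation coefficient (1/m) implied by a known fraction depth."""
    if depth_m <= 0 or not 0 < fraction < 1:
        raise ValueError("need depth_m > 0 and 0 < fraction < 1")
    return -math.log(fraction) / depth_m


# The abstract reports maximum 10% depths of 3 to 4 m at 310 nm.
for z10 in (3.0, 4.0):
    k_d = attenuation_from_depth(z10)
    print(f"10% depth {z10} m -> K_d = {k_d:.3f} per metre, "
          f"fraction left at 1 m = {fraction_at_depth(k_d, 1.0):.2f}")
```

Under this assumption, a 10% depth of 3 to 4 m corresponds to K_d = ln(10)/z, which is the only quantity the sketch derives; it does not reproduce the paper's mortality model.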