Solving Wave Equations on Unstructured Geometries
Waves are all around us--be it in the form of sound, electromagnetic
radiation, water waves, or earthquakes. Their study is an important basic tool
across engineering and science disciplines. Every wave solver serving the
computational study of waves meets a trade-off of two figures of merit--its
computational speed and its accuracy. Discontinuous Galerkin (DG) methods fall
on the high-accuracy end of this spectrum. Fortuitously, their computational
structure is so ideally suited to GPUs that they also achieve very high
computational speeds. In other words, the use of DG methods on GPUs
significantly lowers the cost of obtaining accurate solutions. This article
aims to give the reader an easy on-ramp to the use of this technology, based on
a sample implementation which demonstrates a highly accurate, GPU-capable,
real-time visualizing finite element solver in about 1500 lines of code.
Comment: GPU Computing Gems, edited by Wen-mei Hwu, Elsevier (2011), ISBN
9780123859631, Chapter 1
PyCOOL - a Cosmological Object-Oriented Lattice code written in Python
There are a number of different phenomena in the early universe that have to
be studied numerically with lattice simulations. This paper presents a graphics
processing unit (GPU) accelerated Python program called PyCOOL that solves the
evolution of scalar fields in a lattice with very precise symplectic
integrators. The program has been written with the intention to hit a sweet
spot of speed, accuracy and user friendliness. This has been achieved by using
the Python language with the PyCUDA interface to make a program that is easy to
adapt to different scalar field models. In this paper we derive the symplectic
dynamics that govern the evolution of the system and then present the
implementation of the program in Python and PyCUDA. The functionality of the
program is tested in a chaotic inflation preheating model, a single field
oscillon case and in a supersymmetric curvaton model which leads to Q-ball
production. We have also compared the performance of a consumer graphics card
to a professional Tesla compute card in these simulations. We find that the
program is not only accurate but also very fast. To further increase the
usefulness of the program we have equipped it with numerous post-processing
functions that provide useful information about the cosmological model. These
include various spectra and statistics of the fields. The program can be
additionally used to calculate the generated curvature perturbation. The
program is publicly available under GNU General Public License at
https://github.com/jtksai/PyCOOL . Some additional information can be found
at http://www.physics.utu.fi/tiedostot/theory/particlecosmology/pycool/ .
Comment: 23 pages, 12 figures; some typos corrected
FP8 Formats for Deep Learning
FP8 is a natural progression for accelerating deep learning training and
inference beyond the 16-bit formats common in modern processors. In this paper
we propose an 8-bit floating point (FP8) binary interchange format consisting
of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit
exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for
representation of special values, E4M3's dynamic range is extended by not
representing infinities and having only one mantissa bit-pattern for NaNs. We
demonstrate the efficacy of the FP8 format on a variety of image and language
tasks, effectively matching the result quality achieved by 16-bit training
sessions. Our study covers the main modern neural network architectures - CNNs,
RNNs, and Transformer-based models, leaving all the hyperparameters unchanged
from the 16-bit baseline training sessions. Our training experiments include
large, up to 175B parameter, language models. We also examine FP8
post-training-quantization of language models trained using 16-bit formats that
resisted fixed-point int8 quantization.
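The two encodings described in this abstract can be illustrated with a small decoder. This is a minimal sketch, not the authors' reference implementation. The abstract does not state the exponent biases; 7 for E4M3 and 15 for E5M2 are assumed here. The function names `decode_fp8`, `decode_e4m3`, and `decode_e5m2` are hypothetical.

```python
import math


def decode_fp8(byte, exp_bits, man_bits, bias, ieee_specials):
    """Decode one 8-bit pattern (sign | exponent | mantissa) to a Python float.

    ieee_specials=True  : all-ones exponent encodes inf/NaN (E5M2 behaviour).
    ieee_specials=False : only the all-ones exponent AND all-ones mantissa
                          pattern is NaN; no infinities (E4M3 behaviour).
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_all_ones = (1 << exp_bits) - 1
    man_all_ones = (1 << man_bits) - 1

    if ieee_specials and exp == exp_all_ones:
        return sign * math.inf if man == 0 else math.nan
    if not ieee_specials and exp == exp_all_ones and man == man_all_ones:
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading 1.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    # Normal: implicit leading 1.
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def decode_e4m3(byte):
    # Assumed bias of 7 (not given in the abstract).
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_specials=False)


def decode_e5m2(byte):
    # Assumed bias of 15 (not given in the abstract).
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_specials=True)


if __name__ == "__main__":
    print(decode_e4m3(0x7E))  # largest finite E4M3 value under these assumptions
    print(decode_e5m2(0x7B))  # largest finite E5M2 value under these assumptions
```

Under these assumptions, giving up infinities and all but one NaN mantissa pattern lets E4M3 use the all-ones exponent for ordinary values, which is the range extension the abstract refers to.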
Simulation of reaction-diffusion processes in three dimensions using CUDA
Numerical solution of reaction-diffusion equations in three dimensions is one
of the most challenging applied mathematical problems. Since these simulations
are very time consuming, any ideas and strategies aiming at the reduction of
CPU time are important topics of research. A general and robust idea is the
parallelization of source codes/programs. Recently, the technological
development of graphics hardware created a possibility to use desktop video
cards to solve numerically intensive problems. We present a powerful parallel
computing framework to solve reaction-diffusion equations numerically using the
Graphics Processing Units (GPUs) with CUDA. Four different reaction-diffusion
problems, (i) diffusion of a chemically inert compound, (ii) Turing pattern
formation, (iii) phase separation in the wake of a moving diffusion front and
(iv) air pollution dispersion were solved, and additionally both the Shared
method and the Moving Tiles method were tested. Our results show that parallel
implementation achieves typical acceleration values in the order of 5-40 times
compared to CPU using a single-threaded implementation on a 2.8 GHz desktop
computer.
Comment: 8 figures, 5 tables
MLPerf Inference Benchmark
Machine-learning (ML) hardware and software system demand is burgeoning.
Driven by ML applications, the number of different ML inference systems has
exploded. Over 100 organizations are building ML inference chips, and the
systems that incorporate existing models span at least three orders of
magnitude in power consumption and five orders of magnitude in performance;
they range from embedded devices to data-center solutions. Fueling the hardware
are a dozen or more software frameworks and libraries. The myriad combinations
of ML hardware and ML software make assessing ML-system performance in an
architecture-neutral, representative, and reproducible manner challenging.
There is a clear need for industry-wide standard ML benchmarking and evaluation
criteria. MLPerf Inference answers that call. In this paper, we present our
benchmarking method for evaluating ML inference systems. Driven by more than 30
organizations as well as more than 200 ML engineers and practitioners, MLPerf
prescribes a set of rules and best practices to ensure comparability across
systems with wildly differing architectures. The first call for submissions
garnered more than 600 reproducible inference-performance measurements from 14
organizations, representing over 30 systems that showcase a wide range of
capabilities. The submissions attest to the benchmark's flexibility and
adaptability.
Comment: ISCA 202
A Full-Depth Amalgamated Parallel 3D Geometric Multigrid Solver for GPU Clusters
Numerical computations of incompressible flow equations with pressure-based algorithms necessitate the solution of an elliptic Poisson equation, for which multigrid methods are known to be very efficient. In our previous work we presented a dual-level (MPI-CUDA) parallel implementation of the Navier-Stokes equations to simulate buoyancy-driven incompressible fluid flows on GPU clusters with simple iterative methods while focusing on the scalability of the overall solver. In the present study we describe the implementation and performance of a multigrid method to solve the pressure Poisson equation within our MPI-CUDA parallel incompressible flow solver. Various design decisions and algorithmic choices for multigrid methods are explored in light of NVIDIA’s recent Fermi architecture. We discuss how unique aspects of an MPI-CUDA implementation for GPU clusters are related to the software choices made to implement the multigrid method. We propose a new coarse grid solution method of embedded multigrid with amalgamation and show that the parallel implementation retains the numerical efficiency of the multigrid method. Performance measurements on the NCSA Lincoln and TACC Longhorn clusters are presented for up to 64 GPUs.
Seismic Wave Propagation Simulations on Low-power and Performance-centric Manycores
The large processing requirements of seismic wave propagation simulations make High Performance Computing (HPC) architectures a natural choice for their execution. However, to keep both the current pace of performance improvements and the power consumption under a strict power budget, HPC systems must be more energy efficient than ever. As a response to this need, energy-efficient and low-power processors began to make their way into the market. In this paper we employ a novel low-power processor, the MPPA-256 manycore, to perform seismic wave propagation simulations. It has 256 cores connected by a NoC, no cache-coherence and only a limited amount of on-chip memory. We describe how its particular architectural characteristics influenced our solution for an energy-efficient implementation. As a counterpoint to the low-power MPPA-256 architecture, we employ Xeon Phi, a performance-centric manycore. Although both processors share some architectural similarities, the challenges to implement an efficient seismic wave propagation kernel on these platforms are very different. In this work we compare the performance and energy efficiency of our implementations for these processors to proven and optimized solutions for other hardware platforms such as general-purpose processors and a GPU. Our experimental results show that MPPA-256 has the best energy efficiency, consuming at least 77 % less energy than the other evaluated platforms, whereas the performance of our solution for the Xeon Phi is on par with a state-of-the-art solution for GPUs.