64 research outputs found

Solving Wave Equations on Unstructured Geometries

Author: Dubiner
Hesthaven
Klöckner
Klöckner
Koornwinder
Micikevicius
Mohammadian
Warburton
Warburton
Publication date: 01/01/2012

Waves are all around us, be it in the form of sound, electromagnetic radiation, water waves, or earthquakes. Their study is an important basic tool across engineering and science disciplines. Every wave solver serving the computational study of waves meets a trade-off between two figures of merit: its computational speed and its accuracy. Discontinuous Galerkin (DG) methods fall on the high-accuracy end of this spectrum. Fortuitously, their computational structure is so ideally suited to GPUs that they also achieve very high computational speeds. In other words, the use of DG methods on GPUs significantly lowers the cost of obtaining accurate solutions. This article aims to give the reader an easy on-ramp to the use of this technology, based on a sample implementation which demonstrates a highly accurate, GPU-capable, real-time visualizing finite element solver in about 1500 lines of code.

Comment: GPU Computing Gems, edited by Wen-mei Hwu, Elsevier (2011), ISBN 9780123859631, Chapter 1

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Crossref

PyCOOL - a Cosmological Object-Oriented Lattice code written in Python

Author: A. Chambers
A. Chambers
A. Klöckner
A.V. Frolov
D. Groen
E. Gaburov
G. Khanna
G.N. Felder
H.-Y. Schive
J Sainio
K.-I. Ishikawa
M.A. Amin
N. Nakasato
NVIDIA
P. Micikevicius
R. Capuzzo-Dolcetta
R. Easther
R.J. Brunner
S. Banerjee
S. Ord
S. von Hoerner
S. von Hoerner
S.K. Chung
T. Hiramatsu
T. Szalay
V. Anselmi
V. Demchik
Publication venue: 'IOP Publishing'
Publication date: 30/04/2012

There are a number of different phenomena in the early universe that have to be studied numerically with lattice simulations. This paper presents a graphics processing unit (GPU) accelerated Python program called PyCOOL that solves the evolution of scalar fields in a lattice with very precise symplectic integrators. The program has been written with the intention to hit a sweet spot of speed, accuracy and user friendliness. This has been achieved by using the Python language with the PyCUDA interface to make a program that is easy to adapt to different scalar field models. In this paper we derive the symplectic dynamics that govern the evolution of the system and then present the implementation of the program in Python and PyCUDA. The functionality of the program is tested in a chaotic inflation preheating model, a single field oscillon case and in a supersymmetric curvaton model which leads to Q-ball production. We have also compared the performance of a consumer graphics card to a professional Tesla compute card in these simulations. We find that the program is not only accurate but also very fast. To further increase the usefulness of the program we have equipped it with numerous post-processing functions that provide useful information about the cosmological model. These include various spectra and statistics of the fields. The program can additionally be used to calculate the generated curvature perturbation. The program is publicly available under the GNU General Public License at https://github.com/jtksai/PyCOOL . Some additional information can be found at http://www.physics.utu.fi/tiedostot/theory/particlecosmology/pycool/ .

Comment: 23 pages, 12 figures; some typos corrected
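The abstract above relies on symplectic integrators. As a rough illustration of the idea only, the following sketch applies a second-order kick-drift-kick (leapfrog) scheme to a single homogeneous field mode with a quadratic potential. It is not PyCOOL's integrator: the function names, the potential, and the absence of lattice gradients and cosmic expansion are all simplifying assumptions made here.

```python
def leapfrog(phi, pi, mass, dt, steps):
    """Advance (phi, pi) with a kick-drift-kick leapfrog scheme.

    Hypothetical toy model: H = pi**2 / 2 + mass**2 * phi**2 / 2,
    i.e. one field mode with no gradients and no expansion.
    """
    for _ in range(steps):
        pi -= 0.5 * dt * mass**2 * phi  # half kick: dpi/dt = -dV/dphi
        phi += dt * pi                  # full drift: dphi/dt = pi
        pi -= 0.5 * dt * mass**2 * phi  # second half kick
    return phi, pi


def energy(phi, pi, mass):
    """Total energy of the toy Hamiltonian."""
    return 0.5 * pi**2 + 0.5 * mass**2 * phi**2


if __name__ == "__main__":
    phi, pi = leapfrog(1.0, 0.0, 1.0, 0.01, 10000)
    print("energy drift:", energy(phi, pi, 1.0) - 0.5)
```

The point of a symplectic scheme is that the energy error stays bounded over long runs instead of growing steadily, which is why the drift printed above remains small even after many steps.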

arXiv.org e-Print Archive

Crossref

FP8 Formats for Deep Learning

Author: Burgess Neil
Cornea Marius
Dubey Pradeep
Grisenthwaite Richard
Ha Sangwon
Heinecke Alexander
Judd Patrick
Kamalu John
Mellempudi Naveen
Micikevicius Paulius
Oberman Stuart
Shoeybi Mohammad
Siu Michael
Stosic Dusan
Wu Hao
Publication date: 29/09/2022

FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions. Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models, leaving all the hyperparameters unchanged from the 16-bit baseline training sessions. Our training experiments include large, up to 175B parameter, language models. We also examine FP8 post-training-quantization of language models trained using 16-bit formats that resisted fixed point int8 quantization.
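To make the two encodings concrete, here is a minimal decoding sketch. The bit widths and the special-value rules follow the abstract; the exponent biases (7 for E4M3, 15 for E5M2) and the function names are assumptions that the abstract does not state.

```python
import math


def decode_fp8(byte, exp_bits, man_bits, bias, ieee_specials):
    """Decode one FP8 bit pattern (0..255) into a Python float.

    Layout assumed here: 1 sign bit, then exp_bits, then man_bits.
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1

    if exp == exp_max:
        if ieee_specials:
            # IEEE-style: all-ones exponent encodes infinity or NaN.
            return sign * math.inf if man == 0 else math.nan
        if man == man_max:
            # Extended range: a single mantissa pattern is NaN, no infinities.
            return math.nan
    if exp == 0:
        # Subnormal numbers (and zero): no implicit leading one.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def decode_e4m3(byte):
    return decode_fp8(byte, 4, 3, 7, False)   # bias 7 is an assumption


def decode_e5m2(byte):
    return decode_fp8(byte, 5, 2, 15, True)   # bias 15 is an assumption


if __name__ == "__main__":
    print(decode_e4m3(0x7E), decode_e5m2(0x7B))
```

Under these assumptions, giving up infinities lets E4M3 use the all-ones exponent for ordinary values, which is where its extended dynamic range comes from.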

arXiv.org e-Print Archive

Simulation of reaction-diffusion processes in three dimensions using CUDA

Author: Alexandrov
Anderson
Belleman
Block
Buluc
Castano-Diez
Castets
Che
Costello
Cross
Dabdub
Epstein
Ferenc Izsák
Ferenc Molnár
Ford
Fowler
Garland
Gutiérrez
Horváth
Horváth
Horváth
Huang
István Lagzi
Januszewski
Komatitsch
Komatitsch
Lagzi
Lagzi
Lagzi
Lagzi
Lengyel
Li
Liu
Liu
Lovas
Martin
Melchionna
Micikevicius
Molnár
Nakamasu
NVIDIA Corporation
Owens
Preis
Pápai
Rácz
Róbert Mészáros
Sainio
Sanderson
Sanna
Sanna
Schmidt
Senocak
Shoji
Shoji
Simek
Stone
Stone
Sultan
Volford
Volford
Walsh
Publication venue: 'Elsevier BV'
Publication date: 03/04/2010

Numerical solution of reaction-diffusion equations in three dimensions is one of the most challenging applied mathematical problems. Since these simulations are very time consuming, any ideas and strategies aiming at the reduction of CPU time are important topics of research. A general and robust idea is the parallelization of source codes/programs. Recently, the technological development of graphics hardware created a possibility to use desktop video cards to solve numerically intensive problems. We present a powerful parallel computing framework to solve reaction-diffusion equations numerically using the Graphics Processing Units (GPUs) with CUDA. Four different reaction-diffusion problems, (i) diffusion of chemically inert compound, (ii) Turing pattern formation, (iii) phase separation in the wake of a moving diffusion front and (iv) air pollution dispersion were solved, and additionally both the Shared method and the Moving Tiles method were tested. Our results show that parallel implementation achieves typical acceleration values on the order of 5-40 times compared to CPU using a single-threaded implementation on a 2.8 GHz desktop computer.

Comment: 8 figures, 5 tables
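For orientation, the simplest of the four problems listed above, diffusion of an inert compound, can be sketched as a single-threaded reference step. This is not the paper's CUDA code and implements neither the Shared nor the Moving Tiles method. The explicit Euler scheme, the 7-point stencil, the periodic boundaries and all names are assumptions chosen for brevity.

```python
def diffusion_step(c, d, dt, dx):
    """One explicit Euler step of dc/dt = d * laplacian(c) on a cubic grid.

    c is an n x n x n nested list; boundaries are periodic (an assumption).
    Stability of this scheme requires d * dt / dx**2 <= 1/6.
    """
    n = len(c)
    k = d * dt / dx**2
    new = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for l in range(n):
                # 7-point Laplacian stencil with wrap-around neighbours.
                lap = (c[(i + 1) % n][j][l] + c[(i - 1) % n][j][l]
                       + c[i][(j + 1) % n][l] + c[i][(j - 1) % n][l]
                       + c[i][j][(l + 1) % n] + c[i][j][(l - 1) % n]
                       - 6.0 * c[i][j][l])
                new[i][j][l] = c[i][j][l] + k * lap
    return new


def total_mass(c):
    """Sum of all grid values; conserved under periodic boundaries."""
    return sum(v for plane in c for row in plane for v in row)


if __name__ == "__main__":
    n = 8
    grid = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    grid[4][4][4] = 1.0  # point source in the middle of the box
    for _ in range(10):
        grid = diffusion_step(grid, d=1.0, dt=0.1, dx=1.0)
    print(total_mass(grid), grid[4][4][4])
```

Each grid point is updated independently from the previous time level, which is the property that makes this kind of stencil a natural fit for one-thread-per-cell GPU execution.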

arXiv.org e-Print Archive

Crossref

University of Twente Research Information

MLPerf Inference Benchmark

Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and five orders of magnitude in performance; they range from embedded devices to data-center solutions. Fueling the hardware are a dozen or more software frameworks and libraries. The myriad combinations of ML hardware and ML software make assessing ML-system performance in an architecture-neutral, representative, and reproducible manner challenging. There is a clear need for industry-wide standard ML benchmarking and evaluation criteria. MLPerf Inference answers that call. In this paper, we present our benchmarking method for evaluating ML inference systems. Driven by more than 30 organizations as well as more than 200 ML engineers and practitioners, MLPerf prescribes a set of rules and best practices to ensure comparability across systems with wildly differing architectures. The first call for submissions garnered more than 600 reproducible inference-performance measurements from 14 organizations, representing over 30 systems that showcase a wide range of capabilities. The submissions attest to the benchmark's flexibility and adaptability.

Comment: ISCA 202

arXiv.org e-Print Archive

Crossref

A Full-Depth Amalgamated Parallel 3D Geometric Multigrid Solver for GPU Clusters

Author: Brandt A.
Brandvik T.
Corrigan A.
Cwire
Cwire
Elsen E.
Fan Z.
Goodnight N.
Griebel M.
Gropp W. D.
Göddeke D.
Hempel R.
Kindratenko V.
Matsuoka S.
McBryan O. A.
Micikevicius P.
Owens J.D.
Press W. H.
Schive H.
Showerman M.
Thibault J. C.
Tokyo Institute
Wan D.C.
Publication venue: 'IUScholarWorks'
Publication date: 04/01/2011

Numerical computations of incompressible flow equations with pressure-based algorithms necessitate the solution of an elliptic Poisson equation, for which multigrid methods are known to be very efficient. In our previous work we presented a dual-level (MPI-CUDA) parallel implementation of the Navier-Stokes equations to simulate buoyancy-driven incompressible fluid flows on GPU clusters with simple iterative methods while focusing on the scalability of the overall solver. In the present study we describe the implementation and performance of a multigrid method to solve the pressure Poisson equation within our MPI-CUDA parallel incompressible flow solver. Various design decisions and algorithmic choices for multigrid methods are explored in light of NVIDIA’s recent Fermi architecture. We discuss how unique aspects of an MPI-CUDA implementation for GPU clusters are related to the software choices made to implement the multigrid method. We propose a new coarse grid solution method of embedded multigrid with amalgamation and show that the parallel implementation retains the numerical efficiency of the multigrid method. Performance measurements on the NCSA Lincoln and TACC Longhorn clusters are presented for up to 64 GPUs.

Crossref

Boise State University - ScholarWorks

Seismic Wave Propagation Simulations on Low-power and Performance-centric Manycores

Author: Abdelkhalek
Aochi
Aubry
Bianco
Castro
Christen
Datta
de Dinechin
Dumbser
Dupros
Dupros
Dursun
Emilio Francesquini
Fabrice Dupros
Francesquini
Francesquini
Göddeke
Hideo Aochi
Horowitz
Hähnel
Jean-François Méhaut
Komatitsch
Krueger
Lawson
Lysmer
Martin
Mercier
Michéa
Micikevicius
Morari
Márcio Castro
Pereira
Philippe O.A. Navaux
Pilla
Rajovic
Rashti
Reinders
Rivera
Saenger
Tang
Totoni
Varghese
Virieux
Zhang
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016

The large processing requirements of seismic wave propagation simulations make High Performance Computing (HPC) architectures a natural choice for their execution. However, to keep both the current pace of performance improvements and the power consumption under a strict power budget, HPC systems must be more energy efficient than ever. As a response to this need, energy-efficient and low-power processors began to make their way into the market. In this paper we employ a novel low-power processor, the MPPA-256 manycore, to perform seismic wave propagation simulations. It has 256 cores connected by a NoC, no cache-coherence and only a limited amount of on-chip memory. We describe how its particular architectural characteristics influenced our solution for an energy-efficient implementation. As a counterpoint to the low-power MPPA-256 architecture, we employ Xeon Phi, a performance-centric manycore. Although both processors share some architectural similarities, the challenges to implement an efficient seismic wave propagation kernel on these platforms are very different. In this work we compare the performance and energy efficiency of our implementations for these processors to proven and optimized solutions for other hardware platforms such as general-purpose processors and a GPU. Our experimental results show that MPPA-256 has the best energy efficiency, consuming at least 77 % less energy than the other evaluated platforms, whereas the performance of our solution for the Xeon Phi is on par with a state-of-the-art solution for GPUs.

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server