A Simple GPU-Accelerated Two-Dimensional MUSCL-Hancock Solver for Ideal Magnetohydrodynamics by Dorelli, John C. & Bard, Christopher
A simple GPU-accelerated two-dimensional MUSCL-Hancock solver for ideal
magnetohydrodynamics
Christopher M. Barda,∗, John C. Dorellib,∗∗
aUniversity of Wisconsin, Madison
bNASA-GSFC
Abstract
We describe our experience using NVIDIA’s CUDA (Compute Uniﬁed Device Architecture) C programming environment to im-
plement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics
Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the
GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in
the GPU’s faster shared memory), we achieved a maximum speed-up of ≈ 126 for a 10242 grid relative to the sequential C code
running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known ﬂoating
point performance, memory throughput and parallel processing capacity of the GTX 480.
Keywords: GPU, Magnetohydrodynamics
1. Introduction
The last several years has witnessed a dramatic increase in
the use of Graphics Processing Units (GPUs) to accelerate sci-
entiﬁc computing applications. The primary catalyst for this
surge in GPU computing was NVIDIA’s public release of the
CUDA (Compute Uniﬁed Device Architecture) programming
environment [1] in 2007. The introduction of CUDA provided
the scientiﬁc programmer with easy access to the SIMT (Single
Instruction Multiple Thread) architecture underlying modern
NVIDIA GPUs (previously accessible only through cumber-
some graphics APIs like OpenGL) via an intuitive extension of
ANSI C. The result has been a proliferation of GPU-accelerated
applications in such diverse areas as N-body simulation [2, 3],
signal processing [4], molecular dynamics simulation [5, 6] and
computational ﬂuid dynamics (CFD) [7, 8]. Speedups reported
in the literature depend on the application, but one can expect,
on average, an order of magnitude performance increase ”out
of the box” using CUDA.
Explicit ﬁnite volume schemes are particularly amenable to
GPU acceleration due to the natural manner in which the com-
putational mesh maps to the SIMT architecture. In a typical im-
plementation, chunks of the mesh are handed out by the CPU
to groups of threads executing on the GPU, with each thread
responsible for updating a single computational cell using data
in nearby memory locations. Each cell update involves many
instructions (some combination of arithmetic operations and
memory reads/writes), and the GPU achieves its performance
∗Corresponding author. Tel: +1 301 286 9753.
∗∗Corresponding author. Tel: +1 301 286 9753.
Email addresses: bard@astro.wisc.edu (Christopher M. Bard ),
john.dorelli@nasa.gov (John C. Dorelli )
advantage over the CPU through a combination of parallel exe-
cution (hundreds of threads may execute instructions simultane-
ously on multiple GPU ”cores”) and latency hiding (if a thread
stalls while executing a slow instruction, such as accessing oﬀ-
chip memory, execution may be switched to one of thousands
other ”active threads” ready for execution).
While the speedups of GPU-accelerated ﬁnite volume ﬂuid
codes reported in the literature are impressive (e.g., Schive et al.
[8] report a speedup of ∼100x on a uniform mesh with several
million cells), GPUs still have some signiﬁcant limitations. The
primary appeal of GPU computing is the promise of achieving
the computational equivalent of 10-100 latest-generation CPU
cores with a single graphics card for a fraction of the cost. Un-
fortunately, limited oﬀ-chip memory, absence of cache memory
and poor double precision performance (or no double precision
capability at all) on inexpensive consumer-grade graphics cards
present barriers for the computational scientist interested in ac-
celerating even a moderately sized (say, several million com-
putational cells) CFD problem. While NVIDIA’s Tesla cards
– designed with high performance computing applications in
mind – remove some of these constraints, they are signiﬁcantly
more expensive than the consumer-grade products (e.g., com-
pare the ∼$2500 price tag of the Tesla M2090 to the several
hundred dollar cost of the GTX 480).
GPU programming eﬀort presents another obstacle for the
average computational scientist. While CUDA C has made
programming graphics cards much less esoteric, taking full ad-
vantage of optimization opportunities is not a trivial exercise
and requires more than a cursory knowledge of the underlying
hardware. One must consider whether the investment of sev-
eral programmer-months (or more) is worth the expected per-
formance gain if one already has a code that scales well up
to several thousand CPU cores and/or takes advantage of ex-
Preprint submitted to Journal of Computational Physics November 19, 2013
https://ntrs.nasa.gov/search.jsp?R=20140017815 2019-08-31T15:22:16+00:00Z
isting APIs such as OpenMP and OpenACC to gain moderate
speedups with relatively small amounts of programming eﬀort.
For example, GPU-accelerated ﬁnite volume solvers that use
MPI to exchange ghost cell data residing on separate compute
nodes may not scale gracefully up to hundreds of GPUs, since
the slow rate of data transfer from GPU to CPU across the PCI
bus essentially eliminates the beneﬁt of a fast interconnect like
InﬁniBand. This drawback may be largely overcome, however,
by NVIDIA’s GPUDirect technology, which will support Direct
Memory Access (DMA) between GPUs in separate compute
nodes.
In this article, we describe our eﬀort to port a MUSCL-
Hancock ideal magnetohydrodynamics (MHD) solver to a sin-
gle NVIDIA GTX 480 graphics card. We implement a rela-
tively simple method for GPU speedup that does not require
signiﬁcant knowledge of the underlying GPU hardware. Our
goal is to demonstrate that despite the diﬃculties inherent in
GPU programming, one can achieve, with minimal program-
ming eﬀort, a signiﬁcant performance gain (more than two or-
ders of magnitude compared to the sequential algorithm run-
ning on a single Nehalem core) for a problem of moderate size
(more than a million computational cells) on a consumer-grade
graphics card. We hope that our article will serve as a roadmap
for those seeking to port a standard ﬁnite volume MHD code to
a GPU (or GPU cluster).
2. GTX 480 Architecture and the CUDA Programming
Model
We begin with a short description of NVIDIA’s Fermi ar-
chitecture and the CUDA programming model (see NVIDIA
White Paper [9]) for further details). The GTX 480 was the
natural choice for our experiment since it was the ﬁrst widely
available (for several hundred USD) consumer graphics card to
implement NVIDIA’s Fermi architecture, which is the same ar-
chitecture implemented in the much more expensive high end
Tesla series marketed to the high performance computing com-
munity. The larger number of streaming processors (SP) and
active threads per SP on Fermi GPUs allows for greater par-
allelism and latency hiding. Additionally, Fermi GPUs have
improved memory performance, including L1 and L2 caches,
reducing the need to explicitly stage data in the faster shared
memory.
The GTX 480 consists of 15 Streaming Multiprocessors, 6
GB of GDDR5 RAM (memory bandwidth of 177.4 GB/s), a
768 KB L2 cache, a host interface (through which the GPU
communicates with the host CPU via PCI-express), and a
scheduler that distributes blocks of threads to the SMs. Fig-
ure 1 shows a schematic of a single Streaming Multiprocessor
(SM) on the GTX 480. Each SM consists of 32 Stream Pro-
cessors (SP) (what NVIDIA refers to as ”CUDA cores”), 16
load/store units and 4 special function units. The SM sched-
ules threads for execution in groups of 32 called ”warps”, with
each thread in a warp executing the same instruction. In the
Fermi architecture, there are two such ”warp schedulers” per
SM, each capable of issuing a single-precision instruction (e.g.,
a multiply or add) to 16 of the 32 cores in one shader clock
                  16
Stream Processors (SP)
          (CUDA cores)
                  16
Stream Processors (SP)
          (CUDA cores)
        16
Load/Store
    Units
         4
  Special   
 Function






Warp Scheduler Warp Scheduler
Dispatch Unit Dispatch Unit
Figure 1: A GTX 480 Streaming Multiprocessor consists of
two warp schedulers, each of which can issue a single-precision
instruction to 16 cores every shader clock cycle; thus, the
single-precision instruction throughput is one warp (32 threads)
instruction every shader cycle. The double precision instruction
throughput is a factor of eight slower: 1 warp instruction every
8 shader clock cycles.
2
cycle, thus providing a single-precision throughput of 1 warp
instruction per cycle. For double-precision instructions, a sin-
gle warp scheduler issues the instruction to 16 of the 32 cores
in one shader clock cycle. While the hardware is capable of
providing a double-precision throughput of 1 warp instruction
every two cycles, NVIDIA has intentionally reduced the double
precision throughput of its consumer grade cards (such as the
GTX480) by a further factor of 4. With 15 SMs and a shader
clock speed of 1401 MHz, the overall ﬂoating point speed of
the GTX 480 is 672 GFLOPS for single-precision multiplies or
adds and 84 GFLOPS for double precision multiplies or adds
(double these throughputs for fused multiply-adds).
We turn now to a summary of the CUDA programming
model, illustrated schematically in Figure 2 (see the NVIDIA
CUDA C Programming Guide [10] for further detail). The
main serial code, running on the CPU (host), allocates GPU
memory and launches a kernel that executes on the GPU (de-
vice). The programmer speciﬁes, using a simple syntactical ex-
tension of the familiar C function call, the number of threads
to execute on the GPU by specifying an MxN grid of thread
blocks (yellow squares in Figure 2). Note that thread blocks are
a programmer-deﬁned construct to organize the calculation on
the GPU. When a SM recieves a thread block, it partitions the
block into warps which are then given to the scheduler for exe-
cution. Threads within a block can access a small pool (conﬁg-
urable to either 16 KB or 48 KB per block in Fermi) of shared
memory and can be barrier-synchronized. Thread blocks are
executed asynchronously on the SMs; that is, the programmer
does not have the ability to synchronize the execution of the
threads in diﬀerent blocks (for example, by calling a ”barrier”
function). This asynchronous block execution allows the same
CUDA C source code to work on later-generation GPUs that
can execute more blocks concurrently. However, the ability to
synchronize threads within a block resulted in the design con-
straint (to avoid long wait times) that all threads in a block are
assigned to the same SM.
NVIDIA categorizes its GPUs according to ”compute capa-
bility.” The GTX 480 has a compute capability of 2.0, which
implies the following resource constraints:
1. The maximum number of threads per block is 1024.
2. The maximum number of blocks that can be assigned to a
single SM is 8.
3. The maximum number of warps that can be managed on
a single SM is 48 (implying that the maximum number of
threads on an SM is 1536).
When calling a kernel function, the programmer must decide
carefully how many threads to assign to a block. A general
rule of thumb is that if one’s application is memory bandwidth-
bound, then one should strive for maximum thread occupancy
(i.e., maximize the number of active warps on an SM) to take
advantage of memory latency hiding. One should also set the
number of threads per block to be an integer number of warps
to avoid underpopulated warps (which result in idle execution
resources). While NVIDIA provides an Occupancy Calculator
spreadsheet to facilitate experimentation with diﬀerent thread
block conﬁgurations, using the Occupancy Calculator does not,
as we will see below, guarantee optimal speedup.
3. Implementation of the MUSCL-Hancock scheme for
ideal MHD
Implementing a ﬁnite volume ﬂuid solver in CUDA is
straightforward in principle, since the basic components of a
standard Godunov scheme (reconstruction of edge states, evalu-
ation of edge numerical ﬂuxes using a Riemann solver, and con-
servative evolution of cell-centered quantities) map naturally to
the CUDA SIMT architecture. That is, the components of the
algorithm can be implemented as CUDA kernels in which each
execution thread performs the same sequence of operations on a
single computational cell. In the case of the MUSCL-Hancock
scheme [11], we deﬁne a kernel for each of the following steps:
1) slopes: compute limited slopes at each computational cell, 2)
reconstruction: reconstruct the conserved variables at the cell
edges, 3) evolve1: advance reconstructed edge variables half a
time step using the physical ﬂuxes at the edges, 4) riemann:
compute the numerical ﬂuxes at the cell edges using a Riemann
solver, and 5) evolve2: advance the conserved variables a full
time step using the numerical ﬂuxes computed in the previous
step.
Following Dedner et al. [12], we solve the ”Generalized La-
grangianMultiplier” (GLM) formulation of theMHD equations
to preserve the solenoidal constraint on the magnetic ﬁeld:
∂ρ
∂t




















γ − 1 p + B





+ ∇ · (vB − Bv) + ∇ψ = 0, (4)
∂ψ
∂t




where ρ is the mass density, v is the bulk velocity, p is the
plasma pressure, B is the magnetic ﬁeld, E = ρv2/2 + p/(γ −
1)+B2/2 is the total energy density, and γ is the ratio of speciﬁc
heats (taken to be 5/3 in all of our simulations).
In equations (4) and (5), ψ is a scalar function whose evo-
lution, is, by construction, equivalent to that of ∇ · B; thus,
the parameters ch and cp represent the propagation and dis-
sipation speeds of local magnetic ﬁeld divergence errors. ch
is chosen to be the global maximum (over the grid cells) of
max(|c f+vx|, |c f+vy|, |c f+vz|), where c f is the fast magnetosonic
wave speed. Note that in the 1D case (where x is the dependent
variable) the equations for Bx and ψ are decoupled from the
remaining MHD equations; thus, in the two-dimensional case,
the globally large ch applies only to those x numerical ﬂuxes
associated with Bx and ψ, both of which will typically be very
small. Similarly, only the By and ψ numerical ﬂuxes use the
3
CPU (Host) GPU (Device) CPU (Host) GPU (Device)
Kernel Kernel
Figure 2: In the CUDA programming model, the CPU (”Host”) launches a sequence of ”kernels” on the GPU (”Device”). Each
kernel invocation launches a user-speciﬁed number of thread blocks (yellow boxes), each block containing the same number of
threads. The number of thread blocks and the number of threads/block varies from kernel to kernel, but one should in general strive
to maximize the number of total active threads per SM.
globally large ch. The remaining numerical ﬂuxes are computed
using local wave speeds. Dedner et al. [12] found that setting
c2p/ch = 0.18 produced optimal results regardless of the grid
resolution, and we use the same value in all of our simulations.
We discretize the system (1)-(5) using a second-order
MUSCL-Hancock scheme [11] with a minmod slope limiter to
suppress oscillations near discontinuities, Strang splitting [13]
to maintain second-order temporal accuracy and an HLL ap-
proximate Riemann solver [14] to compute numerical ﬂuxes.
We emphasize that our use of time-splitting was not moti-
vated by any GPU hardware or software constraints; one could
easily implement an unsplit time stepping scheme using the
same basic approach (indeed, we have developed an unsplit
version of this MHD code and intial tests show similar per-
formance gains). Similarly, our choice of divergence clean-
ing is not driven by GPU hardware or software constraints.
Other MHD algorithms which use staggered-mesh formula-
tions to enforce the divergence constraint to machine precision
[15, 16, 17, 18, 19, 20, 21] should also be easily ported to
GPUs.
CUDA C source code for the main driver program
(main.cu) and the subroutine that handles a partial time step
in the X dimension (evolvex) are given in the Appendix. The
rest of the source code is available online with the supplemen-
tary materials.
Appendix B shows the implementation of the CUDA ker-
nel evolvex (the z operator is implemented in a similar man-
ner). evolvex takes as its ﬁrst argument the pointer p cons
– allocated in main.cu using cudaMalloc (see Appendix
A) – to the GPU global memory where the solution vector is
stored. The thread block conﬁguration is speciﬁed by declar-
ing two variables of type dim3, dimGrid and dimBlock,
that specify the dimensions of our two-dimensional block of
threads. (Note that the absence of a third argument in these
declarations deﬁnes the third dimension to be 1.) Recall-
ing that the number of threads in a block should be an inte-
ger multiple of 32 (to avoid underpopulated warps), we spec-
ify a TILE WIDTH × TILE WIDTH array of threads, where
TILE WIDTH is typically set to either 8 or 16 (giving, respec-
tively, 64 and 256 threads per block; see Section 5 for a com-
parison of diﬀerent values of TILE WIDTH). The declaration
of dimGrid speciﬁes the dimensions of our two-dimensional
grid of thread blocks. Here, we set the number of blocks in
each dimension so that the total number of threads is equal to
the number of computational cells (NX× NZ); thus, each thread
is responsible for updating a single computational cell at each
time step. We deﬁne dimGrid and dimBlock the same way
throughout the code (with the exception of the boundary condi-
tion kernels, which only involve those cells at the edges). The
function calls in evolvex map straightforwardly to the steps
of the MUSCL-Hancock algorithm:
1. slopes: compute xslopes
2. reconstruction: d from slope
3. evolve1: d from fg ker
4. riemann: wave speeds ker, physical flux f ker,
hll flux f
5. evolve2: cons from f ker, damp psic ker
Given a cell-averaged solution vector Uni , where i is the cell
index and n is the time index, the ﬁrst step in the MUSCL-
Hancock scheme is to linearly extrapolate Uni to the left/right









where ΔUni is the slope calculated by the minmod slope limiter,
ΔUni = minmod[(U
n
i+1 − Uni ), (Uni − Uni−1)] . (7)
We implement slopes step (7) in the ker-
nel compute xslope ker (the kernel
compute zslope ker extrapolates to the bottom/top
cell interfaces). compute xslope takes as input the GPU
pointer to the solution vector (the vector of conserved cell
4
averages, p cons) and the GPU pointer to the slopes array
(p xslope). Within compute xslope ker, each thread
deﬁnes its location in the grid (stored in the variables a and b)
using both its location within the thread block and the block’s
location within the grid. Each thread calculates the slope at
its cell by accessing the global memory pointed to by p cons
(where index3 is a macro deﬁned in macros.h). Slopes are
then stored in the global memory pointed to by p xslope.
The remaining steps in the MUSCL-Hancock algorithm are
implemented in a similar manner, with each thread responsible
for its own computational cell and all executing threads
performing the same sequence of operations at each time step.
We note here that our implementation of the minmod limiter
involves a number of conditional statements (to check, for ex-
ample, whether the left and right slopes diﬀer in sign) which,
while not optimal in CUDA’s SIMT architecture, is nonetheless
easier to code. Speciﬁcally, if some threads in a warp satisfy the
if block and the rest satisfy the else block, then all threads in
the warp must execute the if and else blocks in succession.
We point out that other Riemann solvers, e.g. Roe-type [22],
HLLC [23, 24] and HLLD [25], incorporate several “if-then”
conditions which will result in a performance hit. However,
we do not anticipate this to be a large eﬀect since most neigh-
boring cells away from a discontinuity will have similar states.
This means that each thread in a warp will more often than not
fulﬁll the same if-then condition within the Riemann solver, re-
ducing the number of instructions each warp has to execute. In
general, though, one should strive to minimize the use of con-
ditionals with divergent threads.
The reconstruction step (eq. 6) is implemented in the ker-
nel d from slope ker, which is invoked in both evolvex
and evolvez. d from slope ker makes use of registers
to limit the number of global memory accesses; the variables
cons value and slope value store the value found in
global memory, which is then used in the subsequent calcula-
tion. While reading this data from global memory and storing it
in registers doubles the number of steps in the kernel, replacing
the global memory reads with much faster register reads results
in a speed-up of ≈ 15%.
After reconstruction, evolve1 advances the extrapolated












[F(URi ) − F(ULi )], (8)
where F(URi ) and F(U
L
i ) are the ideal MHD physi-
cal ﬂuxes evaluated at the left and right interfaces,
respectively. These physical ﬂuxes are computed
in the kernel physical flux f ker (likewise
physical flux g ker) which, like d from slope,
makes use of thread registers to limit global memory accesses.
evolve1 (eq 8) is performed in d from fg ker, which is very
similar in implementation to d from slope ker.
Numerical ﬂuxes at the cell interfaces are computed using
the HLL Riemann solver:
FUni+1/2 =
⎧⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎩
F(URi , ) if 0 ≤ cLi .
cRi+1F(U
R
i )−cLi F(ULi+1)+cLi cRi+1(ULi+1−URi )
cRi+1−cLi
, if cLi ≤ 0 ≤ cRi+1.
F(ULi+1), if 0 ≥ cRi+1.
(9)
where i + 1/2 denotes the interface between cells i and i + 1, cLi
is the minimum wave speed in cell i, and cRi+1 is the maximum
wave speed in cell i + 1. The wave speeds are computed in
kernels wave speeds ker and hll flux f ker.
Finally, evolve2, implemented in kernel
cons from f ker, advances the cell-averaged solution






(Fni+1/2 − Fni−1/2). (10)
As a general rule, one should strive to minimize data transfers
from CPU to GPU to avoid a signiﬁcant PCI bandwidth bottle-
neck. For example, our CUDA implementation of MUSCL-
Hancock involves deﬁning a separate kernel to implement each
basic component of the algorithm (slopes, reconstruction,
evolve1, riemann, and evolve2). Thus, each kernel must re-
ceive input data from the previous kernel and send output data
to the subsequent kernel. Rather than copy this data back and
forth from CPU to GPU (using, for example, cudaMemcpy), it
is much more eﬃcient to allocate and deallocate GPU memory
as needed (using cudaMalloc) and pass the associated point-
ers to the kernels. This approach requires that the entire prob-
lem live on the GPU for the duration of the calculation, thus
limiting the problem size. Increasing the problem size would
require running on multiple GPUs (for example, by combin-
ing CUDA and MPI), taking care to minimize communication
between GPUs on diﬀerent compute nodes. The availablity of
large GPU clusters (e.g. Keeneland and Blue Waters) thus holds
the promise of similar eﬃciency gains with much larger prob-
lem sizes in the weak scaling limit.
A second generic optimization strategy is to favor register
and shared memory over global memory to the extent possi-
ble, taking advantage of the much lower register and shared
memory latency. One must, however, be aware of an important
tradeoﬀ between register and shared memory use and thread
parallelism. Register and shared memory on a particular SM is
divided up among all of the thread blocks on the SM. Exceed-
ing the register or shared memory available on the SM may
greatly reduce the number of active threads since threads can
only be assigned to the SM at the block granularity. Experi-
mentation (on an application by application basis) is required to
determine the proper balance between register and shared mem-
ory use, on the one hand, and thread parallelism on the other.
For example, although the GTX 480 can accommodate 1024
threads per block, we found that the register footprint for our
MUSCL-Hancock implementation limited us to a maximum of
256 threads per block. Interestingly, for problem sizes rang-
ing from 642 to 10242 computational cells, 64 threads per block
seemed to be optimal (for a ﬁxed register footprint). Experi-
mentation will be required for diﬀerent architectures; the best
5
combination of parameters for one particular type of GPU may
not be ideal for other GPUs.
4. Results
On the CPU, the total time, TCPU , required to apply a sin-
gle instruction over the entire computational mesh is simply
the execution latency of the instruction, LE,CPU , multiplied
by the number of mesh cells, Ncells: TCPU = NcellsLE,CPU .
GPUs achieve their performance gains by executing thousands
of threads in parallel, thereby hiding execution latency (which
in memory bound applications is dominated by global memory
latency). That is, a warp scheduler on an SM issues an instruc-
tion to an active warp (i.e., a warp that is ready to receive the
instruction), which then proceeds to make use of whatever SM
resources (arithmetic logic units, load/store units, special func-
tion units, etc.) it needs to execute the instruction. As the warp
executes its instruction, the warp scheduler attempts to issue an-
other instruction to another active warp; latency is completely
hidden when there is always an active warp available to carry
out the next instruction. Note, however, that it takes a ﬁnite
time – the issue latency, LI,GPU – for the scheduler to issue one
instruction to a warp. Here, LI,GPU is deﬁned as the inverse of
the instruction throughput (the number of instructions the SM
can issue per clock cycle). For example, as we reviewed in sec-
tion 2, the single-precision instruction throughput for the GTX
480 is 1 instruction per shader clock cycle so that LI,GPU = 1
cycle (1/1401 μsec for the GTX 480). If LE,GPU is the execu-
tion latency of the instruction (i.e., the time it takes a warp to
complete the instruction), then, assuming that there is always
an active warp ready to receive the next instruction (memory
latency is completely hidden), the total time, TW , for the NW
active warps on an SM to complete the instruction is simply
TW = (NW − 1)LI,GPU + LE,GPU . Since each active thread is
responsible for a single computational cell, the NS M SMs in
our GPU can process 32NWNS M mesh cells in time TW ; thus,
it takes the GPU a time TGPU = TW [Ncells/(32NWNS M)] to pro-
cess all Ncells cells in our computational mesh with the same
instruction. The expected speedup, S , for a sequence of add in-
structions (assuming the maximal number of active threads on






LE,GPU + (NW − 1)LI,GPU (11)
As an example, let us estimate the expected speedup for a
sequence of double precision adds (see the source code in Ap-
pendix E and Appendix F). Timing of the CPU code run-
ning on a single core (using gcc with the ”-O3” option) resulted
in an average execution latency, LE,CPU , of approximately 10
CPU cycles for our 2.8 GHz Intel Nehalem (with array sizes of
1024x1024). Assuming NW = 48 (full occupancy), LI,GPU = 8
shader cycles (recall that NVIDIA reduced the double preci-
sion throughput of the GTX 480 by a factor of 4 below the
1/2 double precision instruction per shader cycle of the HPC-
grade Fermi cards), and LE,GPU ≈ 600 shader clock cycles (the
GTX 480 global memory latency) in (11), we obtain an overall
speedup S ≈ 118 compared to the sequential algorithm running
on a single CPU core. This is close to our maximum speedup
of 126 reported in Table 2. Timing of the corresponding CUDA
kernel in Appendix F gives a speedup S ≈ 112.
We emphasize that our simple estimate does not take into ac-
count other factors that inﬂuence GPU and CPU performance.
Speciﬁcally, (11) is independent of problem size; nevertheless,
we expect the CPU memory latency to increase substantially
for larger problems due to cache misses. Since memory latency
is eﬀectively hidden by the GPU, we expect more impressive
GPU speedups for larger problem sizes. Additionally, we do
not take into account other issues such as the relatively slow
transfer of data between the CPU and GPU or the granularity of
thread block sizes. Our results show that the speedups do de-
pend on the number of threads per block, which indicates that
the block size aﬀects the number of active warps per SM. This
would change the value of NW in the above speedup equation.
These are likely the biggest factors accounting for the diﬀerence
between the estimated speedup (using (11)) and our measured
speedups (Table 2). Finally, to be fair, we should compare our
GPU performance on a parallel implementation of the CPU ap-
plication that makes use of all four Nehalem cores; in this case,
we expect an overall speedup – relative to a parallelized ver-
sion of our MUSCL-Hancock code running on all four CPU
cores – of roughly 118/4 ≈ 30. Additionally, many production
machines have CPUs with 8 to 16 cores; this would result in a
speedup closer to 10.
Ideally, one would like to achieve speedups consistent with
those predicted by equation (11) for a ”real world” applica-
tion (e.g., our MUSCL-Hancock algorithm) involving multiple
kernels executing sequences of instructions consisting of many
operations (not limited to adds and multiplies). To measure
the performance of our GPU-accelerated MUSCL-Hancock
solver, we simulated the Orszag-Tang vortex with the follow-
ing initial conditions (see, for example, Dai and Woodward
[17]): ρ = 25/(36π), p = 5/(12π), vx = − sin(2πz), vy =
0.0, vz = sin(2πx), bx = sin(2πz)/(4π), by = 0.0, and bz =
sin(2πx)/(4π). The simulation was run for 300 time steps in all
test cases (Figure 3 shows the simulation results for a longer
simulation that ran for approximately two Alfve´n times). The
CPU C source used in the test was compiled using gcc-4.4.3 for
two cases: 1) with ”-O3” option enabled (hereafter referred to
as the ”Optimized CPU”, and 2) without the ”-O3” option en-
abled (hereafter referred to as ”Unoptimized CPU”). The sim-
ulation was run on a single Intel Core i7 930 (2.8 GHz) core.
The CPU wall clock time was compared to the wall clock time
of several versions of the CUDAGPU code running on the GTX
480 (see Tables 2 and 3): 1) ”Register...” refers to versions with
TILE WIDTH = {2, 4, 8, 16} that make use of register mem-
ory wherever possible to store intermediate values (the solution
vector, however, remains in GPU global memory for the en-
tire calculation); 2) ”Global...” refers to the CUDA version that
makes extensive use of GPU global memory with minimal us-
age of registers and TILE WIDTH = {2, 4, 8, 16}. The timing
results included disk I/O time; that is, we started our clock at
the beginning of the main program and ended the clock after
the ﬁnal output had been written to disk.
6
Problem Size Unoptimized C Optimized C CUDA
642 13.37 s 6.45 s 0.57 s
1282 73.39 41.80 1.81
2562 484.33 277.73 5.24
5122 2366.45 1476.98 18.27
10242 11488.6 8029.35 63.84
Table 1: Average run times for various problem sizes for an unoptimized ideal MHD C code, an optimized ideal MHD C code, and
a CUDA ideal MHD code. The CUDA code used here is identical to the TILE WIDTH 8 code in Table 2.
Problem Size Register 16 Register 8 Register 4 Register 2
642 0.8 s (8.1) 0.57 s (11) 0.67 s (9.7) 1.2 s (5.4)
1282 2.25 (19) 1.81 (23) 2.04 (21) 4.34 (9.6)
2562 6.52 (43) 5.24 (53) 6.71 (41) 16.13 (17)
5122 22.58 (65) 18.72 (81) 25.19 (59) 68.12 (22)
10242 84.62 (95) 63.84 (126) 90.61 (89) 253.88 (32)
Table 2: Average run times for permutations of our single GPU ideal MHD code. Register number refers to the codes using a
register-heavy approach with the TILE WIDTH set to number. Numbers in parentheses are the approximate speedups relative to
the Optimized C timings in Table 1.
On average, the CUDA code is roughly 59 times faster than
Optimized C; however, it is clear from Table 1 that the speedup
increases with problem size. For example, for a 642 double pre-
cision problem, the CUDA code executes only 11 times faster
than Optimized C, while the 10242 double precision CUDA
code is 126 times faster than Optimized C. The increase in
speedup with problem size likely reﬂects the fact that the CPU
code is memory bound for large problem sizes, so that memory
bandwidth and latency determine the CPU execution latency,
LE,CPU in equation (11); in contrast, the global memory latency
is eﬀectively hidden for large problems running on the GPU.
Note that our CPU code does not implement any special opti-
mizations to overcome the memory bottleneck, e.g. strip min-
ing [26], space tiling [27], or linear loop transformations [28].
It is clear from Tables 2 and 3 that TILE WIDTH = 8 seems
to be the “sweet spot” where the number of threads per block
is large enough to eﬀectively hide global memory latency but
not so large that heavy use of thread registers reduces SM
occupancy (due to block granularity of thread assignments).
From our testing, the register-heavy code is always faster (by
≈ 15 − 25%) than the global memory-heavy codes, keeping
TILE WIDTH constant. Interestingly, the Global 8 code turns
out to be faster than the Register 16 code. Our ﬁnding that using
64 threads per block results in the fastest code is puzzling, given
that the NVIDIA CUDA Occupancy Calculator suggests that
our application can accommodate 256 threads/block (which
should, in principle, result in a greater degree of parallelism
and memory latency hiding). Ultimately, these results empha-
size the importance of experimentation to ﬁnd the right balance
of register use and number of threads per block.
5. Conclusions and Discussion
In this paper, we have described in detail our experience port-
ing a two-dimensional MUSCL-Hancock ideal MHD solver to
a single NVIDIA GTX 480 (Fermi architecture) using CUDA
C. We compared a sequential CPU version of the algorithm –
compiled with gcc-4.4.3 and executed on a single Intel Core i7
930 (2.8 GHz) CPU core – to a parallel version running on a
single GTX 480. We made no special eﬀort to stage data in
the much faster shared memory of the GPU; rather, we exper-
imented with diﬀerent combinations of thread register use and
thread block size to maximize the number of active threads per
Streaming Multiprocessor (SM) (and thereby hide memory la-
tency for our memory bound application). For a double preci-
sion problem with 10242 mesh cells, we achieved a maximum
speedup of roughly 126 (with the CPU source code compiled
using the ”-O3” gcc optimization ﬂag) by making use of thread
registers to store kernel variables whenever possible and setting
the number of threads per block to 64. Interestingly, keeping all
of the data in global memory (with 64 threads per block) was
more eﬃcient than using large amounts of register memory in
combination with a larger number (256) of threads per block.
Several other groups (see, for example, Schive et al. [8],
Wong et al. [29] and references therein) have reported on GPU
implementations of MHD codes. The results of Wong et al.
[29] (hereafter W11) are particularly relevant to our work, since
their algorithm is an explicit ﬁnite volume method on a uniform
mesh (in contrast to the AMR simulation tested by Schive et al.
[8]), and their timing tests included a two-dimensional case on
a GTX 480. For problem sizes of 5122 and 10242, W11 quote
speedups (relative to a single 3.2 GHz Intel Nehalem core) of
320 and 600, respectively. For a 1283 double precision problem
(twice the size of their 10242 problem with speedup of 600),
W11 report a speedup of only 155. They report a speedup of
7
(a) t ≈ 0.1 (b) t ≈ 0.2
(c) t ≈ 0.4 (d) t ≈ 0.8
Figure 3: Evolution of the Orszag-Tang vortex for a grid size of 5122.
260 for a 643 problem (still less than the reported 320 speedup
for a 2D problemwith the same number of computational cells).
These speedups are higher than ours for similarly sized prob-
lems, and none are consistent with the simple estimate based
on equation (11).
We speculate that the discrepancy between our speedups and
those reported by W11 is a result of ineﬃcient memory access
in the 1D and 2D versions of the Fortran code they used for
their CPU runs. Speciﬁcally, it is clear from Table 4 of W11
that the ﬂuidy and ﬂuidz operations are much more expensive
than ﬂuidx (due to less eﬃcient memory access for the y and z
ﬂuid updates), whereas these operations take similar time on the
GTX 480 (where memory latency is eﬀectively hidden). Fur-
ther, W11’s Fortran CPU code calls the Transposition opera-
tion, whereas their CUDA code does not. Similarly, Table 5 of
W11 shows that ﬂuidz is much more expensive than ﬂuidx or
ﬂuidy (which are comparable since the Transposition opera-
tion allows eﬃcient memory access in the x and y dimensions)
in the Fortran CPU code, while all three operations take similar
times in the CUDA code. In contrast, ﬂuidx, ﬂuidy and ﬂuidz
all take similar times in the 3D cases (for which the Transposi-
tion operation allows data in all three dimensions to be accessed
eﬃciently).
Thus, it appears that W11’s CPU implementation uses a ﬂuid
update algorithm which is signiﬁcantly less eﬃcient than the
GPU algorithm for 1D and 2D problems. In the 643 case re-
ported in their Table 6, if we ignore the Transposition opera-
tion (not included in W11’s GPU implementation), their GTX
480 single precision speedup is 97 for a problem size of 643.
Using W11’s reported (in their Table 3) double vs. single pre-
cision run time of 1.3973 for their 643 problem, their double
precision speedup is roughly 69, which is much closer to the
double precision speedup (over the optimized C code, compiled
with the ”-O3” gcc option) of 67 we observe for our ”Global 8”
5122 problem (with the same number of computational cells) in
Table 3. Thus, we are conﬁdent that the speedups we report in
this paper are consistent with those reported by W11.
Though these results are encouraging for applications that
ﬁt on a single GPU, real world applications are much larger
than 10242 and generally must be run on thousands of com-
pute nodes. The newest supercomputers (e.g. Keeneland, Blue
Waters, Titan) are being created with a hybrid architecture in
which each node consists of a multi-core CPU and one or more
GPUs. The method presented here is chieﬂy based on minimiz-
ing the amount of data being transfered between the CPU and
GPU, so the natural extension to many nodes is to minimize
both the intra- and inter-node communication. In our approach,
each thread on the GPU is responsible for one computational
8
Problem Size Global 16 Global 8 Global 4 Global 2
642 0.95 s (6.8) 0.63 s (10) 0.82 s (7.9) 1.59 s (4)
1282 2.74 (15) 2.09 (20) 2.52 (17) 5.93 (7)
2562 8.51 (33) 6.16 (45) 8.76 (32) 22.825 (12)
5122 29.92 (49) 22.06 (67) 33.36 (44) 97.46 (15)
10242 113.88 (71) 77.34 (104) 120.31 (67) 357.69 (22)
Table 3: Average run times for permutations of our single GPU ideal MHD code. Global number refers to the codes using a global
memory-heavy approach with the TILE WIDTH set to number. Numbers in parentheses are the approximate speedups relative to
the Optimized C timings in Table 1.
cell; this allows us to divide the overall simulation space among
multiple GPUs since individual cell calculations are indepen-
dent from other cell calculations. In this case, the only inter-
GPU communication required is the boundary cell states prior
to the time evolution and the timestep. Currently, passing data
between GPUs in diﬀerent compute nodes represents a signif-
icant communication bottleneck. We note that with CUDA 5,
however, NVIDIA has implemented “GPUDirect”, which en-
ables direct communication between GPUs in a cluster rather
than routing the data through their associated CPUs with MPI.
Though we have presented results indicating that a single-
GPU code shows signiﬁcant speedup over a single-CPU code,
it is not yet clear whether a multi-GPU code will oﬀer the same
speedup relative to a multi-CPU code. It is possible, however,
that an explicit ﬁnite volume MHD solver might achieve good
weak scaling up to thousands of GPUs using a message pass-
ing approach in which only ghost cell data is communicated
between GPUs (with the rest of the solution remaining on the
GPUs for the entire calculation).
In summary, we have demonstrated that porting a non-trivial
ﬁnite-volume algorithm (a second-order MUSCL-Hancock
ideal MHD solver) to a modern (compute capability 2.0 or
higher) GPU can yield a signiﬁcant speedup compared to a sin-
gle Nehalem core – comparable to that predicted by a simple
estimate based on the arithmetic throughput, memory latency
and parallel processing capacity of the GPU – with minimal
programming eﬀort. While some experimentation with diﬀer-
ent combinations of thread register use and thread block size is
useful, one can in general expect speedups of ≈ 100 for moder-
ately large double precision problems (> 10242 computational
cells). Future work will focus on comparing a multi-GPU MHD
code with a similar multi-CPU MHD code in order to delineate
the beneﬁts and drawbacks of GPUs vs. CPUs in the high-
performance supercomputing realm. For now, we are conﬁdent
in recommending GPUs as a viable speedup tool for small- to
moderate-sized simulations (< 106 cells).
Acknowledgment
The authors would like to thank X. Feng and H.C. Wong for
helpful discussion regarding their timing results.
CB was supported by the Delaware Space Grant program
(NASA Grant NNG05G092H) and the NASA GSRP Fellow-
ship (NASA Grant NNX11AK70H).











int n, nsteps, nskip;
double cfl, time, dt;
double wspeed_max;
FILE *output = fopen(FILEOUTPUT, "w");






//write to aux file





















time = time + 2.*dt;
fprintf(aux, "%d %.10f %.10f\n",
n+1, 2.*dt, time);
printf("%d %f %f\n", n+1,2.*dt,time);











Appendix B. GPU evolvex
void evolvex(double* p_cons, double dt,
double ch){
double dx = LX/(NX-2);
int size_spd = NX*NZ*3*sizeof(double);










































































int a,b,c,n, nsteps, nskip;
double cfl, time, dt, dx, dz;




FILE *aux = fopen(FILEAUX, "w");
double* cons = (double *)calloc(NX*NZ*NVAR,
sizeof(double));
//write to aux file
fprintf(aux, "%d %d\n", NX,NZ);



















movie_dt = TOTAL_TIME/(FRAME_RATE * MOVIE_TIME);









time = time + 2.*dt;
fprintf(aux, "%d %.10f %.10f\n",
n+1, 2.*dt, time);













Appendix D. CPU evolvex
void evolvex(double* cons,
double dt, double ch){
int a,b,c;
double dx = LX/(NX-2);














































































































for (i = 0; i<SIZE; i++){




double time = end_time - start_time;
double latency = time/(SIZE)*2.801e9;
double mflops = SIZE/(time*1e6);
printf("Execution time: %.20f\n", time);
printf("Execution latency: %.20f\n", latency);





















__global__ void add_kernel(int tile_width,






int nx = 1024;
int ny = 1024;
12
int tile_width = 16;
int SIZE = nx*ny;
int N_SM = 15;
int N_T = 32;
int N_W = 48;













double time = end_time - start_time;
/* Note that this measures combined
execution+issue latency on GPU */
double latency=N_T*N_W*N_SM*time/SIZE*1401e6;
double mfl = n/((end_time - start_time)*1e6);
printf("Execution time: %.20f\n", time);
printf("Execution latency: %.20f\n", latency);














__global__ void add_kernel(int tile_width,




int i = blockIdx.x*tile_width + threadIdx.x;
int j = blockIdx.y*tile_width + threadIdx.y;
int idx = ny*i + j;
a[idx] = b[idx] + c[idx];
}
References
[1] D. B. Kirk, W. mei W. Hwu, Programming Massively Parallel Processors,
Morgan Kaufmann, 2010.
[2] R. G. Belleman, J. Be´dorf, S. F. P. Zwart, High performance direct grav-
itational n-body simulations on graphics processing units ii: An imple-
mentation in cuda, New Atronomy 13 (2008).
[3] E. Gaburov, J. Be´dorf, S. P. Zwart, Gravitational tree-code on graphics
processing units: Implementation in cuda, Procedia Computer Science 1
(2012) 1119–1127.
[4] C. Harris, K. Haines, L. Stavely-Smith, Gpu accelerated radio astronomy
signal convolution, Exp. Astron. 22 (2008) 129–141.
[5] J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco,
K. Schulten, Accelerating molecular modeling applications with graphics
processors, J. Comput. Chem. 28 (2007) 2618–2640.
[6] M. S. Friedrichs, P. Eastman, V. Vaidyanathan, M. Houston, S. Legrand,
A. L. Beberg, D. L. Ensign, C. M. Bruns, V. S. Pande, Accelerating
molecular dynamic simulation on graphics processing units, J. Comp.
Chem. 30 (2009) 864–872.
[7] T. Brandvik, G. Pullan, Acceleration of a 3d euler solver using commodity
graphics hardware, 2008. 46th AIAA Aerospace Sciences Meeting and
Exhibit.
[8] H.-Y. Schive, Y.-C. Tsai, T. Chiueh, Gamer: A graphic processing unit
accelerated adaptive-mesh-reﬁnement code for astrophysics, Astrophys.
J. Suppl. 186 (2010) 457–484.
[9] NVIDIA White Paper, NVIDIA’s Next Generation CUDA Compute Ar-
chitecture: Fermi, Technical Report, NVIDIA Corporation, 2009.
[10] NVIDIA CUDA Documentation, CUDA C Programming Guide Version
4.2, NVIDIA Corporation, 2012.
[11] B. van Leer, On the relation between the upwind-diﬀerencing schemes of
godunov, enquist-osher and roe, SIAM J. Sci. Statist. Comput. 5 (1984)
1–20.
[12] A. Dedner, F. Kemm, D. Kro¨ner, C.-D. Munz, T. Schnitzer, M. Wesen-
berg, Hyperbolic divergence cleaning for the mhd equations, J. Compu-
tational Phys. 175 (2002) 645–673.
[13] G. Strang, On the construction and comparison of diﬀerence schemes,
SIAM J. Numerical Anal. 5 (1968) 506–517.
[14] E. F. Toro, Riemann solvers and numberical methods for ﬂuid dynamics:
A practical introduction, Springer-Verlag, 1999.
[15] C. R. Evans, J. F. Hawley, Simulation of magnetohydrodynamic ﬂows -
A constrained transport method, Astrophys. J. 332 (1988) 659–677.
[16] D. S. Balsara, D. S. Spicer, A Staggered Mesh Algorithm Using High
Order Godunov Fluxes to Ensure Solenoidal Magnetic Fields in Mag-
netohydrodynamic Simulations, Journal of Computational Physics 149
(1999) 270–292.
[17] W. Dai, P. R. Woodward, On the divergence-free condition and conserva-
tion laws in numerical simulations for supersonic magnetohydrodynamic
ﬂows, Astrophys. J. 494 (1998) 317–355.
[18] D. Ryu, F. Miniati, T. W. Jones, A. Frank, A Divergence-free Upwind
Code for Multidimensional Magnetohydrodynamic Flows, Astrophys. J.
509 (1998) 244–255.
[19] P. Londrillo, L. del Zanna, On the divergence-free condition in Godunov-
type schemes for ideal magnetohydrodynamics: the upwind constrained
transport method, Journal of Computational Physics 195 (2004) 17–48.
[20] J. M. Stone, T. A. Gardiner, P. Teuben, J. F. Hawley, J. B. Simon, Athena:
A New Code for Astrophysical MHD, Astrophys. J., Suppl. Ser. 178
(2008) 137–177.
[21] D. Lee, A. E. Deane, An unsplit staggered mesh scheme for multidimen-
sional magnetohydrodynamics, Journal of Computational Physics 228
(2009) 952–975.
[22] P. L. Roe, Approximate Riemann Solvers, Parameter Vectors, and Diﬀer-
ence Schemes, Journal of Computational Physics 43 (1981) 357–372.
[23] E. F. Toro, M. Spruce, W. Speares, Restoration of the contact surface in
the HLL-Riemann solver, Shock Waves 4 (1994) 25–34.
[24] P. Batten, N. Clarke, C. Lambert, D. M. Causon, On the choice of
13
wavespeeds for the hllc riemann solver, SIAM J. Sci. Comput. 18 (1997)
1553–1570.
[25] T. Miyoshi, K. Kusano, A multi-state HLL approximate Riemann solver
for ideal magnetohydrodynamics, Journal of Computational Physics 208
(2005) 315–344.
[26] S. Carr, K. Kennedy, Compiler blockability of numerical algorithms, in:
Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, Su-
percomputing ’92, IEEE Computer Society Press, 1992, pp. 114–124.
[27] M. Wolfe, Iteration space tiling for memory hierarchies, in: Proceed-
ings of the Third SIAM Conference on Parallel Processing for Scientiﬁc
Computing, Society for Industrial and Applied Mathematics, 1989, pp.
357–361.
[28] W. Li, K. Pingali, Access normalization: Loop restructuring for numa
computers, ACM Trans. Comput. Syst. 11 (1993) 353–375.
[29] H.-C. Wong, U.-H. Wong, X. Feng, Z. Tang, Eﬃcient magnetohydrody-
namic simulations on graphics processing units with cuda, Comp. Phys.
Comm. 182 (2011) 2132–2160.
14
