In addition to hardware wall-time restrictions commonly seen in high-performance computing systems, it is likely that future systems will also be constrained by energy budgets. In the present work, finite difference algorithms of varying computational and memory intensity are evaluated with respect to both energy efficiency and runtime on an Intel Ivy Bridge CPU node, an Intel Xeon Phi Knights Landing processor, and an NVIDIA Tesla K40c GPU. The conventional way of storing the discretised derivatives to global arrays for solution advancement is found to be inefficient in terms of energy consumption and runtime. In contrast, a class of algorithms in which the discretised derivatives are evaluated on-the-fly or stored as thread-/process-local variables (yielding high compute intensity) is optimal both with respect to energy consumption and runtime. On all three hardware architectures considered, a speed-up of ∼ 2 and an energy saving of ∼ 2 are observed for the high compute intensive algorithms compared to the memory intensive algorithm. The energy consumption is found to be proportional to runtime, irrespective of the power consumed and the GPU has an energy saving of ∼ 5 compared to the same algorithm on a CPU node.
Introduction
As the performance of high-performance computing (HPC) systems tends towards exascale, the characteristics of hardware are not shaped by floatingpoint operation (FLOP) count and memory bandwidth alone. Energy costs continue to rise in an era where climate change is a factor in many decision making processes, and power consumption and energy efficiency are therefore prominent concerns of hardware manufacturers and end-user software developers alike. According to a report by the US Department of Energy [1] , power consumption of a potential exascale machine is predicted to increase by a factor of three relative to HPC system designs in 2010. Indeed, an exascale-capable machine built with the technology existing in 2010 would be expected to cost more than $2.5 billion USD in power expenditure alone [1] and consume at least a gigawatt of power [2, 3] .
The need for improved energy efficiency in HPC has led to the Green500 list [4] . In addition to listing the raw compute capability of a system, Green500 ranks systems by their power efficiency via the FLOPS/Watt metric. The impact of heterogeneous systems which combine CPU and accelerators is clear to see. Despite the improvements brought about by new architectures, a 2013 review of the Green500 list [5, 6] concluded that the goal of a 20MW exaflop machine in 2020 by the Defence Advanced Research Projects Agency (DARPA) would not be met with current hardware trends. A machine of this standard would require 1000 times the performance of a current petascale system, but at equivalent levels of power consumption.
Graphical Processing Unit (GPU) and Many Integrated Core (MIC)-based architectures, have augmented conventional CPU-based systems and are currently a popular choice to increase the FLOP count. However there is uncertainty in the future of architecture design for HPC machines.
From the perspective of numerical simulations, it is quite probable that in addition to the hardware wall-time restrictions commonly seen in institutional, national and international HPC systems, simulations will also be constrained by energy budgets. It is therefore paramount that numerical codes and their underlying solution algorithms can be optimised in both respects. However, while such improvements can be made on both a hardware and a software level, these can often entail extensive re-writes of the codebase. Drastic changes may be required in order to target a different backend Application Programming Interface (API) and introduce optimisations for a particular piece of hardware.
In the approach to exascale computing, recent research has looked at the energy efficiency of various codes and architectures (see e.g. [7] ). The energy efficiency of sparse matrix multiplication on a GPU, Intel Xeon Phi and Field Programmable Gate Arrays (FPGAs) was investigated by [8] . The FPGAs and accelerators were found to have excellent energy efficiency for computing sparse matrix kernels, suggesting that off-loading certain tasks to specialised hardware components could be a viable option in future systems.
Matrix multiplication benchmarks offer some insight into the energy efficiency of different architectures, but are not always representative of largescale codes comprised of multiple components and different workloads. Energy efficiency studies of large-scale scientific codes for N-body molecular dynamics simulations have been investigated by [9, 10] . They concluded that any reduction in runtime gives proportional energy savings.
The use of low-power Advanced RISC Machine (ARM)-based processors in the context of fluid dynamics simulations with spectral element, finite element and Lattice-Boltzmann-based codes was investigated by [11] , focussing on the trade-off between time-to-solution and energy-to-solution. They found that the ARM processors are more energy efficient than traditional x86-based CPUs.
The energy saving of a second-order 3D finite-difference code for solving the acoustic diffusion equation (a homogeneous parabolic partial differential equation) on an Intel Xeon CPU, Intel Xeon Phi co-processor and two NVIDIA GPU cards compared to a sequential implementation has been investigated by [12] . They concluded that the energy consumed by the GPU is the lowest and CPU is highest, with the Phi in between. They also concluded that energy consumption depends on the runtime of the simulations. Techniques to further improve energy efficiency for GPU architectures (e.g. better use of caches and improved CPU-GPU workload division) are surveyed in [1] .
Besides changing the architectures it is also possible to change the programming model. A comparison of parallel programming paradigms (OpenCL, OpenACC, OpenMP and CUDA) was studied by [13] , assessed on their performance and energy consumption on a variety of architectures. For each application considered, the execution time was the dominant factor in the overall energy consumption, with implementations offering the fastest runtime once again yielding greater energy efficiency.
Finite difference methods for the solution of partial differential equations are used in many areas of science and engineering, such as wave propaga-tion modelling, acoustic diffusion, heat conduction, fluid dynamics, micro magnetics, seismic imaging, plasma flows, electromagnetics, and magnetohydrodynamics (MHD) flows to name a few. Typically large multi-dimensional systems of partial differential equations (PDE) need to be solved. Recent effort in optimising such stencil-based codes has focussed on designing efficient solvers with the help of tiling, and improving data locality, etc. For example, the reader is referred to [14, 15] and the references therein. Most of these works involve optimising the existing codebases, however it is also possible to introduce more radical algorithmic changes. Recent work [16] using the automatic source code generation capabilities of the OpenSBLI framework [17] has demonstrated that varying the computational and memory intensity of explicit finite difference algorithms for solving the compressible Navier-Stokes equations in three dimensions on a representative test case, has a significant impact on simulation runtime on an Intel Ivy Bridge CPU node. However, not much is known about how these algorithms behave across various architectures with respect to their energy efficiency and power consumption. This paper investigates the performance, in terms of both runtime and energy efficiency, of six algorithms (five of which are from [16] ) on an Intel Ivy Bridge CPU node, an NVIDIA K40c GPU, and an Intel Xeon Phi Knights Landing processor (denoted KNL for brevity from here on in) for a compressible Navier-Stokes test case. Section 2 introduces the numerical method and the algorithms under consideration, along with the automatic source code generation framework (OpenSBLI) used to generate C code implementations for each algorithm. In Section 3, a description of the various backend programming models and the hardware used is given, along with a brief discussion of how the power consumption and energy usage were measured. The results are presented in Section 4 and demonstrate that the algorithms requiring minimal memory access are consistently the best-performing algorithms across the three different architectures. The paper closes with some concluding remarks in Section 5.
Methodology
The governing equations are the non-dimensional compressible NavierStokes equations with constant viscosity, given by
and
for the conservation of mass, momentum and energy, respectively. Note that the convective terms are written in the skew-symmetric formulation of [18] and the Laplacian form of the viscous terms is used to improve the stability of the equations [19, 20, 21] . Subscripts are the Einstein indices and repeated indices imply summation. The quantity ρ is the fluid density, u i is the velocity vector, p is pressure, E the total energy, Re is the Reynolds number, T is temperature, M is the Mach number, Pr is the Prandtl number, γ is the ratio of specific heats and x i is the coordinate system. The pressure and temperature relations are given by,
respectively.
Spatial discretisation
The spatial derivatives in the governing equations are evaluated using a fourth-order central finite-difference method. The first and second order partial derivatives of an arbitrary function (f ) in the x 0 coordinate direction, with a grid spacing of h 0 are given by,
where i is now the grid point's index. The numerical derivatives of the function in the other coordinate directions can be evaluated in a similar fashion. As periodic boundary conditions are used in the present work, no special treatment of the grid points on the boundary are required.
Temporal discretisation
The variables that are advanced in time ( Q = [ρ, ρu i , ρE] T ) are referred to as the solution variables. In the present work a three-stage low storage Runge-Kutta method is used to advance the solution variables in time; this can be written as
where R( Q) represents the numerical evaluation of the spatial derivatives (referred to as the residual), n is the time-level, ∆t is the timestep, k = 1,2,3 is the sub-stage of the Runge-Kutta timestepping loop, and the solution vector at the next time-level (n+1) is Q n+1 = Q 3 .
Algorithms
Pseudo-code for the spatial and temporal discretisation of the governing equations is outlined in Figure 1 . Most of the computational time is spent evaluating the residual. This consists of (a) evaluating the primitive variables (u i , P, T ), (b) evaluating the spatial derivatives using equations 6 and 7 (63 derivatives are to be computed), and (c) to compute the residual of the equations. The algorithms considered in this work differ in the way the spatial derivatives are evaluated and the residuals are computed. The other steps are evaluated the same way across all the algorithms, as in [16] . Note that all the steps in Figure 1 are applied on the entire domain (grid size) except the application of periodic boundary conditions, which are applied over each boundary face.
The different algorithms are outlined below. The source code for the implementation of each algorithm is generated automatically using the OpenS-BLI framework [17] .
apply-periodic-boundary end for // runge-kutta-substep end for // timestep 
Baseline (BL)
The baseline algorithm follows the conventional approach frequently used in large-scale finite difference codes [21] . For example, to evaluate the derivative of a 3D array (f) in the x 0 coordinate direction, a typical Fortran implementation would look like:
r e a l : : f a c t x i n t e g e r : : This subroutine takes any arbitrary array f as input and the derivative is evaluated by looping over all the grid points (nx0, nx1, nx2) in the domain by applying equation 6 and stored to the global 'work array' df.
Due to ease of programming such individual subroutines are typically written for the first and second derivatives in different coordinate directions. In order to evaluate the 63 derivatives in the equations, a subroutine is called depending on the coordinate direction and order of the derivative, and the output is stored into arrays. Some of the derivatives in the equations require a combination of variables, (for example, ∂(ρu 2 * u 0 )/∂x 0 ); first the variable ρu 2 * u 0 is evaluated into a temporary array, which is then passed as an input to the subroutine to obtain the derivative and store it to an array. Once all the spatial derivatives are evaluated and stored into arrays, the residual of the equation is computed by replacing the continuous derivatives with their corresponding arrays.
The implementation of the BL algorithm in OpenSBLI follows this approach, and requires (without any reuse of arrays) 63 arrays for the derivatives. This is the least computationally intensive algorithm that is tested.
A variant of this algorithm was considered in order to reduce the number of arrays used. This was done by splitting up the spatial terms in the governing equations into two groups: The terms that feature the Reynolds number Re as one group and the remaining terms as another. The residuals are evaluated by adding the residual of each individual group. This version resulted in a reduction in the number of arrays to 40. However, as the reduction in runtime was less than 2% and the optimisations being specific to the compressible Navier-Stokes equations, they are not considered in the present work.
Intermediate storage algorithms
Two variants of the BL algorithm that reduce the number of arrays by increasing the computational intensity of the algorithm were considered. As the components of the velocity gradient tensor (∂u i /∂x j ) are reused across different equations, these 9 derivatives are evaluated in a similar fashion to the BL algorithm (i.e. calling separate subroutines and storing the derivative to an array). For the remaining spatial derivatives, two different options are considered, resulting in the following algorithms.
Recompute Some (RS)
In the RS algorithm the remaining spatial derivatives are evaluated onthe-fly directly in the residual. For example, consider the residual (right hand side) for equation 1. In the RS algorithm the remaining derivatives are replaced by their finite-difference stencil (equation 6) which is evaluated onthe-fly, as opposed to evaluating the derivative to an array for later retrieval. In other words, the derivatives that are not stored must be recomputed each time regardless of whether they appear multiple times across the compressible Navier-Stokes. The pseudocode for the evaluation of the residuals is given in Figure 2 . This algorithm results in lengthy expressions for the residuals.
Residual (mass) = discretisation for ∂ρu j /∂x j + ρ (arrays of ∂u j /∂x j ) + discretisation for u j ∂ρ/∂x j Residual (momentum) = ... Residual (Energy) = ... 
Store Some (SS)
In the SS algorithm, the derivatives (except for the velocity derivatives ∂u i /∂x j ) are evaluated on-the-fly and stored as thread-/process-local variables. These variables can then be re-used across different equations. The residuals are computed using a combination ofthese local variables and the stored arrays for the velocity gradient evaluation. This algorithm requires storing 54 thread-/process-local variables, as opposed to 54 global arrays in BL. The computational intensity of this algorithm is lower than RS algorithm. No attempt was made to sort the evaluation of local variables (e.g. grouping all the derivatives based on coordinate direction or grouping the derivatives according to velocity components).
Minimal storage algorithms
We now consider variants of the BL algorithm that do not store any partial derivatives as arrays. They are either recomputed on-the-fly, or evaluated and stored as thread-/process-local variables. No extra arrays are used in these algorithms and three different versions are considered.
Recompute All (RA)
This algorithm follows a similar approach to that of the RS algorithm. The difference is that the first derivatives of velocities and their dependant mixed-derivatives are also recomputed on-the-fly. This results in one single large expression to evaluate each residual and all the residuals are evaluated in a single subroutine.
Store None (SN)
All the partial derivatives are evaluated on-the-fly and stored to 63 thread-/process-local variables, which can be re-used across the equations. Similar to SS algorithm, the evaluation of derivatives are not sorted. The residuals are computed using these local variables.
Store None version 2 (SN2)
This follows the same approach as the SN algorithm. However, the derivatives are grouped based on velocity components such that all the derivatives using u i are grouped based on the index i, and the remaining derivatives are placed in another group. First, the groups that contain velocity derivatives are evaluated in the order of the index i, and then the non-velocity group is evaluated. The residuals are computed in a similar fashion to that of the SN algorithm. This algorithm not reported in [16] is used to investigate the effects of data locality on different architectures, as a GPU's performance is known to be affected by data locality [1] .
Source code generation for various architectures
Newer hardware has the potential to reduce the runtime of existing numerical modelling software. However, most models are not in a position to readily exploit such architectures to their full potential. Porting often requires a non-trivial rewrite of the source code and newer architectures might also arrive during this process, resulting in further challenges for the numerical modeller. In an effort to future-proof models in the face of evolving hardware, a new finite difference modelling framework OpenSBLI, was introduced by [17] . OpenSBLI automatically generates C code (which performs the discretisation, computes the solution and diagnostic fields, etc) from a high-level problem specification, using basic components of the symbolic Python package SymPy [22] . As the numerical discretisation is performed symbolically by the OpenSBLI framework, generating the source codes for the different algorithms requires high-level information, such as the derivatives to store (e.g. the velocity components for the SS and RS algorithms), and whether to evaluate derivatives to thread-/process-local variables on-the-fly or recompute them. The ease at which the algorithms can be modified is one of the advantages of the automatic source code generation framework.
To target different back end architectures, we use the source-to-source translation capabilities of the OPS library [23, 24, 25] to tailor the generated code towards different parallel hardware backends, such as MPI for CPUs, CUDA and OpenCL for GPUs, and OpenACC for heterogeneous hardware configurations. OpenSBLI writes out C code that is compliant with the OPS API. The backend models used in the current work are described in section 3.
A summary of the algorithms is presented in Table 1 with the operation count per timestep. This is obtained by counting the number of operations for each expression in the algorithm using the count ops functionality in SymPy and adding them together. This sum is then multiplied by the total number of grid points (256 3 used in this work) and the number of sub-stages in a Runge-Kutta iteration (3 in the current work) to arrive at the operation count per timestep. Operation count is an indication of the compute intensity of the algorithms. It should also be noted that, as different compilers optimise differently, the actual operation count performed by the resulting binary executable would vary. Expensive divisions are avoided by evaluating rational numbers and the inverse of constants at the start of the program. A complete list of the optimisations performed is reported in [16] .
BL RS SS

Hardware and execution
OpenSBLI uses OPS's source-to-source translation capabilities to tailor and optimise the algorithm for their execution on various architectures. On the CPUs and KNLs a variety of programming models such as MPI, MPI+OpenMP, OpenACC, inlined MPI, and Tiled may be used. As it is beyond the scope of this paper to compare different programming models, we select the MPI model on the CPUs and KNL. This yields a typical manuallyparallelised solver with a similar performance for parallelisation as reported in [23] .
Similarly, for GPUs one can use CUDA, OpenCL, OpenACC or MPI CUDA for multiple GPUs. Our experience [26] is that CUDA performs as well as OpenCL, so the CUDA backend is used for the execution of algorithms on GPUs.
The CPU simulations are performed on the UK National Supercomputing Service (ARCHER). This is a Cray XC30 system with each CPU compute node comprising two Intel 12-core E5-2697 v2 Ivy Bridge CPUs, each with a clock frequency of 2.7 GHz and 64 GB of RAM shared between them [27] .
The KNL simulations were also performed on ARCHER since it comprises of Intel Xeon Phi Knights Landing (KNL) 7210 processors (with a clock frequency of 1.30 GHz, 16 GB of high-bandwidth MCDRAM, and 96 GB of DDR4 RAM) with 64 cores.
In the case of the GPU runs, all simulations were performed on a single NVIDIA Tesla K40c installed on a local desktop computer. This comprises 2,880 CUDA stream cores, and runs at a clock speed of 745 MHz with 12 GB of on-board memory [28] .
Performance measurement
For measuring the performance and energy consumption by the algorithms on various architectures, the performance measurement libraries for those architectures were used. On the CPU and KNL, the PAT MPI library (v 2.0.0) [29] was used. PAT MPI library uses the Cray PAT Application Programming Interface (API), giving easy control and requiring minimal modifications to the source code to collect the performance parameters. It should be noted that the actual performance measurement is done by the Cray PAT library and PAT MPI library calls ensure that the number of MPI processes reading the hardware performance counters is minimised, and the master (rank 0) collects and outputs the performance data [29] .
Using the PAT MPI library involves adding calls to the pat mpi monitor in the source code at the point where performance measurements should be taken and the quantities that are to be monitored are controlled by a Makefile. More details can be found in [29, 30, 31] .
In the present work, for all the algorithms on CPU and KNL, pat mpi monitor was added at the start and end of each timestep. The simulation runtime, instantaneous power (W) and cumulative energy (J) were recorded at the pat mpi monitor function locations, as shown in Listing 2. The measurement resolution on the XC30 system was 0.1 seconds [29] which was fine enough for the time between iterations in the present simulations. f o r ( i n t i t e r =0; i t e r < n i t e r ; i t e r ++){ // main time l o o p p a t m p i m o n i t o r ( 1 ) ; f o r ( i n t s t a g e =0; s t a g e < 3 ; s t a g e ++){ / * E va l ua te t h e r e s i d u a l s . .
* / } p a t m p i m o n i t o r ( 2 ) ; } Listing 2: Pseudocode showing the locations of energy monitoring calls
On the GPU the NVIDIA Management Library (NVML) [32] was used. The function nvmlDeviceGetPowerUsage was inserted into the code at the start and end of each iteration to measure the power consumption.
Compilation
After the insertion of performance measurement calls, the code was tailored to the target hardware backends using OPS's source-source translation capabilities. For the CPU and KNL architectures the Intel C/C++ compiler (v 17.0) was used, and the NVIDIA C compiler (v 8.0.44) was used for the GPU. On all architectures we used the -O3 compiler optimisation flag combined with architecture-specific flags (-xHost, -xMIC-AVX512 and sm 35 compute architecture for CPU, KNL and GPU, respectively) to generate binary code specific to each architecture. These are summarised in 
Results
To evaluate the performance of the algorithms, a 3D compressible TaylorGreen vortex problem was simulated. This is a suitable and widely-used test case as it has simple initial and boundary conditions, but is complex enough to thoroughly validate the numerical method and its accuracy. It starts with a sinusoidal velocity field and develops into turbulent flow generating smaller and smaller vortices. The equations were solved in a 3D cube (0 ≤ x 0 ≤ 2πL, 0 ≤ x 1 ≤ 2πL, and 0 ≤ x 2 ≤ 2πL) with periodic boundary conditions in all directions [33, 34] . The initial conditions are given by The simulation was set-up using 256 grid points in each direction. The BL, RS, SS, RA and SN algorithms are validated by [16, 17] using a Cray compiler on the ARCHER CPU node. The results of the algorithms using the current compilers on different architectures are compared to the solution dataset of [35] . The minimum and maximum error in the solution variables between [35] and the current runs was found to be of the order of machine precision on all three architectures.
To evaluate the energy efficiency of the algorithms, the simulations are run for 500 iterations with energy measurement enabled. Each algorithm is repeated five times on each architecture and the performance results presented here are averaged over the five runs. In the following sections the runtime, power and energy usage of the algorithms are investigated for each architecture.
CPUs
The CPU simulations are run using 24 MPI processes on a single ARCHER compute node. Table 3 show the runtimes and speed-up relative to the BL algorithm on CPU. The variation between the runtimes for different runs for each algorithm was less than 1% from the average. Table 3 : Runtime (in seconds) and speed-up (relative to BL) for all algorithms on the CPU using 24 MPI processes on ARCHER.
All other algorithms outperform the BL case and a speed-up of (∼1.7-2.2) relative to BL was achieved. Similar to the findings of [16] , it is worth noting that the SS algorithm achieved the lowest runtime on the same ARCHER CPU hardware. In the present case the runtimes of the minimal storage algorithms (SN, RA and SN2) are within 2% of each other. The difference in the runtime relative to the results reported by [16] was likely due to different compilers being used (the Cray C compiler version 2.4.2 in [16] , versus the Intel C compiler in the present work), which caused different optimisations to be performed at compile-time.
Figure 3(a) shows the power consumption of the algorithms for each iteration. The steady rise in power at early times, which is common to all runs, is due to Dynamic Voltage and Frequency Scaling (DVFS) seen in other studies of power consumption on CPUs [31] . The BL algorithm consumes the least amount of power per iteration whereas RA consumes the most. The higher power consumption for the compute intensive algorithms is expected due to the larger number of operations being performed. Note that SN and SN2 consume less power than RA since they reuse some of the thread-/processlocal variables. With respect to the intermediate storage algorithms, RS consumes less power relative to SS. This is quite different to the behaviour of the minimal storage algorithms (where more calculations lead to increased power consumption), and demonstrates that in order to be energy efficient, a balance needs to be achieved between the number of computations performed and the amount of local-storage used. Figure 3(b) shows the evolution of cumulative energy for each algorithm. As this represents both power consumption and run-time, the lower the value, the more energy efficient the algorithm is. In this figure the different classes of algorithms can be easily identified, irrespective of the power consumption behaviour. BL uses the highest amount of energy, whereas all the compute intensive algorithms are close to each other, consuming the least amount of energy. The intermediate storage algorithms consume more energy than the compute-intensive algorithms but still far less energy than BL. The energy saving of an algorithm, defined as the ratio of energy used by the BL algorithm to the energy used by that algorithm are 1.0, 1.73, 1.76, 2.12, 2.24, 2.18 for BL, RS, SS, RA, SN and SN2 respectively.
From the performance of different algorithms on the CPU we can conclude that: (a) the more computations that an algorithm performs with minimal read/write to RAM, the less energy is consumed; (b) energy consumption of an algorithm is directly proportional to the speed-up relative to BL; (c) the traditional way of writing large-scale finite difference codes (i.e. BL) is not optimal both in terms of runtime and energy consumption; (d) high power consumption by an algorithm does not necessarily imply that the algorithm is energy inefficient; and (e) improving the data locality of the thread-/processlocal variables has negligible effect on both runtime and energy consumption, which is expected as CPUs have relatively large caches.
KNLs
Simulations were run in parallel on the KNL using 64 MPI processes. Hyper-threading was not enabled, in order to keep the configuration on the KNL similar to that of the CPU. It was also found to have negligible effect on runtime in this case. As the memory used by the algorithms for the grid size considered can fit in the 16GB of on-chip high bandwidth memory (MCDRAM memory), the entire MCDRAM is utilised in the cache memory. Table 4 : Runtime (in seconds) and speed-up (relative to BL) for all algorithms on the KNL using 64 MPI processes. Table 4 gives the runtime of the algorithms for each run and the speedup achieved relative to BL. Similar to the CPU runs, the variation between individual runs of the same algorithm is ∼ 1-2% from the mean. A speed-up of ∼ 1.8 is achieved for all the algorithms relative to BL. No significant variation in speed-up/runtime is seen between the low and intermediate storage algorithms; this might be due to the relatively small grid size and also the use of high band-width memory. However, the trend is similar to that seen in the CPU (i.e. BL is the slowest, the compute intensive algorithms are the fastest, and intermediate storage algorithms are in-between). With increased grid size the differences might be more pronounced. It is also interesting to note that the SN2 algorithm, which is the same as SN except that the local variable evaluation is reordered, was 2.5% faster than the SN algorithm, unlike the CPU case where SN2 was slower than SN. Figure 4 (a) shows the power consumption per iteration. Unlike CPUs, there is not much effect of DVFS at early times. The BL algorithm consumes the highest amount of power on KNL, which might arise from the exchanges The following can be concluded from the results: (a) BL is the worst performing in terms of energy and runtime; (b) the energy consumption is proportional to the speed-up achieved by the algorithm relative to BL; (c) the effects of re-ordering of derivatives in the SN2 are apparent and became more pronounced with larger grid sizes; (d) high power consumption by an algorithm does not necessarily imply that the algorithm is energy inefficient, and measuring power alone is not the best way to interpret the energy efficiency of an algorithm. Table 5 shows the runtime of the algorithms on GPUs. Similar to the CPU and KNL, all algorithms perform better than BL. For the current grid size used, no specific trend can be found between low storage and intermediate storage algorithms. The fact that RA is slower than SS, which is not seen on other architectures, shows the difficulty of coding numerical methods on a GPU as these architectures are sensitive to data locality. This can be inferred Table 5 : Runtime (in seconds) and speed-up (relative to BL) for all algorithms on the NVIDIA Tesla K40 GPU using CUDA.
GPUs
from the runtimes of SN and SN2, with the only difference between them being improving data locality which gives a ∼ 30% reduction in runtime. The order in which we write the evaluations while performing compute-intensive operations does indeed have an effect on the runtime on a GPU as shown in [1] . On the GPU, the SN algorithm uses less power due to the lack of data locality. If we ignore the SN algorithm from the current discussion, then BL uses less power and interestingly SS uses more power. The power usage by the intermediate storage algorithms is higher than that of the most computationally-intensive algorithm (RA), again indicating that data locality has a critical impact on GPU performance. Figure 5(b) shows the evolution of cumulative energy. Similar to the CPU and KNL, the BL algorithm consumes the highest amount of energy. If we set aside the SN algorithm from the discussion, the SN2 algorithm uses the least amount of energy. Even though the runtime of RA is higher than the runtime of SS by 1.5%, the energy consumption of RA is less than SS by 5%. This means that RA algorithm can be improved further by writing the large expression in such a way that data locality is enhanced. The energy savings of the algorithms are 1.0, 1.89, 2.01, 2.1 and 2.16 for BL, RS, SS, RA and SN2 respectively, showing a similar trend to the CPU and KNL.
We can conclude the following for GPU: (a) the performance of BL is inefficient both in terms of runtime and energy; (b) data locality is inversely related to the power consumed; (c) care should be taken while optimising algorithms on GPU to improve data locality; (d) reading and writing to arrays should be minimised, even though the computational intensity of an algorithm increases; (e) all the algorithms consume half the amount of energy relative to BL.
Summary and conclusions
The route to exascale presents additional challenges to the numerical modeller. Not only is it crucial to consider runtime performance of the algorithms that underpin finite-difference codes, it is also important for model developers to consider the energy efficiency of their codes. This paper has highlighted the benefits of using an automated code generation framework such as OpenSBLI to readily vary the computational and memory intensity of the finite difference algorithms in order to evaluate their performance.
To summarise the findings, Figure 6 shows the energy consumed per iteration and the average power consumption per iteration for each algorithm on the three architectures considered. The CPU node use more power per iteration than a KNL or GPU. GPUs use about 50% less power than a CPU node and the KNL systems are in-between. The trend is similar for energy consumption per iteration on three architectures, which agrees with the findings of [12] , in which a second order central finite difference scheme was used for solving the acoustic diffusion model equation; CPUs use more energy and GPU use less and Xeon Phi (Knights Corner) processors are in-between. The conclusions from the present work include: (a) Across all architectures, the low storage/ high compute intensity algorithms are more energy efficient and gives the best performance; (b) Runtime reduction reduces the energy used by the algorithm; (c) For all the algorithms considered CPUs are the least energy efficient and GPUs are the most energy efficient, with the KNL architecture in-between; (d) For the high compute intensity algorithms, the energy saving of KNL and GPU are ∼ 2 and ∼ 5 compared to the CPU node; (e) For optimising the simulations on GPU, care should be taken to improve data locality. lations" (grant reference 671571). DJL was supported by an EPSRC Centre for Doctoral Training grant (EPSRC grant EP/L015382/1). The data behind the results presented in this paper will be available from the University of Southampton's institutional repository. The authors acknowledge the use of the UK National Supercomputing Service (ARCHER), with computing time provided by the UK Turbulence Consortium (EPSRC grant EP/L000261/1).
