Introduction
This paper addresses a conflict that many numerical simulations face. While the algorithms strive to reduce the number of unknowns and operations per accuracy via dynamic adaptivity in space and time, the hardware evolution asks for regular data access patterns. Algorithms favour data structures and data access patterns that allow them to invest work where it pays off most. Recent hardware generations favour uniform sequential data access with high arithmetic intensity that allows data to be piped through the cores. The present paper investigates strategies to team up the advantages of adaptive, octree-type meshes with regularly refined patches (blocks). The plain shallow water equations act as test bed for our approach, which is well-suited for numerous partial differential equations (PDEs). The former model a wide range of problems of great societal and technical relevance: examples include tsunamis [12] or storm surges on the continental scale, radiation-sensitive cooling processes in manufacturing, as well as flow in blood vessels on the cell scale. Hyperbolic PDEs are often characterised by a multitude of scales in space and time, such that accurate solutions demand very fine meshes in certain regions yet require a low time to solution, too. Tsunami prediction systems relying on hyperbolic simulations, e.g., have to yield results within minutes.
The multitude of scales of interest for hyperbolic solvers and their local yet transient behaviour in time imply that efficient computational meshes for these problems need to be dynamically adaptive. Furthermore, local time stepping is important, where individual subgrids march in time with different time step sizes determined by the wave propagation speed. The finer the granularity of the adaptivity in both space and time, the "better" the algorithm, at least in terms of the required number of unknowns and arithmetic operations.
If we express solvers with fine granular adaptivity in stencil notation, a large variety of computationally cheap stencils matching multiple local mesh refinement configurations is required. An application of a series of such stencils in turn exhibits non-uniform data access. However, modern multi- and manycore systems offering many hardware threads and broad vector facilities yield the best throughput for algorithms with low memory footprint and high arithmetic intensity that are split into a vast number of homogeneous tasks. This conflict of interest renders hyperbolic solvers on adaptive Cartesian grids a prototype challenge for novel high-performance computing architectures.
In the presented work, our meshes result from a k-spacetree formalism [16, 18], which for k = 2 yields a quadtree in two dimensions, and regular Cartesian grids (we denote them as blocks) are embedded into the leaves of the tree. Such a scheme facilitates dynamic, structured block adaptivity where the adaptivity leads to a low computational effort/memory footprint per accuracy ratio while a decent block size allows us to exploit vectorisation and loop parallelism. On the blocks, we apply the f-wave propagation method to solve the Riemann problems with uniform vectorised stencils [1, 4, 12]. More sophisticated solvers fit the scheme seamlessly, yet any increase in computational demands simplifies the challenge to design a high-throughput algorithm. The inter-block coupling is realised through bilinear conservative stencils in space and time from [14]. Similar techniques are proposed in [6, 7, 13], e.g., for other challenges.
If the size of the blocks can be chosen freely, multiple spacetrees induce the same adaptive Cartesian grid-with a regular grid being a special case of an adaptive one. As a rule of thumb, big blocks induce high computational throughput. Small blocks in turn facilitate fine granular adaptive meshes. The latter gains importance if the application faces hard memory constraints or if local time stepping is realised on a per-block basis. In practice, one has to choose a block size compromise. In the present paper, we start from kernels of fixed size and study their computational potential with respect to vectorisation and shared memory parallelisation of their loops (intra-block parallelism). For an artificial test case problem, we also relate the block size selection to adaptive mesh refinement with local time stepping and a concurrent processing of multiple blocks by multiple threads (inter-block parallelism).
Experiments show that the reduction in computational efficiency due to small block sizes is not always compensated by the reduction of total work due to adaptivity in space and time. It also becomes evident that the previously mentioned rule of thumb is invalid for big block sizes on some architectures and that the choice of one proper parallelisation variant or a hybrid depends on the block size and the number of cores available. These insights do not answer which block size to select or what strategy to follow if some instationary regions of the grid require huge regular grids while others require very accurately trimmed adaptive meshes. We hence formalise the grid traversal as an automaton running through the spacetree and augment this automaton with an analysed tree grammar [5]. Whenever the automaton encounters a set of spacetree leaves whose blocks can be fused into one bigger regular Cartesian block within the adaptive spacetree paradigm, these leaves are replaced accordingly, provided the performance studies permit it. This optimisation does not constrain the adaptivity pattern: once the grid refines in regions fused into a big regular grid, the automaton decomposes the block again. The proposed technique falls into the class of autotuning of stencil codes for multicore SIMD architectures. It offers several selling points: The fusion of the blocks is hidden from the kernels, which are specified over block sizes only; it does not increase the implementation complexity of the application code. The identification of grid regions well-suited to be fused is embedded into the tree traversal and anticipates dynamic adaptivity; it follows the grid evolution rather than imposing optimisation restrictions on the numerical algorithm's choice of discretisation. The loop fusion increases the algorithmic homogeneity of the data access pattern. It is independent of optimisations increasing the arithmetic intensity [8], though it can be combined with these. Finally, the optimisation can be tailored to distinct hardware characteristics without changing the compute kernels. Compared to our previous work [3], we detail the discussion with insights on the Xeon Phi architecture and particularities of the algorithms underlying the runtime tuning. New are an in-depth study of the two underlying parallelisation strategies with respect to hardware characteristics and an analysis of pessimistic vs. optimistic time stepping. We also start to put runtime improvements into relation to the mean life time of regular block assemblies.
Three research hypotheses drive the present paper:
(1) In terms of walltime, adaptivity with small block sizes as atomic mesh motif is not by nature superior to more regular grids. There are setups where the higher throughput of regular grids at least levels out the increase in mesh cells. (2) For these rather regular setups, the present approach exploits the studied architectures in terms of vector registers and cores. To study the methodology, getting as close to peak as possible is irrelevant; but the best throughput for the most advantageous block size acts as reference value and thus has to mirror the machine capabilities. (3) Block fusion brings together the two advantages: it allows the application to select reasonably small block sizes while a high throughput is retained.
The remainder is organised as follows: We introduce our mesh formalism in combination with an abstract presentation of the unknown updates in Section 2 before we study the computational kernels in Section 3. The two orthogonal parallelisation schemes are sketched afterward. In Section 5, we introduce the block fusion. Some numerical results (Section 7) follow the description of our experiments in Section 6 and reveal the potential of the scheme. They also validate the underlying assumptions. A brief outlook and a summary in Section 8 close the discussion.
Shallow Water Equations on Spacetrees
Let (0, 1) × (0, 1) ⊂ R^2 be the bounding box of the computational domain. We cut this domain equidistantly into k parts along each coordinate axis. This yields k^2 non-overlapping cubes of the same size. If we continue this splitting recursively while we decide per cube autonomously whether to refine or not, we end up with an adaptive Cartesian grid. Let C be the set of all cubes resulting from the construction process. The refinement operation induces a parent-child relation on C where each cube has either k^2 or no children at all. Cubes without children are leaves from the set C_L ⊆ C. The bounding box is the root.
The parent-child relation is a directed tree graph on C where a node's level is the path length from the root to the node. As the nodes of this graph are cubes, i.e. spatial elements, this tree is a k-spacetree [16]. k = 2 gives the special case of a quadtree. The height h of a spacetree is the length of the longest path in the graph. For the trivial spacetree with C = C_L = {(0, 1) × (0, 1)}, we end up with height zero. All experiments of the present work are based upon the PDE framework Peano [17] and thus use k = 3. We hence omit the parameter k from now on and refer to that data structure variant as spacetree (Figure 1).
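For illustration, such a spacetree can be held in a few lines of C++. The following is a minimal sketch with hypothetical names and pointerless child storage; it is not the actual Peano data structure.

#include <vector>

// A minimal sketch of a k-spacetree node in two dimensions:
// a refined node owns k*k children, a leaf owns none.
struct SpacetreeNode {
  int level = 0;                        // path length from the root
  std::vector<SpacetreeNode> children;  // empty for leaves, k*k entries otherwise

  bool isLeaf() const { return children.empty(); }

  void refine(int k) {                  // split the cube into k x k congruent cubes
    children.assign(k * k, SpacetreeNode{});
    for (auto& c : children) c.level = level + 1;
  }
};

// The root represents the bounding box (0,1) x (0,1); its level is zero, and the
// height of the spacetree is the longest root-to-leaf path.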
Volume-based discretisations of hyperbolic equations, or partial differential equations in general, such as finite volumes or finite elements directly yield stencils on any adaptive Cartesian grid induced by a spacetree formalism. While a direct spacetree-based stencil or system matrix derivation offers great flexibility with respect to the adaptivity, efficiency considerations as well as the intention to reuse existing software fragments suggest to add an additional mapping n : C_L → N that embeds an equidistant Cartesian mesh with n(c) × n(c) cells into each spacetree leaf. n ≡ 1 embeds a trivial grid of one cell into each leaf, i.e. each spacetree leaf is a cell of the computational grid Ω_h. In return,

n ≡ k^ℓ, ℓ ∈ N, (1)

is equivalent to an n ≡ 1 spacetree grid created in two steps: we take a spacetree and embed an additional regular spacetree of height ℓ into each leaf.
In the present paper, we start from a fixed n(c) = n ≥ 2 ∀c ∈ C_L, and call the embedded regular Cartesian grids blocks. The spacetree then defines a block-structured adaptive Cartesian grid Ω_h, and it yields a non-overlapping domain decomposition of Ω_h. If we extend each n × n block by a halo layer of n̄ cells, we obtain an overlapping domain decomposition. The unknowns per cell are denoted as q.
Given a stencil code mapping (n + 2n̄) × (n + 2n̄) unknowns onto new values within the n × n grid and intergrid operators mapping an n × n grid onto the halo layer of another grid, we can run over the spacetree's finest level, fill the halo layers of each individual block and update the unknowns (Algorithm 1).

Algorithm 1: Block update of all leaf blocks.
  for all leaf blocks do
    copy data from q_old of surrounding blocks into ghost cell entries of q_old
    for i ∈ {n̄, ..., n̄ + n − 1} × {n̄, ..., n̄ + n − 1} do    ⊳ Evaluates neighbours of cell i.
      compute net updates for cell i
    end for                                                   ⊳ Loop also determines ∆t.
    for i ∈ {n̄, ..., n̄ + n − 1} × {n̄, ..., n̄ + n − 1} do
      time step update of cell i
    end for
  end for
  switch q_old and q_new for all updated blocks
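To make the index ranges of Algorithm 1 concrete, the following C++ sketch spells out the block storage and the two sweeps over the interior cells. The data layout, the per-cell helper routines and the CFL factor are illustrative assumptions, not the kernels of [1, 2].

#include <vector>

// One block stores (n + 2*nb)^2 cells, three unknowns (h, hu, hv) per cell and two
// time levels; nb is the halo width written as n-bar in the text.
struct Block {
  int n, nb;
  std::vector<float> qOld, qNew;
  Block(int n_, int nb_)
      : n(n_), nb(nb_),
        qOld(3 * (n_ + 2 * nb_) * (n_ + 2 * nb_)),
        qNew(3 * (n_ + 2 * nb_) * (n_ + 2 * nb_)) {}
  int first(int i, int j) const { return 3 * (j * (n + 2 * nb) + i); }  // first unknown of cell (i,j)
};

// Hypothetical per-cell kernels: the first gathers the edge contributions of cell (i,j)
// from qOld and tracks the largest wave speed, the second applies them to qNew.
void computeNetUpdates(Block& b, int i, int j, float dx, float& maxWaveSpeed);
void applyNetUpdates(Block& b, int i, int j, float dt, float dx);

void updateBlock(Block& b, float dx, float cflFactor, float& dt) {
  float maxWaveSpeed = 0.0f;
  for (int j = b.nb; j < b.nb + b.n; ++j)      // first sweep: net updates per interior cell;
    for (int i = b.nb; i < b.nb + b.n; ++i)    // the halo is read but never written
      computeNetUpdates(b, i, j, dx, maxWaveSpeed);
  dt = cflFactor * dx / maxWaveSpeed;          // the loop also determines the admissible dt
  for (int j = b.nb; j < b.nb + b.n; ++j)      // second sweep: apply the net updates
    for (int i = b.nb; i < b.nb + b.n; ++i)
      applyNetUpdates(b, i, j, dt, dx);
  b.qOld.swap(b.qNew);                         // switch the two time levels of this block
}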
The Shallow Water Block Update Kernels
In the present paper, we solve the shallow water equations for q = (h, u, v), given in conservative form as

∂_t q + ∂_x F(q) + ∂_y G(q) = S(t, x, y), F(q) = (u, u^2/h + g h^2/2, uv/h)^T, G(q) = (v, uv/h, v^2/h + g h^2/2)^T. (2)
h denotes the height of the water column (water depth), u and v encode the momentum in x- and y-direction, and g is the gravitational constant (g := 9.81 m/s^2). The source term S(t, x, y) models effects of varying ocean depth (bathymetry) or frictional or Coriolis forces. In this paper, we neglect them. Our solver routines [2] realise an explicit finite volume scheme where a pair of unknown triples q is assigned to each cell of the grid to allow us to store the previous and the current time step.
The six values are accompanied by a time stamp of the newer value plus the time interval spanned by the two solutions.
This leads to two computational kernels executed in each time step per block as soon as the halo layer, which also describes global boundary conditions, is initialised:
• Computation of net updates: For each cell, we derive from the neighbouring cell quantities q_old the net updates ∆Q_h, ∆Q_u, ∆Q_v. They determine the impact of waves on the cell quantities that enter or leave the respective grid cell through the edges. In the classical formulation, this step also derives from the wave speeds the biggest time step size ∆t that one can choose without violating the CFL condition.
• Updating the unknowns: For each cell, the quantities in q are then updated according to the balance equation (3), which scales the accumulated net updates with the time step size (a sketch follows below).
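In the present notation, such a balance takes the standard form of finite volume wave propagation schemes [4, 12]; with ∆x = ∆y, and with ∆Q accumulating per cell the net updates of its four edges, one plausible rendering (the exact scaling factors in (3) may differ) is

q_new = q_old − (∆t/∆x) · ∆Q, with ∆Q = (∆Q_h, ∆Q_u, ∆Q_v).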
The loop kernel to compute the net updates approximately solves a Riemann problem on each edge, i.e. solves the one-dimensional analogue of equation (2) for a piecewise constant initial condition given by the quantity vectors q_l and q_r obtained from the two adjacent cells. As approximate Riemann solver, we follow the wave propagation approach by [4, 12], which determines the net updates from so-called f-waves computed from a locally linearised Riemann problem. Our code provides a careful implementation of the resulting f-wave solver that allows auto-vectorisation [1]. The resulting net update kernel has a computational intensity, i.e. a ratio of floating-point operations to accessed bytes of memory, of around two. In contrast, (3) has a ratio of 2/3.

The ghost data exchange between different blocks is a copy of rectangular grid fragments as long as all blocks march in time with the same time step size and induce the same mesh size. Each ghost cell then coincides with an inner cell of an adjacent block and q_old from there is copied into the ghost cells' q_old. If blocks can advance in time with different speed due to local time stepping, we have to interpolate linearly in time to determine the ghost layer's q_old from q_old and q_new of the adjacent block. For adaptive grids, we have to interpolate linearly both in time and space when we initialise the ghost layers, as the CFL condition makes the maximum time step size scale linearly with the mesh size. To facilitate this, we have to store per block the current and the previous time step. It is hence a natural choice to reuse half of these records to hold ∆Q = (∆Q_h, ∆Q_u, ∆Q_v) throughout the block update. Such an in-situ scheme allows us to write the block updates without additional temporary data structures. Technical details and a runtime model are given in [14].
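Returning to the net update kernel: on a single edge with left and right states q_l and q_r, the f-wave construction of [4, 12] can be sketched as follows. The Roe-type averages below are the textbook choice and are stated here as an assumption; the concrete linearisation used in the kernels [1, 2] may differ.

δ = F(q_r) − F(q_l) = β_1 r_1 + β_2 r_2 =: Z_1 + Z_2,

with speeds s_{1,2} = û ∓ √(g ĥ), eigenvectors r_{1,2} = (1, s_{1,2})^T, ĥ = (h_l + h_r)/2 and û the Roe-averaged velocity of u_l/h_l and u_r/h_r. The waves Z_p with s_p < 0 contribute to the net update of the left cell, those with s_p > 0 to the net update of the right cell; the transverse momentum component is advected passively.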
Water height h and bathymetry usually allow us to predict the maximum time step size sharply. It is hence a natural choice to rewrite the pessimistic scheme from above, which determines the time step size from ∆Q, into an optimistic variant: Here, we anticipate the scaling in (3) and merge the two algorithmic steps. If the net update computation afterwards reveals that the CFL condition has been violated, we can roll back the solution and restart the update with a halved time step size as q_old remains available. Roll-backs have never been observed for the present experiments.
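A sketch of the optimistic variant, reusing the Block type from the sketch following Algorithm 1; updateBlockMerged (the merged net update and cell update sweep) and cflFactor are hypothetical names introduced here for illustration.

// Merged sweep over one block: computes net updates, applies them to qNew and
// reports the largest observed wave speed (hypothetical helper).
void updateBlockMerged(Block& block, float dt, float dx, float& maxWaveSpeed);

// One optimistic time step: dt is predicted from the wave speeds of the previous
// step; qOld is only swapped away once the step is accepted, so a rollback amounts
// to simply redoing the sweep with dt/2.
float advanceOptimistically(Block& block, float dx, float cflFactor,
                            float previousMaxWaveSpeed) {
  float dt = cflFactor * dx / previousMaxWaveSpeed;          // optimistic prediction
  for (;;) {
    float observedMaxWaveSpeed = 0.0f;
    updateBlockMerged(block, dt, dx, observedMaxWaveSpeed);  // writes qNew only
    if (dt <= cflFactor * dx / observedMaxWaveSpeed) {
      block.qOld.swap(block.qNew);                           // accept the step
      return observedMaxWaveSpeed;                           // prediction for the next step
    }
    dt *= 0.5f;                                              // CFL violated: qOld still valid, retry
  }
}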
Concurrency, Vectorisation and Parallelisation
The classic block-wise processing scheme exhibits two independent levels of concurrency. For the following discussion, we rely on a recursive formulation of the spacetree traversal (Algorithm 2) where a push-back automaton traverses the tree mirroring a depth-first search and invokes the block updates on all leaves. Here, both the spacetree and unknown traversal exhibit concurrency.

Algorithm 2: Parallel block update with optimistic time stepping.
  parallel for all leaf blocks c do                                 ⊳ Inter-block
    copy from q_old of surrounding blocks into ghost cells of q
    parallel for i ∈ {n̄, ..., n̄ + n − 1} × {n̄, ..., n̄ + n − 1} do   ⊳ Intra-block
      compute net updates for cell i
    end parallel for
    parallel for i ∈ {n̄, ..., n̄ + n − 1} × {n̄, ..., n̄ + n − 1} do   ⊳ Intra-block
      time step update of cell i
    end parallel for
  end parallel for
  If necessary: roll back and restart computation with ∆t/2
Upstream, our traversal scheme exhibits concurrency on the block level. Children of one spacetree node can be processed concurrently for equidistant time stepping, as their kernel result depends only on q_old. For local time stepping or adaptive grids where interpolation in time is required, write races however occur. These are resolved by red-black colouring [5] or multiscale red-black colouring [15]. Given a maximum hardware concurrency level p, we can update up to p blocks descending from a refined node c in parallel. If we apply that scheme to SIMD, one vector register updates entries from p different blocks at a time. Such an inter-block vectorisation equals a partial loop permutation of the inner and outer loop in Algorithm 2. If applied to multiple threads, each of the p threads handles a distinct block before all threads continue with the subsequent p blocks. This is inter-block parallelism.
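The two parallelisation variants map onto Intel TBB and SIMD pragmas roughly as in the following sketch. The container of leaf blocks, the reuse of the Block routines from above and the pragma placement are assumptions for illustration, not the actual implementation.

#include <tbb/parallel_for.h>
#include <vector>

// Inter-block parallelism: p threads process p distinct leaf blocks concurrently.
void updateAllBlocks(std::vector<Block>& leafBlocks, float dx, float cflFactor) {
  tbb::parallel_for(std::size_t(0), leafBlocks.size(), [&](std::size_t b) {
    float dt = 0.0f;
    updateBlock(leafBlocks[b], dx, cflFactor, dt);   // sequential kernel from the sketch above
  });
}

// Intra-block parallelism: the loops inside one block are parallelised and vectorised.
void updateBlockIntra(Block& b, float dx, float dt) {
  tbb::parallel_for(b.nb, b.nb + b.n, [&](int j) {   // rows of the block in parallel
#pragma simd                                          // Intel pragma named in the text
    for (int i = b.nb; i < b.nb + b.n; ++i)           // cells of one row in vector registers
      applyNetUpdates(b, i, j, dt, dx);
  });
}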
We can refine the latter's block-synchronous scheme into a task-based formalism where the dependencies between blocks stemming from local time stepping and adaptivity are represented by a graph. It is then up to the scheduler to resolve potential block update races [9]. Task-based parallelism, however, is overengineering here. Our runtime per block depends linearly on the number of unknowns, i.e. the p block updates run equally long. This can change for more sophisticated Riemann solvers [1] where the update time per cell depends on entries in q_old.
Block Fusion
Obviously, the concurrency levels of the intra- and inter-block parallelisation depend on n. The bigger n, the higher the intra-block concurrency. A big n however makes the underlying spacetree shallower for a given Ω_h. Consequently, the bigger n, the smaller the inter-block concurrency. Once n is fixed, both concurrency levels are fixed for a given grid. We introduce a marker M on all spacetree nodes that characterises both the concurrency and the regularity of the grid locally: M(c) gives the number of levels by which the subtree rooted in c is completely, i.e. regularly, refined, with M(c) = 0 for leaves and for nodes whose subtree is not regular.
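Written as an analysed tree grammar attribute, such a marker admits a bottom-up recursion. The following is a sketch consistent with the fusion rule below, not necessarily the exact definition (4):

M(c) = 0, if c ∈ C_L,
M(c) = 1 + M(c'), if all k^2 children c' of c carry the same marker value and each child is either a leaf or has M(c') > 0,
M(c) = 0, otherwise.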
We observe from (1) that we can take any node c in the spacetree with M(c) > 0 and replace all the blocks within the nodes deriving from c with a new block in c with k^{M(c)} n × k^{M(c)} n cells. Such a replacement preserves Ω_h. Our block fusion algorithm then reads as follows: First, traverse the spacetree with Algorithm 2. Embed the computation of (4) into this traversal, i.e. compute the markers on-the-fly. Second, whenever the traversal encounters M(c) > 0 in subsequent traversals, embed a corresponding patch into this spacetree node c. Continue to traverse, but copy all n × n leaf blocks overlapping with the new fused block into the new data structure once the time step is completed. Flag these leaves afterwards and free their blocks. Third, whenever one encounters a flagged leaf, initialise the ghost data of the fused block on a coarser level instead of the ghost data associated with the leaf. Pointers to these data are inherited recursively throughout the traversal. Finally, whenever the call stack of the recursive traversal is reduced and the automaton ascends through a fused block, update this block. The fusion's interplay with dynamic adaptivity is obvious: If a spacetree leaf whose block got fused into a larger one identifies that the grid changes, it breaks down the fused block back into its n × n components. Such a breakdown is simple copying.
Block fusion enables us to shift the concurrency profile of a grid with small n from a high inter-block concurrency towards high intra-block concurrency on-the-fly. In practice, fusing wherever possible is not a good strategy. But it does make sense to establish a performance model predicting whether fusion along M(c) levels pays off or whether it is better to fuse fewer levels along the spacetree hierarchy, and then to make use of the fusion paradigm. We also point out that any throughput improvement due to fusion, in combination with a reduction of halo cell copying (within a fused block, (k^{M(c)} − 1)^2 fewer halo cell faces have to be exchanged per subsequent spacetree traversal, e.g.), first of all has to amortise a certain fusion overhead comprising the copying of block data into the fused block and back if the grid structure changes. Finally, we reiterate that fusion has an impact on local time stepping if the time stepping is realised per block. All three ingredients depend on solver, setup and hardware, and thus fusion decisions should not be generic. The present paper hence highlights insights that can guide fusion; it does not study one particular fusion strategy.
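A performance model of this kind can be as simple as the following C++ sketch. The throughputs and the copy cost are assumed to be measured or calibrated beforehand, and all identifiers are hypothetical.

// Decide whether fusing a regular region into one big block pays off. The gain per
// time step stems from the higher throughput of the fused block; the one-off cost is
// copying the cell data into the fused block and back once the grid changes again.
bool fusionPaysOff(double cellsInRegion,
                   double throughputSmall,           // cell updates/s, original block size
                   double throughputFused,           // cell updates/s, fused block size
                   double fuseCopySeconds,           // copy in plus copy back on break-up
                   double expectedLifeTimeSteps) {   // mean life time (mlt) of the fusion
  const double gainPerStep =
      cellsInRegion / throughputSmall - cellsInRegion / throughputFused;
  return expectedLifeTimeSteps * gainPerStep > fuseCopySeconds;
}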
Experiment Setup
The present experiments focus on Sandy Bridge and Xeon Phi processors with code compiled by the Intel compiler 14.0.2. Shared memory parallelisation is realised through Intel's TBB. The Sandy Bridge-EP Xeon E5-2650 processor with 2 × 8 cores and 4 × 16 GByte RAM at 2.0 GHz acts as driver for two Xeon Phi 5110P with 8 GByte at 1.054 GHz. Experiments either run on the two 8-core processors or on one coprocessor in native mode. Hybrid runs, runs with two Phi coprocessors or runs on clusters of such setups are beyond scope. All figures are normalised runtimes (throughput) given as cell updates per second, i.e. computations of q_new divided by walltime. They include administrative cost such as maintaining the spacetree structure, determining (4) or the fusion and inverse-fusion copy overhead, and are obtained in single precision.
Our testbed is an artificial wave propagation scenario that starts from Figure 2 as initial water height and applies simple settings: The dynamic refinement criterion evaluates the maximum slope between the four corner points of each block and refines the corresponding leaf if this slope exceeds 0.01. It coarsens grid regions if the maximum slope of all contained blocks underruns 0.001. Our bathymetry is constant everywhere. We linearly scale the time step with the mesh size to meet the CFL condition, setting it to 0.001 · 3^{−ℓ}, where ℓ is the level of a block. All meshes are constrained by a minimum mesh size. If Ω_h is regular, this is the cell width. If Ω_h is adaptive, we start from a 3n × 3n grid and let the refinement criterion refine the mesh, constrained such that no mesh cell underruns the minimum mesh size.
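For illustration, the refinement criterion can be sketched in a few lines of C++. How exactly the slope between the corner points is evaluated, and all identifiers, are assumptions made here.

#include <algorithm>
#include <cmath>

enum class RefinementAction { Refine, Keep, Coarsen };

// Evaluate the maximum water height slope between the four corner cells of a block of
// physical width 'width' and compare it against the two thresholds from the text.
// Coarsening is only carried out if all blocks of a grid region return Coarsen.
RefinementAction refinementAction(float hBottomLeft, float hBottomRight,
                                  float hTopLeft, float hTopRight, float width) {
  const float maxDifference = std::max({
      std::fabs(hBottomLeft - hBottomRight), std::fabs(hBottomLeft - hTopLeft),
      std::fabs(hTopRight - hBottomRight),   std::fabs(hTopRight - hTopLeft),
      std::fabs(hBottomLeft - hTopRight),    std::fabs(hBottomRight - hTopLeft)});
  const float slope = maxDifference / width;
  if (slope > 0.01f)  return RefinementAction::Refine;
  if (slope < 0.001f) return RefinementAction::Coarsen;
  return RefinementAction::Keep;
}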
All experimental setups rely on the Peano framework [17] and thus on three-partitioning. With k = 3, we start from n ∈ {6, 12} as smallest block sizes. They are the smallest configurations where interpolation and restriction at grid resolution boundaries simplify, as centres of coarse cells coincide with cell centres of finer and ghost meshes. For our search for advantageous block sizes in the huge n parameter space, we multiply these base values by three, mirroring k, step by step.
Results
We first track the number of cell updates over time for different block sizes on adaptive grids (Figure 3). The smaller the block size n, the smaller the number of cell updates at a particular time, as the total number of cells shrinks with finer adaptive granularity and as blocks with big cell sizes advance in time ahead of neighbouring blocks with small cells. For the latter, we observe the expected 3:1 pattern: per time step of a coarse cell, the next finer cells have to perform three time steps. The impact of dynamic adaptivity is not recognisable on a logarithmic scale for such a short observation interval. We perceive that a reduction of the block size by a factor of nine ("two block sizes smaller") reduces the total number of cells by around a factor of two. It reduces the memory footprint. The reduction pairs up with the fact that the few coarse cells have to be updated less frequently than fine blocks. Our patch update measurements reflect textbook expectations and do not contribute any new insight. However, the measurements characterise the interplay of block choice and workload footprint as well as workload homogeneity for the following experiments.
Observation. For the present experiments, a reduction of the block size reduces the number of arithmetic operations to run the simulation with a given accuracy.
We next study the arithmetic throughput on a single core. Here, we distinguish Xeon Phi from Sandy Bridge, optimistic from pessimistic time stepping, and kernels vectorised with #pragma simd from kernels without annotation. For all measurements in this paper, the spacetree management required a single-digit percentage of the overall runtime. We hence focus on the impact of manual SIMDisation on the compute-intensive block updates and the ghost cell exchange. Furthermore, only intra-block vectorisation made a performance difference, whatever the setup. We ascribe this to the reduced bandwidth available to each vector register in the inter-block scheme and do not study it further, though it might play a role on future architectures providing data gather and scatter such as AVX2.
For regular grids, we observe that the explicit usage of simd pragmas yields a speedup of around four on the Sandy Bridge as soon as the block size exceeds n = 128 (Table 1). For smaller blocks, the vectorisation speedup is robust and increases linearly with the patch size; however, the loop ranges are too small to exploit all vector registers. Switching from pessimistic to optimistic time stepping gives another five percent. From here on, only optimistic time stepping is studied further. The measurements on the Xeon Phi reveal qualitatively the same behaviour (Table 2). The impact of the vectorisation however is, enabled by the hardware, twice as big.
All results are in accordance with [1], which works with higher clock rates. We continue by rerunning the experiments on adaptive grids where we constrain the finest mesh width to 1/26,244. Our measurements in Table 3 track the throughput, the maximum spacetree height and the number of blocks averaged over an observation interval of t = 0.001. Sandy Bridge preserves the regular grid throughput for reasonably big block sizes; for small n (36, e.g.) adaptive grids give around 70 percent of the regular throughput. Vectorisation yields a speedup of up to four. The adaptive grid even yields a slightly higher throughput than the regular case, maybe due to the tiling's positive impact on cache reuse [11]. Xeon Phi behaves differently. Adaptivity halves the throughput of the vectorised code while the variant without vectorisation is adaptivity-invariant. Interpolation and restriction along resolution boundaries in this application field are predominantly memory movements, cannot be vectorised, and might cause cache misses, though the measurements do not reveal the exact reason for this breakdown. From here on, all experiments study the adaptive case.

Observation. The present code can exploit vector instruction sets. On single cores, it is advantageous to make n as big as possible.
The simple "bigger patches, better throughput" paradigm becomes invalid once shared memory parallelisation is enabled. Selected runtimes are shown in Figures 4 and 5. On Sandy Bridge, we obtain the best throughput with n = 324. Up to this size, the impact of the inter-block parallelisation outweighs the intra-block benefit. Block sizes beyond 486 have a higher intra-block concurrency than an inter-block scheme. Starting from this size, the throughput deteriorates the bigger n becomes. Hyperthreading does not pay off. The Phis again behave differently. Two threads per core, i.e. half the physical thread count, are the configuration of choice here, while one out of 60 cores is reserved for the operating system. Furthermore, the throughput improves linearly with the block size. No deterioration threshold is observed. For reasonably big problem sizes, the coprocessor finally overtakes its host, mainly due to its ability to exploit the intra-block parallelism.
Observation. For ns saturating the scalability, we obtain an efficiency of around 7/16 on the Sandy Bridge and 10/60 on Xeon Phi. As the algorithm has low arithmetic intensity, we can state that it scales.
For real setups, plain throughput measurements are misleading. Instead, we have to put the throughput in relation to the updates required, i.e. to science per flop [10]. In this case, the sweet spot moves to the left: it champions smaller block sizes. Although such a metric is strongly problem-dependent, we can derive examples from the present experiments. For a fixed scenario, we frequently observe a halving of computational load when we reduce the block size to one ninth. Whenever the throughput increases by more than a factor of two due to an increase of n by a factor of nine, it pays off to run with the bigger block size right away if memory permits. We observe such a behaviour on the Xeon Phi when the throughput of ≈ 5.2 · 10^8 for n = 1458 drops to a throughput smaller than 1.5 · 10^8 for n = 162 (not shown).
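The underlying trade-off can be made explicit with a back-of-the-envelope relation. Let C(n) denote the total number of cell updates required with block size n and P(n) the measured throughput; both symbols are introduced here for illustration only. Then

walltime(n) ∝ C(n)/P(n), so walltime(9n) < walltime(n) ⇔ P(9n)/P(n) > C(9n)/C(n) ≈ 2,

i.e. the ninefold block size pays off as soon as its throughput more than compensates the roughly doubled number of cell updates.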
Observation. If scenarios cannot significantly reduce the number of unknowns through adaptivity, it sometimes pays off to choose a very large n to reduce the total time to solution.
We finally study block fusion. Previous experiments characterise the scalability and vectorisation suitability of certain block sizes, so we know the potential gain of fusion. However, a quantification of the fusion overhead is so far missing: we do not know how expensive it is to move from one grid fragment into a fused representation, and how the fusion's fragmentation of the ghost layers affects the runtime. Transition cost means memory movements, as the computation of (4) is computationally insignificant. These costs depend on the mean life time (mlt) of the blocks, i.e. the number of time steps before a fused block is destroyed again by the adaptivity criterion. While we switch off multicore parallelisation to study the fusion overhead (otherwise the overhead might be hidden behind other effects), vectorisation can have an impact as the vector architectures provide wide moves. Our studies focus on one ensemble of small blocks with n = 12 in the grid. It is chosen such that the blocks can be fused into one block of size n ∈ {36, 324} (Figure 6). Other configurations yield similar results. We start from the throughput for the n configuration (no fusion) as best-case result and successively break it down into smaller blocks until we end up with the original block size n = 12. For n, we obtain results in-between the span of regular to adaptive experiments. They are better than strongly adaptive runs as they average over the observation time and the grid becomes more regular. They are worse than regular runs as the grid is adaptive. For mlt ≥ 8, we observe that the block transition cost is amortised, i.e. the fused throughput becomes saturated. For ℓ = 1, i.e. the fusion of 3 × 3 blocks, our results recover the throughput of the big block sizes. A similar reasoning holds for the fusion of 27 × 27 blocks. If we fuse bigger ensembles, the fusion stagnates at two thirds of the best-case throughput on Sandy Bridge. On the Xeon Phi, we obtain just half of the throughput.
Observation. Due to on-the-fly block fusion we can preserve the throughput of block sizes that are up to nine times bigger than the chosen size.
Conclusion and Outlook
The present paper studies a dynamically adaptive shallow water equation solver that starts from an adaptive spacetree and embeds blocks of fixed size into the spacetree's leaf nodes. Our results confirm the natural intuition that the throughput of the solver is, as a rule of thumb, the better the bigger the block size, and they show that the two introduced parallelisation concepts have the potential to exploit current hardware. As small block sizes are desirable due to memory and total time-to-solution considerations, we propose an on-the-fly identification of regular grid regions and fuse the blocks there into big data chunks. A study on the overhead of this fusion reveals that dynamic block fusion helps to harvest the scalability and vectorisation characteristics of big block sizes even though the algorithm may work with small blocks of small memory footprint that track the solution characteristics accurately. However, this statement holds only if there are reasonably structured, i.e. regular, grid regions and the mean time to reconstruction is not excessively small.
The proposed performance studies rely on one simple partial differential equation solver. However, we consider our algorithmic principles to be of potential relevance for a broader community. When the techniques are adopted, it is of relevance to switch from manually tuned kernels to automatically tuned loop assemblies as generated by modern stencil compilers, i.e. to benefit from the combination of the present approach, which improves the homogeneity of the arithmetic data access, with techniques increasing the arithmetic intensity [8]. One enabling code feature for this future work, ghost layers with n̄ ≥ 2, is already sketched.
