Data movement is the key bottleneck in most stencil codes, and prior
research has shown that optimisations that schedule across loops to improve
data locality perform particularly well. However, in many large PDE
applications such optimisations cannot be applied by compilers: there are
many options, execution paths, and data per grid point, many of them
dependent on run-time parameters, and the code is spread across different
compilation units. In this paper, we adapt iteration space slicing, an
optimisation that improves data locality, for use in large OPS applications
on both shared-memory and distributed-memory systems, relying on run-time
analysis and delayed execution. We evaluate our approach on a number of
applications, observing speedups of 2× on the CloverLeaf 2D/3D proxy
applications, which contain 83 and 141 loops respectively, 3.5× on the
linear solver TeaLeaf, and 1.7× on the compressible Navier-Stokes solver
OpenSBLI. We demonstrate strong and weak scalability up to 4608 cores of
CINECA's Marconi supercomputer. We also evaluate our algorithms on Intel's
Knights Landing, demonstrating that throughput is maintained as the problem
size grows beyond 16 GB, and we present scaling studies on up to 8704 cores.
The approach is generally applicable to any stencil DSL that provides
per-loop data access information.
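
As an illustration of what per-loop data access information means in
practice, the sketch below shows a single parallel loop written against the
OPS C API: each argument declares the dataset it touches, the stencil
through which it is accessed, and whether it is read or written, which is
the information the run-time analysis consumes. This is a minimal sketch,
not a complete OPS program; the kernel, datasets (u, u_new), block, and grid
extents (nx, ny) are illustrative names, not taken from the applications
above.

  /* Five-point heat update: argument 0 (u) is read through a 5-point
     stencil, argument 1 (u_new) is written pointwise. */
  void heat_kernel(const double *u, double *u_new) {
    u_new[OPS_ACC1(0,0)] = 0.25 * (u[OPS_ACC0(1,0)] + u[OPS_ACC0(-1,0)]
                                 + u[OPS_ACC0(0,1)] + u[OPS_ACC0(0,-1)]);
  }

  /* In user code, after ops_init() and data declarations: */
  int pts5[] = {0,0, 1,0, -1,0, 0,1, 0,-1}, pts1[] = {0,0};
  ops_stencil S2D_5PT = ops_decl_stencil(2, 5, pts5, "5pt");
  ops_stencil S2D_1PT = ops_decl_stencil(2, 1, pts1, "1pt");
  int range[] = {1, nx-1, 1, ny-1};

  /* The access descriptors below are what enable delayed execution:
     the loop can be queued rather than executed immediately, then
     analysed and scheduled together with neighbouring loops. */
  ops_par_loop(heat_kernel, "heat", block, 2, range,
               ops_arg_dat(u,     1, S2D_5PT, "double", OPS_READ),
               ops_arg_dat(u_new, 1, S2D_1PT, "double", OPS_WRITE));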