The importance of stencil-based algorithms in computational science has
focused attention on optimized parallel implementations for multilevel
cache-based processors. Temporal blocking schemes leverage the large bandwidth
and low latency of caches to accelerate stencil updates and approach
theoretical peak performance. A key ingredient is the reduction of data traffic
across slow data paths, especially the main memory interface. In this work we
combine the ideas of multi-core wavefront temporal blocking and diamond tiling
to arrive at stencil update schemes that show large reductions in memory
pressure compared to existing approaches. The resulting schemes show
performance advantages in bandwidth-starved situations, which are exacerbated
by the high bytes per lattice update case of variable coefficients. Our thread
groups concept provides a controllable trade-off between concurrency and memory
usage, shifting the pressure between the memory interface and the CPU. We
present performance results on a contemporary Intel processor