We propose an algorithm that aims at minimizing the inter-node communication
volume for distributed and memory-efficient tensor contraction schemes on
modern multi-core compute nodes. The key idea is to define processor grids that
optimize intra-/inter-node communication volume in the employed contraction
algorithms. We present an implementation of the proposed node-aware
communication algorithm in the Cyclops Tensor Framework (CTF). We demonstrate
that this implementation achieves significantly improved performance for
matrix-matrix multiplication and tensor contractions on up to several hundred
modern compute nodes compared to conventional implementations that do not use
node-aware processor grids. Our implementation shows good performance when
compared with existing state-of-the-art parallel matrix multiplication
libraries (COSMA and ScaLAPACK). In addition to discussing the
performance of matrix-matrix multiplication, we also investigate the
performance of our node-aware communication algorithm for tensor contractions
as they occur in quantum chemical coupled-cluster methods. To this end, we
employ a modified version of CTF in combination with a coupled-cluster code
(Cc4s). Our findings show that the node-aware communication algorithm is also
able to improve the performance of coupled-cluster theory calculations for
real-world problems running on tens to hundreds of compute nodes.

Comment: 15 pages, 4 figures