Multicore-optimized wavefront diamond blocking for optimizing stencil updates
The importance of stencil-based algorithms in computational science has
focused attention on optimized parallel implementations for multilevel
cache-based processors. Temporal blocking schemes leverage the large bandwidth
and low latency of caches to accelerate stencil updates and approach
theoretical peak performance. A key ingredient is the reduction of data traffic
across slow data paths, especially the main memory interface. In this work we
combine the ideas of multi-core wavefront temporal blocking and diamond tiling
to arrive at stencil update schemes that show large reductions in memory
pressure compared to existing approaches. The resulting schemes show
performance advantages in bandwidth-starved situations, which are exacerbated
by the high bytes per lattice update case of variable coefficients. Our thread
groups concept provides a controllable trade-off between concurrency and memory
usage, shifting the pressure between the memory interface and the CPU. We
present performance results on a contemporary Intel processor.
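
The link between bytes per lattice update and bandwidth-bound performance can be made concrete with a small traffic model. The following is a minimal sketch, not the authors' performance model. It assumes double-precision data, one read stream and one write stream for the grid, an optional write-allocate stream, one extra read stream per coefficient array, and an idealized temporal blocking factor that divides main-memory traffic evenly. The function names, the seven-coefficient case, and the 40 GB/s bandwidth are hypothetical values chosen for illustration.

```python
def bytes_per_lup(n_coeff_arrays, word_bytes=8, write_allocate=True, time_block=1):
    """Idealized main-memory traffic per lattice update (LUP), in bytes.

    Assumption: perfect spatial reuse in cache, so each array is streamed
    from memory once per sweep, and temporal blocking over `time_block`
    time steps divides that traffic evenly.
    """
    streams = 2 + n_coeff_arrays  # read source grid, write destination grid, read coefficients
    if write_allocate:
        streams += 1  # destination cache lines are loaded before being overwritten
    return streams * word_bytes / time_block


def bandwidth_bound_glups(mem_bw_gb_per_s, bytes_per_update):
    """Upper bound on giga lattice updates per second set by memory bandwidth."""
    return mem_bw_gb_per_s / bytes_per_update


if __name__ == "__main__":
    bw = 40.0  # hypothetical sustained memory bandwidth in GB/s
    for label, n_coeff in (("constant coefficients", 0), ("7 variable coefficients", 7)):
        for t in (1, 4):
            bpl = bytes_per_lup(n_coeff, time_block=t)
            print(label, "time_block =", t, "bytes/LUP =", bpl,
                  "bound (GLUP/s) =", bandwidth_bound_glups(bw, bpl))
```

Under these assumptions the variable-coefficient case moves several times more data per update than the constant-coefficient case, which is why reducing memory traffic through temporal blocking matters most there.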
High-performance SVD partial spectrum computation
We introduce a new singular value decomposition (SVD) solver
based on the QR-based Dynamically Weighted Halley (QDWH) algorithm for solving partial-spectrum SVD problems (QDWHpartial-SVD).
By optimizing the rational function underlying the algorithm in the desired part of the spectrum only, the QDWHpartial-SVD
algorithm efficiently computes a fraction (say 1-20%) of the leading
singular values/vectors. We develop a high-performance implementation of QDWHpartial-SVD on distributed-memory manycore
systems and demonstrate its numerical robustness. We perform a
benchmarking campaign against counterparts from the state-of-the-art numerical libraries across various matrix sizes using up to 36K
MPI processes. Experimental results show performance speedups
for QDWHpartial-SVD of up to 6X and 2X against vendor-optimized
PDGESVD from ScaLAPACK and KSVD, respectively, on a Cray XC40 system
using 1152 nodes, each with two-socket 16-core Intel Haswell CPUs.
We also port our QDWHpartial-SVD software library
to a system composed of 256 nodes with two-socket 64-core AMD
EPYC Milan CPUs and achieve a performance speedup of up to 4X compared to vendor-optimized PDGESVD from ScaLAPACK. We also
compare energy consumption for the two algorithms and demonstrate how QDWHpartial-SVD can further outperform PDGESVD
in that regard by performing fewer memory-bound operations
- …
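
The role of the rational function in a QDWH-type iteration can be illustrated on scalars. Each iteration maps every singular value x of the scaled matrix to x(a + b x^2)/(1 + c x^2), driving it toward 1. The sketch below uses the standard full-spectrum QDWH weights as given in the published QDWH literature, and should be checked against the original papers before reuse. It is not the partial-spectrum optimization described in the abstract, and the function names, tolerance, and starting bound are assumptions made for illustration. It compares the dynamically weighted iteration with the fixed Halley weights (a, b, c) = (3, 1, 3) by tracking a lower bound l on the singular values.

```python
import math


def qdwh_weights(l):
    """Dynamic weights (a, b, c) for a lower bound l in (0, 1] on the singular values."""
    l2 = l * l
    d = (4.0 * (1.0 - l2) / (l2 * l2)) ** (1.0 / 3.0)
    s = math.sqrt(1.0 + d)
    a = s + 0.5 * math.sqrt(8.0 - 4.0 * d + 8.0 * (2.0 - l2) / (l2 * s))
    b = (a - 1.0) ** 2 / 4.0
    c = a + b - 1.0
    return a, b, c


def iterations_to_converge(l0, dynamic=True, tol=1e-10, max_iter=100):
    """Count iterations until the lower bound l is within tol of 1."""
    l = l0
    for k in range(1, max_iter + 1):
        a, b, c = qdwh_weights(l) if dynamic else (3.0, 1.0, 3.0)
        # Apply the rational map to the bound; clamp to 1 to guard against rounding.
        l = min(1.0, l * (a + b * l * l) / (1.0 + c * l * l))
        if 1.0 - l < tol:
            return k
    return max_iter


if __name__ == "__main__":
    l0 = 1e-16  # hypothetical lower bound, i.e. a very ill-conditioned matrix
    print("dynamic weights:", iterations_to_converge(l0, dynamic=True), "iterations")
    print("static Halley:  ", iterations_to_converge(l0, dynamic=False), "iterations")
```

At l = 1 the dynamic weights reduce to the Halley values (3, 1, 3), so the two variants coincide near convergence. The difference lies in how quickly very small singular values are lifted toward 1 in the early iterations.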