Constructing Performance Models for Dense Linear Algebra Algorithms on Cray XE Systems
Hiding or minimizing communication cost is key to obtaining good
performance on large-scale systems. While communication overlapping attempts to
hide communication cost, 2.5D communication-avoiding algorithms improve
performance scalability by reducing the volume of data transfers at the cost of
extra memory usage. Both approaches can be used together or separately and the
best choice depends on the machine, the algorithm and the problem size. Thus,
the development of performance models is crucial to determine the best option
for each scenario. In this paper, we present a methodology for constructing
performance models for parallel numerical routines on Cray XE systems. Our
models use portable benchmarks that measure computational cost and network
characteristics, as well as performance degradation caused by simultaneous
accesses to the network. We validate our methodology by constructing the
performance models for the 2D and 2.5D approaches, with and without
overlapping, of two matrix multiplication algorithms (Cannon's and SUMMA),
triangular solve (TRSM) and Cholesky factorization. We compare the estimates
provided by these models with experimental results obtained on up to 24,576
cores of a Cray XE6 system and predict the performance of the algorithms on
larger systems. The results show that the estimates improve significantly when
network contention is taken into account.
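
To make the 2D versus 2.5D trade-off concrete, the following is a minimal sketch of a leading-order communication estimate for a Cannon-style matrix multiplication. It is not the model constructed in the paper. It assumes a simple latency/bandwidth (alpha-beta) cost per message, ignores the replication and reduction phases of the 2.5D algorithm, and represents network contention as a single hypothetical multiplier on the bandwidth term. All function names, parameter names and numeric values are illustrative assumptions.

```python
import math


def comm_time_2d(n, p, alpha, beta, contention=1.0):
    """Leading-order communication time of a Cannon-style 2D multiply.

    n: matrix dimension, p: number of processes,
    alpha: per-message latency (s), beta: time per word (s),
    contention: hypothetical slowdown factor applied to the bandwidth term.
    """
    steps = math.sqrt(p)                 # shift steps on a sqrt(p) x sqrt(p) grid
    words_per_step = 2 * n * n / p       # one block of A and one block of B
    return steps * (alpha + contention * beta * words_per_step)


def comm_time_25d(n, p, c, alpha, beta, contention=1.0):
    """Same estimate for a 2.5D multiply with c replicated layers.

    Only the shift phase is counted; the initial replication and the
    final reduction across layers are omitted in this sketch.
    """
    steps = math.sqrt(p / c**3)          # each layer performs 1/c of the shifts
    words_per_step = 2 * n * n * c / p   # blocks are c times larger per process
    return steps * (alpha + contention * beta * words_per_step)


def block_words_per_process(n, p, c=1):
    """Words needed to hold one block each of A, B and C (the extra memory cost)."""
    return 3 * n * n * c / p


if __name__ == "__main__":
    # Illustrative values only; they are not measurements from any system.
    n, p, alpha, beta = 32768, 4096, 1e-6, 1e-9
    for c in (1, 4):
        t = comm_time_25d(n, p, c, alpha, beta, contention=1.5)
        mem = block_words_per_process(n, p, c)
        print(f"c={c}: estimated comm time {t:.4f} s, {mem:.0f} words per process")
```

With c = 1 the 2.5D estimate reduces to the 2D one. Larger c lowers the word volume by a factor of sqrt(c) while multiplying the memory footprint by c, which is the trade-off the abstract describes.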