Concurrency and data locality for sparse linear algebra on modern processors

Abstract

Graphics processing units (GPUs) are used as accelerators for algorithms in which the same instructions are carried out on different data. Algorithms for sparse linear algebra can achieve good performance on GPUs, although they tend to have an irregular pattern of memory accesses, and their performance is highly dependent on the input data. In fact, the parallelism these algorithms can achieve is limited by the opportunities for concurrency present in the data. Focusing on the solution of sparse triangular linear systems of equations, this thesis shows that a good partitioning of the data and a good scheduling of the computation can greatly improve performance on GPUs. For this class of algorithms, a partition of the data that maximises concurrency in the execution does not necessarily achieve the best performance. Instead, improving data locality by reducing concurrency reduces the latency of memory accesses and consequently the execution time. First, this work characterises the problem formally using graph theory and performance models. Then, algorithms that can be used effectively to partition the data are described. These algorithms aim to balance concurrency and data locality automatically. This approach is evaluated experimentally on the solution of linear systems with the preconditioned conjugate gradient method. The thesis also shows that the proposed approach can be used when a matrix changes from one iteration of an algorithm to the next, as in the simplex method. In this case, the approach proposed in this thesis makes it possible to update the partition of the matrix between iterations. Finally, the algorithms and performance models developed in the thesis are used to discuss the limitations of accelerating the simplex method with GPUs.
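For context, the partition that "maximises concurrency" for a sparse triangular solve is conventionally obtained by level-set (level-scheduling) analysis: rows with no unresolved dependencies form one level and can be processed in parallel, e.g. one GPU kernel launch per level. The sketch below illustrates that baseline only; it is not the thesis's method, and the function name, CSR layout, and example data are illustrative assumptions.

```python
# Minimal sketch: level-set scheduling for a sparse lower-triangular solve.
# Rows in the same level do not depend on each other and can be processed
# concurrently. Assumes a lower-triangular matrix in CSR form (indptr, indices);
# names and data are illustrative, not taken from the thesis.

def level_sets(indptr, indices, n):
    """Assign each row to the earliest level at which all the rows it
    depends on (its off-diagonal nonzeros) have been resolved."""
    level = [0] * n
    for row in range(n):
        for k in range(indptr[row], indptr[row + 1]):
            col = indices[k]
            if col < row:  # off-diagonal entry: row depends on col
                level[row] = max(level[row], level[col] + 1)
    # Group rows by level; each group is one unit of concurrent work.
    groups = {}
    for row, lvl in enumerate(level):
        groups.setdefault(lvl, []).append(row)
    return [groups[l] for l in sorted(groups)]


# Example: rows 1 and 2 depend on row 0, row 3 depends on row 2,
# giving the levels [[0], [1, 2], [3]].
indptr = [0, 1, 3, 5, 7]
indices = [0, 0, 1, 0, 2, 2, 3]
print(level_sets(indptr, indices, 4))
```

The thesis's argument is that this maximum-concurrency partition is not necessarily optimal on a GPU, because merging or reordering work to improve data locality can reduce memory-access latency more than the lost concurrency costs.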
