Abstract. High-performance BLAS (Basic Linear Algebra Subprograms) routines are in constant demand in numerical computing. We have implemented DL-BLAS (Dynamically Load-balanced BLAS) to improve BLAS performance when other tasks are competing for the CPU resources of a multi-core architecture. DL-BLAS tiles matrices into submatrices to form subtasks and dynamically assigns these subtasks to CPU cores. We found that the dimensions of the submatrices used in DL-BLAS affect performance. To attain high performance, we must therefore solve an optimization problem whose variables are the submatrix dimensions. The search space of this problem is so vast that an exhaustive search is unrealistic. We propose an auto-tuning search algorithm that consists of Diagonal Search and Reductive Search. Our auto-tuning algorithm provides semi-optimal parameters in realistic computing time. With this algorithm, we obtained parameters that gave the best performance in most cases. As a result, DL-BLAS achieved higher performance than ATLAS and GotoBLAS in many of the performance evaluation tests.
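The tiling and dynamic assignment described above can be illustrated with a minimal sketch. It is not the DL-BLAS implementation: the function name `tiled_matmul`, the parameters `tile_m`, `tile_n`, and `workers`, and the use of a shared work queue are all assumptions made for illustration. Python threads share one interpreter lock, so the sketch shows only the scheduling pattern, not a speedup.

```python
import queue
import threading


def tiled_matmul(a, b, tile_m, tile_n, workers=4):
    """Compute C = A * B, one subtask per tile_m x tile_n block of C.

    Hypothetical illustration; names and structure are assumptions.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]

    # Build one subtask per output tile; edge tiles may be smaller.
    tasks = queue.Queue()
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            tasks.put((i0, min(i0 + tile_m, m), j0, min(j0 + tile_n, n)))

    def worker():
        # Each worker pulls the next unclaimed tile. A worker slowed
        # down by other load simply ends up processing fewer tiles.
        while True:
            try:
                i0, i1, j0, j1 = tasks.get_nowait()
            except queue.Empty:
                return
            # Every tile is written by exactly one worker, so no lock
            # is needed on c.
            for i in range(i0, i1):
                for j in range(j0, j1):
                    c[i][j] = sum(a[i][p] * b[p][j] for p in range(k))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return c
```

In this sketch the tile dimensions `tile_m` and `tile_n` play the role of the tuning variables: smaller tiles give finer-grained load balancing at the cost of more scheduling overhead, which is why their choice is posed as an optimization problem.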