    Minimizing Startup Costs for Performance-Critical Threading

    Abstract: Using the well-known ATLAS and LAPACK dense linear algebra libraries, we demonstrate that the parallel management overhead (PMO) can grow with problem size even on statically scheduled parallel programs with minimal task interaction. Therefore, the widely held view that these thread management issues can be ignored in such computationally intensive libraries is wrong, and leads to substantial slowdown on today’s machines. We survey several methods for reducing this overhead, the best of which we have not seen in the literature. Finally, we demonstrate that by applying these techniques at the kernel level, performance in applications such as LU and QR factorizations can be improved by almost 40% for small problems, and by as much as 15% for large O(N^3) computations. These techniques are completely general, and should yield significant speedup in almost any performance-critical operation. We then show that the lion’s share of the remaining parallel inefficiency comes from bus contention, and, in the future work section, outline some promising avenues for further improvement.