
    Graph Expansion and Communication Costs of Fast Matrix Multiplication

    The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. In the sequential case, where the processor has a fast memory of size $M$, too small to store three $n$-by-$n$ matrices, the lower bound on the number of words moved between fast and slow memory is, for many of the matrix multiplication algorithms, $\Omega\left(\left(\frac{n}{\sqrt{M}}\right)^{\omega_0} \cdot M\right)$, where $\omega_0$ is the exponent in the arithmetic count (e.g., $\omega_0 = \lg 7$ for Strassen, and $\omega_0 = 3$ for conventional matrix multiplication). With $p$ parallel processors, each with fast memory of size $M$, the lower bound is $p$ times smaller. These bounds are attainable both for sequential and for parallel algorithms, and hence optimal. They can also be attained by many fast algorithms in linear algebra (e.g., algorithms for LU, QR, and solving the Sylvester equation).
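
    To make the bound concrete, the following sketch (our own illustration, not from the paper; the function name, sample sizes, and printed comparison are assumptions) evaluates the asymptotic expression $(n/\sqrt{M})^{\omega_0} \cdot M$ for Strassen's exponent $\omega_0 = \lg 7$ and for the classical $\omega_0 = 3$:

        from math import log2, sqrt

        def comm_words(n, M, omega0, p=1):
            """Asymptotic communication lower bound (n / sqrt(M))**omega0 * M,
            ignoring the constant hidden in the Omega; divided by p to reflect
            p parallel processors, each with fast memory of size M."""
            return (n / sqrt(M)) ** omega0 * M / p

        n, M = 1 << 14, 1 << 10                  # 16384-by-16384 matrices, 1024-word fast memory
        strassen  = comm_words(n, M, log2(7))    # omega0 = lg 7 ~ 2.807
        classical = comm_words(n, M, 3)          # omega0 = 3
        print(f"Strassen : {strassen:.3e} words")
        print(f"Classical: {classical:.3e} words")
        # The fast algorithm's bound grows more slowly with n, so its lower
        # bound on words moved is asymptotically smaller than the classical one.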

    Impact of mixed-parallelism on parallel implementations of the Strassen and Winograd matrix multiplication algorithms

    In this paper we study the impact of the simultaneous exploitation of data- and task-parallelism, so-called mixed parallelism, on the Strassen and Winograd matrix multiplication algorithms. This work takes place in the context of Grid computing and, in particular, in the Client-Agent(s)-Server(s) model, where data can already be distributed on the platform. For each of these algorithms, we propose two mixed-parallel implementations. The former follows the phases of the original algorithm, while the latter was designed as the result of a list-scheduling algorithm. We give a theoretical comparison, in terms of memory usage and execution time, between our algorithms and classical data-parallel implementations. This analysis is corroborated by experiments. Finally, we give some hints about heterogeneous and recursive versions of our algorithms.
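
    For context, here is a minimal sequential sketch of the Strassen recursion that both papers build on (our own illustration, not the paper's mixed-parallel code; numpy, the cutoff value, and the test sizes are assumptions). The seven recursive products M1..M7 are mutually independent, which is exactly the task-parallelism that mixed-parallel implementations can exploit alongside data-parallel execution of each product:

        import numpy as np

        def strassen(A, B, cutoff=64):
            """Strassen's recursion for square matrices whose order is a power
            of two; falls back to the classical product below the cutoff."""
            n = A.shape[0]
            if n <= cutoff:
                return A @ B
            k = n // 2
            A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
            B11, B12, B21, B22 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]
            # The seven half-size products are independent tasks; a mixed-parallel
            # scheduler can run each one data-parallel on a subset of servers.
            M1 = strassen(A11 + A22, B11 + B22, cutoff)
            M2 = strassen(A21 + A22, B11, cutoff)
            M3 = strassen(A11, B12 - B22, cutoff)
            M4 = strassen(A22, B21 - B11, cutoff)
            M5 = strassen(A11 + A12, B22, cutoff)
            M6 = strassen(A21 - A11, B11 + B12, cutoff)
            M7 = strassen(A12 - A22, B21 + B22, cutoff)
            return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                             [M2 + M4, M1 - M2 + M3 + M6]])

        # Quick check against numpy's matmul on a random 256-by-256 instance:
        rng = np.random.default_rng(0)
        A = rng.standard_normal((256, 256))
        B = rng.standard_normal((256, 256))
        assert np.allclose(strassen(A, B), A @ B)

    Winograd's variant reorganizes the additions around the same seven products (15 additions instead of Strassen's 18), so the same task structure applies to both algorithms studied in the paper.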