Implementing Strassen's Algorithm with BLIS
We dispel "street wisdom" regarding the practical implementation of
Strassen's algorithm for matrix-matrix multiplication (DGEMM). Conventional
wisdom: it is only practical for very large matrices. Our implementation is
practical for small matrices. Conventional wisdom: the matrices being
multiplied should be relatively square. Our implementation is practical for
rank-k updates, where k is relatively small (a shape of importance for
libraries like LAPACK). Conventional wisdom: it inherently requires substantial
workspace. Our implementation requires no workspace beyond buffers already
incorporated into conventional high-performance DGEMM implementations.
Conventional wisdom: a Strassen DGEMM interface must pass in workspace. Our
implementation requires no such workspace and can be plug-compatible with the
standard DGEMM interface. Conventional wisdom: it is hard to demonstrate
speedup on multi-core architectures. Our implementation demonstrates speedup
over conventional DGEMM even on an Intel(R) Xeon Phi(TM) coprocessor utilizing
240 threads. We show how a distributed memory matrix-matrix multiplication also
benefits from these advances.
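
For orientation, the textbook one-level Strassen scheme that the abstract builds on can be sketched as follows. This is a minimal illustration in plain Python, not the BLIS-based implementation: the helper names (`mat_add`, `mat_mul`, `split`, `strassen_one_level`) are hypothetical, all dimensions are assumed even, and the sketch allocates explicit temporaries for the submatrix sums, which is the kind of workspace the abstract says the actual implementation avoids.

```python
def mat_add(X, Y, sign=1):
    """Return X + sign * Y for equally sized matrices (lists of rows)."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def mat_mul(X, Y):
    """Conventional triple-loop product, standing in for a tuned DGEMM."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[p] * Y[p][j] for p in range(inner)) for j in range(cols)]
            for row in X]


def split(X):
    """Split X into quadrants X00, X01, X10, X11 (even dimensions assumed)."""
    h, w = len(X) // 2, len(X[0]) // 2
    return ([r[:w] for r in X[:h]], [r[w:] for r in X[:h]],
            [r[:w] for r in X[h:]], [r[w:] for r in X[h:]])


def strassen_one_level(A, B):
    """One level of Strassen: 7 submatrix products instead of 8."""
    A00, A01, A10, A11 = split(A)
    B00, B01, B10, B11 = split(B)

    # The seven products of the classical formulation.
    M0 = mat_mul(mat_add(A00, A11), mat_add(B00, B11))
    M1 = mat_mul(mat_add(A10, A11), B00)
    M2 = mat_mul(A00, mat_add(B01, B11, -1))
    M3 = mat_mul(A11, mat_add(B10, B00, -1))
    M4 = mat_mul(mat_add(A00, A01), B11)
    M5 = mat_mul(mat_add(A10, A00, -1), mat_add(B00, B01))
    M6 = mat_mul(mat_add(A01, A11, -1), mat_add(B10, B11))

    # Combine the products into the four quadrants of C.
    C00 = mat_add(mat_add(mat_add(M0, M3), M4, -1), M6)
    C01 = mat_add(M2, M4)
    C10 = mat_add(M1, M3)
    C11 = mat_add(mat_add(mat_add(M0, M1, -1), M2), M5)

    # Stitch the quadrants back together row by row.
    top = [left + right for left, right in zip(C00, C01)]
    bottom = [left + right for left, right in zip(C10, C11)]
    return top + bottom


# A rank-2 update shape: A is 4x2 and B is 2x4, so C is 4x4.
A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 0, 2, 1], [0, 1, 1, 3]]
C = strassen_one_level(A, B)
```

The example uses a small inner dimension on purpose, since the abstract singles out rank-k updates with small k as a shape of interest. Nothing about the scheme requires square operands, only that each dimension can be halved.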