We explore the utilization of the Apache TVM open source framework to
automatically generate a family of algorithms that follow the approach taken by
popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS, in
order to obtain high-performance blocked formulations of the general matrix
multiplication (GEMM). % In addition, we fully automatize the generation
process, by also leveraging the Apache TVM framework to derive a complete
variety of the processor-specific micro-kernels for GEMM. This is in contrast
with the convention in high performance libraries, which hand-encode a single
micro-kernel per architecture using Assembly code. % In global, the combination
of our TVM-generated blocked algorithms and micro-kernels for GEMM 1)~improves
portability, maintainability and, globally, streamlines the software life
cycle; 2)~provides high flexibility to easily tailor and optimize the solution
to different data types, processor architectures, and matrix operand shapes,
yielding performance on a par (or even superior for specific matrix shapes)
with that of hand-tuned libraries; and 3)~features a small memory footprint.Comment: 35 pages, 22 figures. Submitted to ACM TOM