arXiv:1912.04379v1 [cs.PF] 18 Nov 2019
General Matrix-Matrix Multiplication using
SIMD features of the PIII
Douglas Aberdeen and Jonathan Baxter
Research School of Information Sciences and Engineering
Australian National University
Abstract. Generalised matrix-matrix multiplication forms the kernel of
many mathematical algorithms. A faster matrix-matrix multiply imme-
diately benefits these algorithms. In this paper we implement efficient
matrix multiplication for large matrices using the floating point SIMD
(Single Instruction Multiple Data) features of the Intel Pentium III. A
description of the issues and our solution is presented, paying attention
to all levels of the memory hierarchy. Our results demonstrate performance
averaging 2.09 times that of the leading public domain matrix-matrix
multiply routines.
1 Introduction
A range of applications, such as artificial neural networks, benefit from GEMM
(generalised matrix-matrix multiplication) routines that run as fast as possible. The
challenge is to exploit the CPU's peak floating point performance when memory
access is fundamentally slow. The SSE (Streaming SIMD Extensions) instructions
of the Intel Pentium III allow four 32-bit (single precision) floating
point operations to be performed simultaneously. Consequently, efficient use of
the memory hierarchy is critical to being able to supply data fast enough to
keep the CPU fully utilised. In this paper we focus on the implementation of
an efficient algorithm for the Pentium SIMD architecture to achieve fast, large
matrix-matrix multiplies. Our code has been nicknamed Emmerald.
Without resorting to the complexities associated with implementing Strassen’s
algorithm on deep-memory hierarchy machines [5], dense matrix-matrix mul-
tiplication requires 2MNK floating point operations where A : M × K and
B : K ×N define the dimensions of the two matrices. Although this complexity
is fixed, skillful use of the memory hierarchy can dramatically reduce overheads
not directly associated with floating point operations. It is the optimization of
memory hierarchy combined with the SSE that gives Emmerald its performance.
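To make the operation count concrete, the following is a minimal sketch of the unblocked three-loop SGEMM (the kind of "naive" baseline compared against in Section 4), assuming row-major storage; the function name and signature are illustrative and are not part of Emmerald.

```c
#include <stddef.h>

/* Naive three-loop single-precision GEMM, C = A * B, with A: M x K and
 * B: K x N stored row-major with the given leading dimensions (strides).
 * It performs the full 2*M*N*K floating point operations with no attention
 * to the memory hierarchy; this is the baseline a blocked, SSE-based
 * kernel is measured against. */
void naive_sgemm(size_t M, size_t N, size_t K,
                 const float *A, size_t lda,
                 const float *B, size_t ldb,
                 float *C, size_t ldc)
{
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < K; k++)
                sum += A[i * lda + k] * B[k * ldb + j];  /* one multiply-add */
            C[i * ldc + j] = sum;
        }
    }
}
```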
Emmerald implements the SGEMM interface of Level-3 BLAS, and so may
be used immediately to improve the performance of single-precision libraries
based on BLAS (such as LAPACK [4]). There have been several recent attempts
at automatic optimization of GEMM for deep-memory hierarchy machines, the most
notable being PHiPAC [3] and the more recent ATLAS [6]. ATLAS in particular
achieves performance close to vendor-optimized commercial GEMMs. Neither
ATLAS nor PHiPAC makes use of the SSE instructions on the PIII in its
implementation of SGEMM.
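For reference, the Level-3 BLAS SGEMM routine that Emmerald implements computes C := alpha*op(A)*op(B) + beta*C. The prototype below is the conventional way to declare the Fortran BLAS routine when calling it from C; the trailing underscore and pass-by-pointer arguments follow the usual Fortran calling convention, and the exact name mangling is compiler dependent.

```c
/* Conventional C prototype for the Fortran Level-3 BLAS SGEMM routine,
 * which computes C := alpha*op(A)*op(B) + beta*C, where op(X) is X or
 * its transpose depending on transa/transb. */
void sgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const float *alpha, const float *a, const int *lda,
            const float *b, const int *ldb,
            const float *beta, float *c, const int *ldc);
```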
Our experiments showed that ATLAS achieves a peak of 375 MFlops/s for
single-precision multiplies on a PIII @ 450 MHz, or 0.83×clock rate. Our matrix-
matrix multiply using SIMD instructions achieves a peak of 890 MFlops/s, or
1.98×clock rate. We also report an application with a price/performance ratio
under USD $1/MFlop/s. The following section will describe our novel use of
the SSE for Emmerald, followed by a description of optimizations designed to
improve use of the memory hierarchy. The paper concludes with a comparison
of results between ATLAS and Emmerald.
2 SIMD Parallelisation
Two core strategies are employed to minimise the ratio of memory accesses to
floating point operations: accumulate results in registers for as long as possible
to reduce write backs, and re-use values in registers as much as possible. In
[2] several dot-products were performed in parallel as the innermost loop of
the GEMM. We took the same approach and found experimentally that 5 dot-
products in the inner loop gave the best performance. Figure 1(a) shows how
these 5 dot products utilise SIMD parallelism.
Four values from a row of A are loaded into an SSE register. This is re-used
five times by doing SIMD multiplication with four values from each of the five
columns of B used in the inner loop. Two SSE registers are allocated to loading
values from B. Results are accumulated into the remaining five SSE registers.
When the dot-products complete, each SSE result register contains four partial dot-
product sums. These are summed horizontally and then written back to memory.
For the best efficiency, the dot product length is maximised with the constraint
that all data must fit into L1 cache.
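To make the register allocation concrete, the sketch below expresses the five simultaneous dot-products with SSE intrinsics. It is a sketch only: Emmerald itself uses hand-scheduled assembler, the names (dot5, b0..b4), the contiguous layout of the five B′ columns, and the assumption that the dot-product length is a multiple of four are ours for illustration.

```c
#include <xmmintrin.h>   /* SSE intrinsics (Pentium III) */

/* Sketch: five dot-products of length k computed in parallel with SSE.
 * a points to one row of A' and b0..b4 to five columns of the re-buffered
 * B', each assumed contiguous in memory; k is assumed to be a multiple
 * of 4.  The five finished dot-products are accumulated into c[0..4]. */
void dot5(const float *a,
          const float *b0, const float *b1, const float *b2,
          const float *b3, const float *b4,
          float c[5], int k)
{
    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps(), acc3 = _mm_setzero_ps();
    __m128 acc4 = _mm_setzero_ps();          /* five SSE accumulators */

    for (int i = 0; i < k; i += 4) {
        __m128 av = _mm_loadu_ps(a + i);     /* 4 values from the row of A, re-used 5 times */
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(av, _mm_loadu_ps(b0 + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(av, _mm_loadu_ps(b1 + i)));
        acc2 = _mm_add_ps(acc2, _mm_mul_ps(av, _mm_loadu_ps(b2 + i)));
        acc3 = _mm_add_ps(acc3, _mm_mul_ps(av, _mm_loadu_ps(b3 + i)));
        acc4 = _mm_add_ps(acc4, _mm_mul_ps(av, _mm_loadu_ps(b4 + i)));
    }

    /* Each accumulator now holds four partial sums: reduce horizontally
     * and write the five finished dot-products back. */
    __m128 acc[5] = { acc0, acc1, acc2, acc3, acc4 };
    for (int j = 0; j < 5; j++) {
        float t[4];
        _mm_storeu_ps(t, acc[j]);
        c[j] += t[0] + t[1] + t[2] + t[3];
    }
}
```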
3 Memory Hierarchy Optimizations
A number of standard techniques are used in Emmerald to improve performance.
Briefly, they include:
– L1 blocking: Emmerald uses matrix blocking [2,3,6] to ensure the inner loop
operates on data in L1 cache. Figure 1(b) shows the L1 blocking scheme, and a
loop-nest sketch follows this list. The block dimensions m and n are determined
by the configuration of dot-products in the inner loop (Section 2), and k was
determined experimentally.
– Unrolling: The innermost loop is completely unrolled for all possible lengths
of k in L1 cache blocks, taking care to avoid overflowing the instruction
cache.
– Re-buffering: Since B′ is large (336 × 5) compared to A′ (1 × 336), we de-
liberately buffer B′ into L1 cache. By also re-ordering B to enforce optimal
memory access patterns we minimise translation look-aside buffer misses [6].
Fig. 1. (a) Allocation of the eight SSE registers (xmm[0-7]), showing the progression of the dot products which form the innermost loop of the algorithm. Each circle is an element in the matrix; each dashed square represents one floating point value in an SSE register. (b) L1 blocking for Emmerald: C′ ← A′B′, where A′ (m = 1, k = 336) and B′ (k = 336, n = 5, buffered) are in L1 cache and C′ is accumulated in registers.
– Pre-fetching: Values from A′ are not in L1 cache. We make use of SSE pre-
fetch assembler instructions to bring A′ values into L1 cache when needed.
– L2 Blocking: Efficient L2 cache blocking ensures that peak rates can be
maintained as long as A, B and C fit into main memory.
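Putting the L1 blocking and pre-fetching items together, the loop nest below sketches how one L1 block of C might be computed. It assumes the dot5 kernel sketched in Section 2, a contiguous re-ordered B′ buffer, and the _mm_prefetch intrinsic standing in for Emmerald's hand-placed prefetch instructions; the names, signature and block sizes are illustrative, and only the structure matters.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Five-at-a-time dot-product kernel, as sketched in Section 2. */
void dot5(const float *a, const float *b0, const float *b1, const float *b2,
          const float *b3, const float *b4, float c[5], int k);

/* Sketch of the L1-blocked loop structure (not Emmerald's actual code).
 * Bbuf holds the re-ordered B block: each column of length Kblk is
 * contiguous, and the whole buffer fits in L1.  Nblk is assumed to be a
 * multiple of 5 and Kblk (e.g. 336) a multiple of 4. */
void l1_block(const float *A, size_t lda, const float *Bbuf,
              float *C, size_t ldc,
              size_t Mblk, size_t Nblk, size_t Kblk)
{
    for (size_t j = 0; j < Nblk; j += 5) {            /* n = 5 columns of B' */
        const float *bpanel = Bbuf + j * Kblk;        /* contiguous 5 x Kblk panel */
        for (size_t i = 0; i < Mblk; i++) {           /* m = 1 row of A' at a time */
            const float *arow = A + i * lda;
            if (i + 1 < Mblk)                         /* hint the next row of A into cache */
                _mm_prefetch((const char *)(arow + lda), _MM_HINT_T0);
            float c5[5] = {0, 0, 0, 0, 0};
            dot5(arow,
                 bpanel, bpanel + Kblk, bpanel + 2 * Kblk,
                 bpanel + 3 * Kblk, bpanel + 4 * Kblk,
                 c5, (int)Kblk);
            for (int v = 0; v < 5; v++)
                C[i * ldc + j + v] += c5[v];          /* accumulate into the C block */
        }
    }
}
```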
4 Results
The performance of Emmerald was measured by timing matrix multiply calls
with sizes from M = N = K = 16 up to 700. The following steps were taken to ensure
a conservative performance estimate: wall clock time on an unloaded machine
is used rather than CPU time; the stride of the matrices (which determines the
separation in memory between successive rows of matrix data) is fixed at 700 rather
than the length of the row; and caches are flushed between calls to sgemm(). Timings
were performed on a PIII 450 MHz running Linux (kernel 2.2.12).
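The sketch below illustrates the kind of timing harness this methodology implies: wall clock time via POSIX gettimeofday, a fixed stride of 700, and a cache flush before each call. The flush buffer size, the run_sgemm wrapper, and all names are illustrative assumptions, not the harness actually used.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define STRIDE 700                    /* fixed leading dimension for all sizes  */
#define FLUSH_BYTES (2 * 1024 * 1024) /* chosen larger than the PIII's L2 cache */

/* Hypothetical wrapper around the sgemm() implementation under test. */
void run_sgemm(int m, int n, int k, const float *A, int lda,
               const float *B, int ldb, float *C, int ldc);

static double wall_seconds(void)      /* wall clock, not CPU time */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

static void flush_caches(void)        /* evict A, B and C between calls */
{
    static volatile char junk[FLUSH_BYTES];
    for (size_t i = 0; i < sizeof junk; i++)
        junk[i]++;
}

/* Time one square multiply of dimension n and return the MFlop/s rate. */
static double time_sgemm(int n, const float *A, const float *B, float *C)
{
    flush_caches();
    double t0 = wall_seconds();
    run_sgemm(n, n, n, A, STRIDE, B, STRIDE, C, STRIDE);
    double secs = wall_seconds() - t0;
    return (2.0 * n * n * n) / (secs * 1e6);   /* 2*M*N*K flops */
}
```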
Figure 2 shows Emmerald’s performance compared to ATLAS and a naive
three-loop matrix multiply. The average MFlop/s rate of Emmerald beyond size 100
is 1.69 times the clock rate of the processor and 2.09 times that of ATLAS.
A peak rate of 890 MFlops/s is achieved when M = N = K = stride = 320, which
represents 1.98 times the clock rate. The largest size tested was M = N = K =
stride = 3696 on a 550 MHz machine, which ran at 940 MFlops/s.
We have used Emmerald in distributed training of large neural networks with
more than one million adjustable parameters and a similar number of training
examples [1]. By distributing training over 196 Intel Pentium III 550 MHz proces-
sors, and using Emmerald as the kernel of the training procedure, we achieved
a sustained performance of 152 GFlops/s for a price/performance ratio of 98¢
USD per MFlop/s (single precision).
[Plot: MFlop/s at 450 MHz (y-axis, 0 to 900) versus matrix dimension (x-axis, 0 to 700) for Emmerald, ATLAS, and the naive routine.]
Fig. 2. Performance of Emmerald on a PIII running at 450 MHz compared to
ATLAS sgemm and a naive 3-loop matrix multiply. Note that ATLAS does not
make use of the PIII SSE instructions.
5 Conclusion
This paper has presented Emmerald, a version of SGEMM that utilises the SIMD
instructions and cache hierarchy of the Intel PIII architecture. An application
demonstrating the cost-effectiveness of such work was also reported. The code
and the full version of this paper are available from http://beaker.anu.edu.au/
research.html.
References
1. D. Aberdeen, J. Baxter, and R. Edwards. 98¢/MFlop, ultra-large-scale neural-network training on a PIII cluster. Submitted to SC2000, May 2000.
2. B. Greer and G. Henry. High performance software on Intel Pentium Pro processors or Micro-Ops to TeraFLOPS. Technical report, Intel, August 1997. http://www.cs.utk.edu/~ghenry/sc97/paper.htm.
3. J. Bilmes, K. Asanovic, J. Demmel, D. Lam, and C. W. Chin. PHiPAC: A portable, high-performance, ANSI C coding methodology and its application to matrix multiply. Technical report, University of Tennessee, August 1996. http://www.icsi.berkeley.edu/~bilmes/phipac.
4. Netlib. Basic Linear Algebra Subroutines, November 1998. http://www.netlib.org/blas/index.html.
5. M. Thottethodi, S. Chatterjee, and A. R. Lebeck. Tuning Strassen's matrix multiplication for memory efficiency. In Supercomputing, 1998.
6. R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimizations of software and the ATLAS project. Technical report, Computer Science Department, University of Tennessee, March 2000. http://www.cs.utk.edu/~rwhaley/ATLAS/atlas.html.
