Vectorized register tiling by Berna Juan, Alejandro et al.
VECTORIZED REGISTER TILING
Abstract:
In recent years, much effort has gone into making commercial compilers (icc, gcc)
exploit efficiently the SIMD capabilities and the memory hierarchy of current
processors. However, the few compilers that can exploit these features automatically
achieve unsatisfactory results in most cases. Programmers therefore often have to
apply the optimizations to the source code by hand, write the code in assembly, or
use compiler built-in functions (such as intrinsics) to achieve high performance.
In this work, we present source-to-source transformations, which can be applied in
an automated way, that help commercial compilers exploit the memory hierarchy and
generate efficient SIMD code. Our experimental results show that our solutions
achieve performance as good as that of hand-optimized, vendor-supplied numerical
libraries (written in assembly). Our source-to-source transformations are based on
the tiling, strip-mining, scalar replacement and unroll and jam transformations. In
particular, we apply these transformations to loop nests and show their
effectiveness: tiling lets us exploit reuse at the register level, strip-mining lets
us apply outer loop vectorization, unroll and jam lets us unroll vectorized outer
loops and, finally, scalar replacement is applied to vectorized loops to obtain what
we call vector replacement (scalar replacement applied to vector registers). We have
compared the performance obtained with these transformations against that of MKL and
ATLAS, and conclude that it is possible to achieve high performance in numerical
applications by applying only source-to-source transformations and letting the
compiler do the low-level optimization work.
This research has been supported by an Intel-UPC Research Grant, the Spanish Ministry of Education (contract no. TIN2007-60625), and the European Union (under the HiPEAC-2 Network of Excellence, FP7/ICT 217068).
Conclusions:
To achieve high performance, compilers should:
 1. Generate efficient SIMD code
 2. Exploit the memory hierarchy efficiently
Source-to-source transformations help the compiler to generate efficient SIMD code.
References:
A. J. C. Bik. "The Software Vectorization Handbook: Applying Multimedia Extensions for Maximum Performance". Intel Press, 2004.
A. V. Aho, M. Lam, R. Sethi, J. D. Ullman. "Compilers: Principles, Techniques and Tools". Addison Wesley, 2008.
D. Callahan, S. Carr, K. Kennedy. "Improving register allocation for subscripted variables". PLDI'1990, pp. 53-65, June 1990.
D. Nuzman, A. Zaks. "Outer-loop vectorization: revisited for short SIMD architectures". PACT'2008, pp. 2-11.
R. C. Whaley, A. Petitet, J. Dongarra. "Automated Empirical Optimization of Software and the ATLAS project". Parallel Computing, 27(1-2):3-35, 2001.
matrix product
*Scalar replacement applied to vector registers
icc does not apply
*Strip-mining already applied by the tiling transformation
register tile
Code A: Register Tiling (BI=2, BJ=8), Scalar execution, Scalar Replacement.
register float C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, C11, C12, C13, C14, C15, C16;
for (ii = 0; ii < dimi; ii+=BI)
  for (jj = 0; jj < dimj; jj+=BJ)
  {
    C1 = C[ii*dimj + jj]; C2 = C[ii*dimj + jj+1]; … C8 = C[ii*dimj + jj+7];
    C9 = C[(ii+1)*dimj + jj]; C10 = C[(ii+1)*dimj + jj+1]; … C16 = C[(ii+1)*dimj + jj+7];
    for (k = 0; k < dimk; k++)
    {
      C1 += A[ii*dimk + k]*B[k*dimj + jj];
      C2 += A[ii*dimk + k]*B[k*dimj + jj+1];
      …
      C8 += A[ii*dimk + k]*B[k*dimj + jj+7];
      C9 += A[(ii+1)*dimk + k]*B[k*dimj + jj];
      C10 += A[(ii+1)*dimk + k]*B[k*dimj + jj+1];
      …
      C16 += A[(ii+1)*dimk + k]*B[k*dimj + jj+7];
    }
    C[ii*dimj + jj] = C1; C[ii*dimj + jj+1] = C2; … C[ii*dimj + jj+7] = C8;
    C[(ii+1)*dimj + jj] = C9; C[(ii+1)*dimj + jj+1] = C10; … C[(ii+1)*dimj + jj+7] = C16;
  }
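Because Code A elides most of its accumulators, a complete but smaller instance may help to see the pattern. The sketch below is an illustration, not the poster's code: it assumes a 2x2 register tile instead of BI=2, BJ=8, also keeps the A and B operands in scalars, assumes that dimi and dimj are multiples of the tile size, and introduces the hypothetical helpers matmul_ref and tiled_2x2_max_error only to compare against a plain loop nest.

```c
#include <assert.h>

/* Reference: plain i-j-k matrix product, C += A*B, row-major storage. */
static void matmul_ref(const float *A, const float *B, float *C,
                       int dimi, int dimj, int dimk)
{
    for (int i = 0; i < dimi; i++)
        for (int j = 0; j < dimj; j++)
            for (int k = 0; k < dimk; k++)
                C[i*dimj + j] += A[i*dimk + k] * B[k*dimj + j];
}

/* Register tiling with an assumed 2x2 tile plus scalar replacement:
   the tile of C is held in scalars for the whole k loop. */
static void matmul_tiled_2x2(const float *A, const float *B, float *C,
                             int dimi, int dimj, int dimk)
{
    assert(dimi % 2 == 0 && dimj % 2 == 0);  /* no remainder handling */
    for (int ii = 0; ii < dimi; ii += 2)
        for (int jj = 0; jj < dimj; jj += 2) {
            /* Load the C tile once, before the k loop. */
            float c00 = C[ii*dimj + jj];
            float c01 = C[ii*dimj + jj+1];
            float c10 = C[(ii+1)*dimj + jj];
            float c11 = C[(ii+1)*dimj + jj+1];
            for (int k = 0; k < dimk; k++) {
                float a0 = A[ii*dimk + k];
                float a1 = A[(ii+1)*dimk + k];
                float b0 = B[k*dimj + jj];
                float b1 = B[k*dimj + jj+1];
                c00 += a0*b0;  c01 += a0*b1;
                c10 += a1*b0;  c11 += a1*b1;
            }
            /* Store the C tile once, after the k loop. */
            C[ii*dimj + jj]       = c00;
            C[ii*dimj + jj+1]     = c01;
            C[(ii+1)*dimj + jj]   = c10;
            C[(ii+1)*dimj + jj+1] = c11;
        }
}

/* Hypothetical self-check: largest absolute difference between the two
   versions on a small integer-valued input (so the float sums are exact). */
float tiled_2x2_max_error(void)
{
    enum { DI = 4, DJ = 6, DK = 5 };
    float A[DI*DK], B[DK*DJ], Cref[DI*DJ], Ctil[DI*DJ];
    for (int n = 0; n < DI*DK; n++) A[n] = (float)(n % 7 - 3);
    for (int n = 0; n < DK*DJ; n++) B[n] = (float)(n % 5 - 2);
    for (int n = 0; n < DI*DJ; n++) Cref[n] = Ctil[n] = (float)(n % 3);
    matmul_ref(A, B, Cref, DI, DJ, DK);
    matmul_tiled_2x2(A, B, Ctil, DI, DJ, DK);
    float err = 0.0f;
    for (int n = 0; n < DI*DJ; n++) {
        float d = Cref[n] - Ctil[n];
        if (d < 0.0f) d = -d;
        if (d > err) err = d;
    }
    return err;
}
```

Each element of C is read and written once per tile rather than once per k iteration, which is the reuse that the register tile is meant to expose.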
icc does not vectorize outer loops
*Also works with vectors
Code B: Register Tiling, SIMD
for (ii = 0; ii < dimi; ii+=BI)
  for (jj = 0; jj < dimj; jj+=BJ)
    for (k = 0; k < dimk; k++)
      #pragma ivdep
      for (vj = jj; vj < jj + 4; vj++)
      {
        C[ii*dimj + vj] += A[ii*dimk + k] * B[k*dimj + vj];
        C[(ii+1)*dimj + vj] += A[(ii+1)*dimk + k] * B[k*dimj + vj];
        C[ii*dimj + vj+4] += A[ii*dimk + k] * B[k*dimj + vj+4];
        C[(ii+1)*dimj + vj+4] += A[(ii+1)*dimk + k] * B[k*dimj + vj+4];
      }
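To try Code B outside the poster, it can be wrapped in a function as in the sketch below. Several details are assumptions rather than part of the original: BI=2, BJ=8 and a vector length of 4 floats are taken from the loop bounds and from Code A, the pragma is guarded so that non-Intel compilers do not see it, the dimensions are assumed to be multiples of the tile sizes, and matmul_ref and code_b_max_error are hypothetical helpers used only for checking.

```c
#include <assert.h>

enum { BI = 2, BJ = 8, VL = 4 };  /* assumed tile sizes and vector length */

/* Reference: plain i-j-k matrix product, C += A*B, row-major storage. */
static void matmul_ref(const float *A, const float *B, float *C,
                       int dimi, int dimj, int dimk)
{
    for (int i = 0; i < dimi; i++)
        for (int j = 0; j < dimj; j++)
            for (int k = 0; k < dimk; k++)
                C[i*dimj + j] += A[i*dimk + k] * B[k*dimj + j];
}

/* Code B as a function: the innermost vj loop runs over VL contiguous
   elements, so each statement maps onto one vector operation. */
void matmul_code_b(const float *A, const float *B, float *C,
                   int dimi, int dimj, int dimk)
{
    assert(dimi % BI == 0 && dimj % BJ == 0);  /* no remainder handling */
    for (int ii = 0; ii < dimi; ii += BI)
        for (int jj = 0; jj < dimj; jj += BJ)
            for (int k = 0; k < dimk; k++) {
#if defined(__INTEL_COMPILER)
#pragma ivdep
#endif
                for (int vj = jj; vj < jj + VL; vj++) {
                    C[ii*dimj + vj]        += A[ii*dimk + k]     * B[k*dimj + vj];
                    C[(ii+1)*dimj + vj]    += A[(ii+1)*dimk + k] * B[k*dimj + vj];
                    C[ii*dimj + vj+VL]     += A[ii*dimk + k]     * B[k*dimj + vj+VL];
                    C[(ii+1)*dimj + vj+VL] += A[(ii+1)*dimk + k] * B[k*dimj + vj+VL];
                }
            }
}

/* Hypothetical self-check against the reference on integer-valued data. */
float code_b_max_error(void)
{
    enum { DI = 4, DJ = 16, DK = 3 };
    float A[DI*DK], B[DK*DJ], Cref[DI*DJ], Cb[DI*DJ];
    for (int n = 0; n < DI*DK; n++) A[n] = (float)(n % 7 - 3);
    for (int n = 0; n < DK*DJ; n++) B[n] = (float)(n % 5 - 2);
    for (int n = 0; n < DI*DJ; n++) Cref[n] = Cb[n] = (float)(n % 3);
    matmul_ref(A, B, Cref, DI, DJ, DK);
    matmul_code_b(A, B, Cb, DI, DJ, DK);
    float err = 0.0f;
    for (int n = 0; n < DI*DJ; n++) {
        float d = Cref[n] - Cb[n];
        if (d < 0.0f) d = -d;
        if (d > err) err = d;
    }
    return err;
}
```

Note that in this form the tile of C is still read from and written to memory in every k iteration; removing that traffic is the job of vector replacement.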
VL = vector register length
icc behaviour
We observe that icc:
 1. only performs inner loop vectorization
 2. does not unroll strip-mined vectorized loops
 3. does not apply vector replacement* (VR) in the unrolled loop body
We solve this by applying:
 1. strip-mining to outer loops to expose the vector statements as inner loops
 2. unroll & jam to the strip-mined loop in the source code
 3. identification of adjacent array references with pointer variables to expose vector register reuse
example:
for (i=0; i<dimi; i+=2*VL)
  for (j=0; j<dimj; j++)
    #pragma vector always
float *A1, *A2;
A1 = A; A2 = A1 + VL;
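Since the example loop body did not survive, the following sketch shows how strip-mining, unroll & jam and pointer variables might be combined on a stand-in kernel. Everything specific here is an assumption: the kernel (y[i] += M[j*dimi + i] * x[j], with i contiguous in memory), VL = 4, the unroll factor of 2, the requirement that dimi be a multiple of 2*VL, and all function names.

```c
#include <assert.h>

enum { VL = 4 };  /* assumed: 4 floats per vector register */

/* Original nest: the inner j loop is a strided reduction, so inner-loop
   vectorization is unattractive; the outer i loop is the one to vectorize. */
static void matvec_ref(const float *M, const float *x, float *y,
                       int dimi, int dimj)
{
    for (int i = 0; i < dimi; i++)
        for (int j = 0; j < dimj; j++)
            y[i] += M[j*dimi + i] * x[j];
}

/* Step 1: strip-mine i by VL and sink the strip loop to the innermost
   position, so the vector statement appears as an inner loop. */
static void matvec_stripmined(const float *M, const float *x, float *y,
                              int dimi, int dimj)
{
    assert(dimi % VL == 0);
    for (int i = 0; i < dimi; i += VL)
        for (int j = 0; j < dimj; j++)
            for (int vi = 0; vi < VL; vi++)
                y[i + vi] += M[j*dimi + i + vi] * x[j];
}

/* Step 2: unroll the strip-mined i loop by 2 and jam the copies; adjacent
   references are named through pointers that are VL elements apart. */
static void matvec_unroll_jam(const float *M, const float *x, float *y,
                              int dimi, int dimj)
{
    assert(dimi % (2*VL) == 0);
    for (int i = 0; i < dimi; i += 2*VL) {
        float *y1 = y + i, *y2 = y1 + VL;
        for (int j = 0; j < dimj; j++) {
            const float *M1 = M + j*dimi + i, *M2 = M1 + VL;
            for (int vi = 0; vi < VL; vi++) {
                y1[vi] += M1[vi] * x[j];
                y2[vi] += M2[vi] * x[j];
            }
        }
    }
}

/* Hypothetical self-check: both transformed versions against the original. */
float unroll_jam_max_error(void)
{
    enum { DI = 16, DJ = 5 };
    float M[DJ*DI], x[DJ], yr[DI], ys[DI], yu[DI];
    for (int n = 0; n < DJ*DI; n++) M[n] = (float)(n % 7 - 3);
    for (int n = 0; n < DJ; n++) x[n] = (float)(n - 2);
    for (int n = 0; n < DI; n++) yr[n] = ys[n] = yu[n] = (float)(n % 3);
    matvec_ref(M, x, yr, DI, DJ);
    matvec_stripmined(M, x, ys, DI, DJ);
    matvec_unroll_jam(M, x, yu, DI, DJ);
    float err = 0.0f;
    for (int n = 0; n < DI; n++) {
        float d1 = yr[n] - ys[n], d2 = yr[n] - yu[n];
        if (d1 < 0.0f) d1 = -d1;
        if (d2 < 0.0f) d2 = -d2;
        if (d1 > err) err = d1;
        if (d2 > err) err = d2;
    }
    return err;
}
```

In the unrolled version the value x[j] is shared by both vector statements, which is the kind of reuse the pointer form is meant to make visible to the compiler.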












Perform the following transformations to matrix product:
 - Tiling at the register level
VR7 = 4 copies of  A[ii*dimk + k];
VR8 = 4 copies of  A[(ii+1)*dimk + k];
VR1 = VR1 + VR5*VR7;
VR2 = VR2 + VR6*VR7;
VR3 = VR3 + VR5*VR8;
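The VR fragment above can be read as the k-loop body of a vector-replaced 2x8 tile. The sketch below is one possible reading, not the poster's Code C: it assumes that VR1 to VR4 hold the C tile, that VR5 and VR6 are the two B vectors (accessed here through B1 and B2), that a fourth update VR4 = VR4 + VR6*VR8 completes the pattern, and that VL = 4. Short local arrays stand in for vector registers, the scalars a1 and a2 play the role of the broadcast registers VR7 and VR8, and matmul_ref and vector_replacement_max_error are hypothetical helpers.

```c
#include <assert.h>
#include <string.h>

enum { BI = 2, BJ = 8, VL = 4 };  /* assumed tile sizes and vector length */

/* Reference: plain i-j-k matrix product, C += A*B, row-major storage. */
static void matmul_ref(const float *A, const float *B, float *C,
                       int dimi, int dimj, int dimk)
{
    for (int i = 0; i < dimi; i++)
        for (int j = 0; j < dimj; j++)
            for (int k = 0; k < dimk; k++)
                C[i*dimj + j] += A[i*dimk + k] * B[k*dimj + j];
}

/* Vector replacement: the C tile is loaded into vector-sized temporaries
   before the k loop and stored back once after it. */
void matmul_vector_replacement(const float *A, const float *B, float *C,
                               int dimi, int dimj, int dimk)
{
    assert(dimi % BI == 0 && dimj % BJ == 0);  /* no remainder handling */
    for (int ii = 0; ii < dimi; ii += BI)
        for (int jj = 0; jj < dimj; jj += BJ) {
            float VR1[VL], VR2[VL], VR3[VL], VR4[VL];
            float *C1 = &C[ii*dimj + jj],     *C2 = C1 + VL;
            float *C3 = &C[(ii+1)*dimj + jj], *C4 = C3 + VL;
            const float *A1 = &A[ii*dimk], *A2 = &A[(ii+1)*dimk];
            memcpy(VR1, C1, sizeof VR1);  memcpy(VR2, C2, sizeof VR2);
            memcpy(VR3, C3, sizeof VR3);  memcpy(VR4, C4, sizeof VR4);
            for (int k = 0; k < dimk; k++) {
                const float *B1 = &B[k*dimj + jj], *B2 = B1 + VL;
                float a1 = A1[k];  /* broadcast value, VR7 in the fragment */
                float a2 = A2[k];  /* broadcast value, VR8 in the fragment */
                for (int v = 0; v < VL; v++) {
                    VR1[v] += B1[v] * a1;
                    VR2[v] += B2[v] * a1;
                    VR3[v] += B1[v] * a2;
                    VR4[v] += B2[v] * a2;  /* assumed fourth update */
                }
            }
            memcpy(C1, VR1, sizeof VR1);  memcpy(C2, VR2, sizeof VR2);
            memcpy(C3, VR3, sizeof VR3);  memcpy(C4, VR4, sizeof VR4);
        }
}

/* Hypothetical self-check against the reference on integer-valued data. */
float vector_replacement_max_error(void)
{
    enum { DI = 4, DJ = 16, DK = 3 };
    float A[DI*DK], B[DK*DJ], Cref[DI*DJ], Cvr[DI*DJ];
    for (int n = 0; n < DI*DK; n++) A[n] = (float)(n % 7 - 3);
    for (int n = 0; n < DK*DJ; n++) B[n] = (float)(n % 5 - 2);
    for (int n = 0; n < DI*DJ; n++) Cref[n] = Cvr[n] = (float)(n % 3);
    matmul_ref(A, B, Cref, DI, DJ, DK);
    matmul_vector_replacement(A, B, Cvr, DI, DJ, DK);
    float err = 0.0f;
    for (int n = 0; n < DI*DJ; n++) {
        float d = Cref[n] - Cvr[n];
        if (d < 0.0f) d = -d;
        if (d > err) err = d;
    }
    return err;
}
```

Whether a given compiler actually keeps the four temporaries in vector registers depends on the compiler and its flags; the source-level form only makes that mapping possible.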







Source-to-source transformations using:
 - no intrinsics
 - no ASM
 - no shared libraries
desired resulting
do not fully exploit the
exposing explicitly the different optimizations using pragmas and keywords provided by the compilers
Code C: Register tiling, SIMD, Vector Replacement
float *C1, *C2, *C3, *C4;
const float *B1, *B2, *A1, *A2;
Alejandro Berna, Marta Jiménez and José M. Llabería
Departament d’Arquitectura de Computadors. Universitat Politècnica de Catalunya.
Barcelona, Spain. {aberna, marta, llaberia}@ac.upc.edu
for (i=0; i<dimi; i+=VL)
  for (j=0; j<dimj; j++)
[Figure legend: C + cache tiling (BI=6, BJ=8); B (BI=6, BJ=8); icc (SIMD, loop order ikj); A (BI=6, BJ=8)]
[Diagram labels: A, B, C]