Using CLANG/LLVM Vectorization to Generate Mixed Precision Source Code by portalez, régis & Duguet, Florent
HAL Id: hal-02334250
https://hal.archives-ouvertes.fr/hal-02334250
Submitted on 25 Oct 2019
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of sci-
entific research documents, whether they are pub-
lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destinée au dépôt et à la diffusion de documents
scientifiques de niveau recherche, publiés ou non,
émanant des établissements d’enseignement et de
recherche français ou étrangers, des laboratoires
publics ou privés.
Using CLANG/LLVM Vectorization to Generate Mixed
Precision Source Code
Régis Portalez, Florent Duguet
To cite this version:
Régis Portalez, Florent Duguet. Using CLANG/LLVM Vectorization to Generate Mixed Precision
Source Code. GPU Technology Conference, Apr 2016, San Jose, United States. ￿hal-02334250￿
Using CLANG/LLVM Vectorization to Generate Mixed Precision Source Code 
Régis PORTALEZ — ALTIMESH — regis.portalez@altimesh.com 
Florent DUGUET — ALTIMESH — florent.duguet@altimesh.com 
CUDA Source — with half2 intrinsics 
At Supercomputing 2015, NVIDIA announced Jetson TX1, a mobile supercomputer, 
offering up to 1 TFLOPs of compute power for a power envelope typical of embed-
ded devices. Targeting image processing and deep learning, this platform is the 
first available to natively expose mixed precision instructions. However, the new 
mixed precision unit requires that operations on 16-bit precision floating points 
are done in pairs. Hence, approaching peak performance level requires usage of 
the half2 type which pairs two values in a single register.  
In this work, we present an approach that makes use of existing vectorization tool 
developed for CPU code optimization to further generate CUDA source code that 
uses half2 intrinsic functions, hence enabling mixed precision hardware usage with 
little effort. Using this approach, we are able to generate efficient CUDA code from 
a single scalar version of the code.   
This source to source code translation may be used in many application fields for 
different numeric types. Moreover, this approach shows very nice boundary effects 
such as better memory access pattern and instruction level parallelism. 
MIXED PRECISION DEVICES 
CLANG 
nvcc -arch sm_53 
C++ Source 
code — with 
annotations 
__device__ inline float invf(float f) { 
 const float inv29 = 1.0f / 29.0f; 
 if(f > (6.0f * inv29))   return f*f*f; 
 return 3.0f * (6.0f * inv29) * (6.0f * inv29) * (f - 4.0f * inv29); 
} 
 
#define ALIGN32 __attribute__((align_value(32))) 
 
__global__ void convert ( 
 float* ALIGN32 __restrict red, float* ALIGN32 __restrict green, float*ALIGN32 __restrict 
blue,  
 const float* ALIGN32 __restrict L, const float* ALIGN32 __restrict a, const float* ALIGN32 
__restrict b, 
 const int width, const int height, const int stride) 
{ 
 const float inv116 = 1.0f / 116.0f; 
 const float inv200 = 1.0f / 200.0f; 
 const float inv500 = 1.0f / 500.0f; 
 for (int y = threadIdx.y + blockDim.y * blockIdx.y ;  
   y < height ; y += blockDim.y * gridDim.y) { 
  for (int x = threadIdx.x + blockDim.x * blockIdx.x ;  
    x < width/8 ; x += blockDim.x * gridDim.x) 
  { 
   int index =  x + y * stride; 
   
   #pragma clang loop vectorize(enable) interleave(enable) 
   for(int u = 0; u < 8; ++u) { 
    float lL = L[8*index + u] ; 
    float la = a[8*index + u] ; 
    float lb = b[8*index + u] ; 
 
    float fy = (lL + 16.0f) * inv116 ; 
    float fx = fy + (la * inv500 ) ; 
    float fz = fy - (lb * inv200) ; 
 
    float lX = 5.6506753f * invf(fx) ; 
    float lY = 5.6506753f * invf(fy) ; 
    float lZ = 5.6506753f * invf(fz) ; 
 
    float lR = 0.4184657f * lX + (-0.15866f) * lY + (-0.0828349f) * lZ ; 
            float lG = (-0.091168f) * lX + 0.252431f * lY + 0.015707521f * lZ ; 
        float lB = 0.0009208f * lX + (-0.0025498f) * lY + 0.178598f * lZ ; 
 
    red[8*index + u] = (lR) ; 
    green[8*index + u] = (lG) ; 
    blue[8*index + u] = (lB) ; 
   } 
  } 
 } 
} Our  
Tool 
<.....................................................................> 
.L_4: 
        /*0170*/                   XMAD R0, R18.reuse, c[0x0] [0x164], R23; 
        /*0178*/                   XMAD.MRG R3, R18.reuse, c[0x0] [0x164].H1, RZ; 
        /*0188*/                   XMAD.PSL.CBCC R19, R18.H1, R3.H1, R0; 
        /*0190*/                   ISCADD R12, R19.reuse, c[0x0][0x150], 0x5; 
        /*0198*/                   LDG.128 R12, [R12]; 
        /*01a8*/                   ISCADD R8, R19, c[0x0][0x154], 0x5; 
        /*01b0*/         {         ISCADD R4, R19, c[0x0][0x158], 0x5; 
        /*01b8*/                   LDG.128 R8, [R8];        } 
        /*01c8*/         {         HADD2_32I R17, RZ.H0_H0, 0.2069091796875, 0.2069091796875; 
        /*01d0*/                   LDG.128 R4, [R4];        } 
        /*01d8*/                   HADD2 R0, R13, 16, 16; 
        /*01e8*/                   HMUL2_32I R0, R0, 0.0086212158203125, 0.0086212158203125; 
        /*01f0*/                   HADD2 R13, R14, 16, 16; 
        /*01f8*/                   HMUL2 R3, R0, R0; 
        /*0208*/                   HADD2_32I R2, R0.reuse, -0.137939453125, -0.137939453125; 
        /*0210*/                   HMUL2_32I R27, R9, 0.0020008087158203125, 0.0020008087158203125; 
        /*0218*/                   HMUL2_32I R25, R5, 0.005001068115234375, 0.005001068115234375; 
        /*0228*/                   HMUL2_32I R5, R13, 0.0086212158203125, 0.0086212158203125; 
        /*0230*/                   HMUL2 R9, R0.reuse, R3; 
        /*0238*/                   HSET2.BF.GT.AND R13, R0, R17.H0_H0, PT; 
        /*0248*/                   HMUL2_32I R3, R2, 0.12841796875, 0.12841796875; 
        /*0250*/                   HMUL2_32I R10, R10, 0.0020008087158203125, 0.0020008087158203125; 
        /*0258*/                   HMUL2_32I R6, R6, 0.005001068115234375, 0.005001068115234375; 
        /*0268*/                   HMUL2 R14, R5, R5; 
        /*0270*/                   HADD2 R16, R0, R27; 
        /*0278*/                   MOV R22, R13; 
        /*0288*/                   HADD2 R9, R9, -R3; 
        /*0290*/                   HADD2 R2, R0, -R25; 
        /*0298*/                   HADD2 R13, R5.reuse, R10; 
        /*02a8*/                   HSET2.BF.GT.AND R0, R5.reuse, R17.H0_H0, PT; 
        /*02b0*/                   HADD2_32I R10, R5.reuse, -0.137939453125, -0.137939453125; 
        /*02b8*/                   HADD2 R6, R5, -R6; 
        /*02c8*/                   HMUL2 R24, R5, R14; 
        /*02d0*/                   HFMA2 R9, R22, R9, R3; 
        /*02d8*/                   HMUL2 R5, R16.reuse, R16; 
        /*02e8*/                   HADD2_32I R3, R16, -0.137939453125, -0.137939453125; 
        /*02f0*/                   MOV R25, R0; 
        /*02f8*/                   HMUL2 R0, R13, R13; 
        /*0308*/                   HMUL2_32I R10, R10, 0.12841796875, 0.12841796875; 
        /*0310*/                   HADD2_32I R14, R13.reuse, -0.137939453125, -0.137939453125; 
        /*0318*/                   HMUL2 R5, R16.reuse, R5; 
        /*0328*/                   HSET2.BF.GT.AND R22, R16, R17.H0_H0, PT; 
        /*0330*/                   HMUL2_32I R16, R3, 0.12841796875, 0.12841796875; 
        /*0338*/                   HMUL2 R0, R13, R0; 
        /*0348*/                   HADD2 R3, R24, -R10.reuse; 
        /*0350*/                   HSET2.BF.GT.AND R13, R13, R17.H0_H0, PT; 
        /*0358*/                   HMUL2_32I R14, R14, 0.12841796875, 0.12841796875; 
        /*0368*/                   HADD2 R5, R5, -R16; 
        /*0370*/                   HFMA2 R3, R25, R3, R10; 
        /*0378*/                   MOV R24, R13; 
        /*0388*/                   HADD2 R10, R0, -R14.reuse; 
<.....................................................................> 
LLVM source  
<...................................................> 
vector.body:                                      ; preds = %for.body.6.us 
  %9 = getelementptr inbounds float, float* %L, i64 %8, !dbg !125 
  %10 = bitcast float* %9 to <8 x float>*, !dbg !125 
  %wide.load = load <8 x float>, <8 x float>* %10, align 32, !dbg !125, !tbaa !126 
  %11 = getelementptr inbounds float, float* %a, i64 %8, !dbg !128 
  %12 = bitcast float* %11 to <8 x float>*, !dbg !128 
  %wide.load137 = load <8 x float>, <8 x float>* %12, align 32, !dbg !128, !tbaa !126 
  %13 = getelementptr inbounds float, float* %b, i64 %8, !dbg !129 
  %14 = bitcast float* %13 to <8 x float>*, !dbg !129 
  %wide.load138 = load <8 x float>, <8 x float>* %14, align 32, !dbg !129, !tbaa !126 
  %15 = fadd <8 x float> %wide.load, <float 1.600000e+01, float 1.600000e+01, float 
1.600000e+01, float 1.600000e+01, float 1.600000e+01, float 1.600000e+01, float 
1.600000e+01, float 1.600000e+01>, !dbg !130 
  %16 = fmul <8 x float> %15, <float 0x3F81A7B960000000, float 0x3F81A7B960000000, 
float 0x3F81A7B960000000, float 0x3F81A7B960000000, float 0x3F81A7B960000000, float 
0x3F81A7B960000000, float 0x3F81A7B960000000, float 0x3F81A7B960000000>, !dbg !131 
<...................................................> 
  %26 = select <8 x i1> %21, <8 x float> %23, <8 x float> %25, !dbg !144 
  %27 = fmul <8 x float> %26, <float 0x40169A4AA0000000, float 0x40169A4AA0000000, 
float 0x40169A4AA0000000, float 0x40169A4AA0000000, float 0x40169A4AA0000000, float 
0x40169A4AA0000000, float 0x40169A4AA0000000, float 0x40169A4AA0000000>, !dbg !145 
  %28 = fcmp ogt <8 x float> %16, <float 0x3FCA7B9600000000, float 
0x3FCA7B9600000000, float 0x3FCA7B9600000000, float 0x3FCA7B9600000000, float 
0x3FCA7B9600000000, float 0x3FCA7B9600000000, float 0x3FCA7B9600000000, float 
0x3FCA7B9600000000>, !dbg !146 
ALIGNED LOADS / STORES — CLANG/LLVM 
identifies alignment form attributes such as 
__attribute__((align_value(x))), and trans-
forms into appropriate vector load opera-
tions. Gets transformed to a 128 bits load 
which provides best memory latency and 
bandwidth performances. 
BOOLEAN OPERATIONS —  In the code, invF 
function gets inlined. It contains a if which is 
transformed and a select llvm instruction is 
issued.  Select makes conditional assignment 
based on a mask. This instruction is trans-
formed in a mul and fma using the mask va-
lue as 0 o 1. 
VECTOR ARITHMETIC OPERATIONS — 
CLANG/LLVM generates arithmetic code 
using vector of floats, initialy aimed CPU vec-
tor units. These instructions get reused and 
transformed into instructions using half2 vec-
tor intrinsics available with mixed precision. 
The Jetson TX1 and the upcoming Pascal are capable of mixed precision. This fea-
ture aimed at deep learning and image processing offers a 16-bit floating-point 
arithmetic unit with improved performances. The way this unit is exposed [1] is 
similar to SSE instructions, that is each operation is performed on a pair of values. 
SSE instructions, and more generally SIMD instructions have been around for more 
than a decade.  
Compilers and optimizers are familiar with these instruction sets and generate 
code that makes use of those. The LLVM compiler [2] is an example of choice: in-
deed, its intermediate representation performs the vectorization stage operating 
on small vector registers [3]. Then, these vectors are translated into small vector 
instructions by the backend, depending on the target platform — ARM with NEON, 
Intel x86 with SSE or AVX, IBM Power with VSX.  
Building on the analogy between mixed precision instructions and SIMD instruc-
tions, we make use of the vectorization unit of CLANG operating on C++ together 
with LLVM to generate intermediate language  with small vectors. However, a few 
modifications have to be made: 
 
16-bits floating-point precision (a.k.a. half) is not sup-
ported on most SIMD units, hence, the input source code 
is written using 32-bit precision (float). In the transfor-
mation from LLVM intermediate to CUDA, we convert 
float32 into float16 automatically. 
 
 
The SIMD unit used by LLVM is aligned with known ar-
chitecture sizes which are at least 128 bits (SSE, NEON, 
VSX). However, mixed precision is 32 bits with two halves. 
The transformation is not a one-to-one: a kind of emula-
tion library has to be written. 
 
 
Some operations assumed to exist in the SIMD instruc-
tion set of a GPU do not exist in mixed precision, such as 
select [4] (x86 : vblend, AltiVec/VSX: vec_sel). These opera-
tions have to be written in the emulation library. 
 
As illustrated here, 128-bit aligned memory access provide best bandwidth. Hence, 
we use the 8-entries wide small vector registers from LLVM, as 8 halves have a 128
-bit memory footprint. 
In order to benefit from the 128-bit aligned loads, even using 16-bit 
floating-point precision, we configure the LLVM  optimizer to use 8-
entry wide SIMD instructions and registers. Then, we write a small 
vector emulation library which operates on a custom type half8. 
We then need to write the appropriate operators on these entries. 
 
 
We illustrate here the details of 
the select operation, most others 
being quite straightforward.  
[1]  Mixed Precision — http://docs.nvidia.com/cuda/cuda-math-api/
group__CUDA__MATH____HALF2__ARITHMETIC.html#group__CUDA__MATH____HALF2__ARITHMETIC 
[2]  LLVM Compiler infrastructure: http://llvm.org/ 
[3]  "Vector LLVA: A Virtual Vector Instruction Set for Media Processing", Robert L. Bocchino Jr. and Vikram S. Adve. 
Proceedings of the Second International Conference on Virtual Execution Environments (VEE '06), Ottawa, Cana-
da, 2006. 
[4]  LLVM select instruction : http://llvm.org/docs/LangRef.html#i-select 
[5]  L*a*b* Color Space from CIE: https://en.wikipedia.org/wiki/Lab_color_space 
APPLICATION TO IMAGE PROCESSING 
SMALL VECTOR EMULATION LIBRARY 
APPROACH 
Régis  PORTALEZ — ALTIMESH — regis.portalez@altimesh.com — Florent DUGUET — ALTIMESH— florent.duguet@altimesh.com 
With 128 bits aligned loads — 2.49 ms 
152.58 GB/s — 71% bandwidth usage 
With 32 bits loads — 4.86 ms 
97.56 GB/s — 46% bandwidth usage 
Aligned 128 bits loads improve memory bandwidth on all architectures  
here, on Maxwell (GeForce GTX 980) 
template<> 
hyb_device hyb_inline half8 select<half8, halfbool8>( 
  const halfbool8& m, const half8& l, const half8& r) { 
 half8 res ; 
 res.vh2[0] = __hsub2 (l.vh2[0], r.vh2[0]) ; 
 res.vh2[1] = __hsub2 (l.vh2[1], r.vh2[1]) ; 
 res.vh2[2] = __hsub2 (l.vh2[2], r.vh2[2]) ; 
 res.vh2[3] = __hsub2 (l.vh2[3], r.vh2[3]) ; 
 res.vh2[0] = __hfma2 (m.vh2[0], res.vh2[0], r.vh2[0]) ; 
 res.vh2[1] = __hfma2 (m.vh2[1], res.vh2[1], r.vh2[1]) ; 
 res.vh2[2] = __hfma2 (m.vh2[2], res.vh2[2], r.vh2[2]) ; 
 res.vh2[3] = __hfma2 (m.vh2[3], res.vh2[3], r.vh2[3]) ; 
 return res; 
} 
SELECT INSTRUCTION IMPLEMENTATION USING MIXED PRECISION 
When operating on images, some algorithms operate on different color spaces. We illustrate 
here the conversion between CIE-L*a*b* [5] to RGB color-space, via the CIE-XYZ color space. We 
apply this work to this color transformation. Note that the transformation calls a function that 
contains an if clause, which results in a select in LLVM intermediate language. 
Other image processing algorithms such as filters could be applied, however, the benefit of this 
approach is limited to the vectorization capability of the LLVM optimizer. 
REFERENCES 
DISCUSSION 
This approach reuses  major contributions to the compilers and optimizer research fields. Indeed, 
the data alignment and automatic vectorization including transformation of conditions into select 
instructions are very hard to identify and implement. The implementation of small vector library 
and float to half transformation is pretty straightforward and does not show major technical lock.  
While this work is motivated by the mixed-precision intrinsics library, and its SIMD nature, it can 
also be used without the final float to half transformation. Indeed, on Kepler for instance, more 
execution units than scheduling units are present, which requires the scheduler to issue two fp32 
operations in a single clock cycle to benefit from the full compute power. As a result, using a 
small vector library with several floats is a good way to offer the scheduler opportunities for ILP. 
Another significant improvement is the use of aligned 128-bit loads. Using the same input code 
compiled with CUDA, on a Maxwell architecture leads to bandwidth improvement (as illustrated 
above). When using 32-bit loads, the global read buffer gets saturated while reaching only 46% 
of memory bandwidth. A higher bandwidth usage is immediately obtained, without code change, 
automatically transforming 32-bit loads into 128-bit loads.  
(In this example, the loop is unrolled and both kernels are run with the same configuration and 
require the same number of registers). 
These performance improvements may come at some supplemental cost: operating on four or 
eight values at a time may lead to more register pressure. A heuristic needs to be found to identi-
fy whether the benefit of using 128-bit loads is not lost with register spilling. Further improving 
the float to half transformation tool including a vector-size transformation might actually help in 
this operation. However, this would require significant code analysis as loop iterations count 
would be different. 
© ALTIMESH 2016 —All rights reserved. IBM INTEL ARM CLANG LLVM NVIDIA CUDA are trademarks of their respective owners and are used here for illustration purposes only. 
3 
1 
2 
