Existing source code usually interleaves data management, error checking, text processing and actual computation. On general-purpose processors, this mixture of tasks is not necessarily an issue, and performance levels are often satisfactory as is. However, when trying to use a GPU, this hybrid computing turns into a coding challenge: each individual computing task does not carry sufficient workload, and porting the whole application requires a significant investment in the software asset. We propose an alternative approach with runtime compilation based on function calls to a compute library. Hybrid Vector Library (HVL) operates on vectors, in a manner similar to BLAS level 1 or MKL routines, with additional functions such as square root or exponential. In essence, all operations are performed on a vector of values. We illustrate the performance results of this approach on a typical financial benchmark.

Existing solutions such as ArrayFire do not allow a custom device function to be called in the middle of a sequence of level 1 routines. We address that issue by also processing these functions. We follow the call graph from the main compute routine and generate cubin files for user-defined device functions. These functions are then linked at runtime to the HVL call sequence, and usually generate a JCAL instruction in SASS, in a similar way to sqrt. Our approach gives the user's code benefits similar to those of ArrayFire, with the added flexibility of custom device functions.

Duguet, Florent

Portalez, Régis


HAL Id: hal-02334252
https://hal.archives-ouvertes.fr/hal-02334252
Submitted on 25 Oct 2019
Hybrid Vector Library-From Memory Bound to
Compute Bound with NVVM
Régis Portalez, Florent Duguet
To cite this version:
Régis Portalez, Florent Duguet. Hybrid Vector Library-From Memory Bound to Compute Bound
with NVVM. GPU Technology Conference, May 2017, San Jose, United States. hal-02334252
Hybrid Vector Library—From Memory Bound to Compute Bound with NVVM 
Régis PORTALEZ — ALTIMESH — regis.portalez@altimesh.com 
Florent DUGUET — ALTIMESH — florent.duguet@altimesh.com 
Existing source code usually interleaves data management, error checking, text processing and actual computation. On general-purpose processors, this mixture of tasks is not necessarily an issue, and performance levels are often satisfactory as is.
However, when trying to use a GPU, this hybrid computing turns into a coding challenge. Each individual computing task does not carry sufficient workload, and porting the whole application requires a significant investment in the software asset.
We propose an alternative approach with runtime compilation based on function calls to a compute library. Hybrid Vector Library (HVL) operates on vectors, in a manner similar to BLAS level 1 or MKL routines, with additional functions such as square root or exponential. In essence, all operations are performed on a vector of values. We illustrate the performance results of this approach on a typical financial benchmark.
Existing solutions such as ArrayFire [5] do not allow a custom device function to be called in the middle of a sequence of level 1 routines. We address that issue by processing these functions at compile time.
MOTIVATION 
Similar to MKL or BLAS Level-1 routines, Hybrid Vector Library exposes operations on vectors of values. These operations include basic arithmetic operations, along with mathematical function calls. It also exposes comparison tools and a select operation to support basic value-dependent branching operations.
The API has several implementations that can be chosen at runtime to allow maximal flexibility. We illustrate here the use of two of these implementations.
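As an illustration of this kind of API, the following is a minimal host-only sketch of element-wise operations with a comparison and a select. The type `Vec` and the functions `add`, `less_equal_scalar`, `select_where` and `positive_part` are hypothetical names chosen for this sketch; they are not the HVL API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an HVL vector: a plain array of doubles.
using Vec = std::vector<double>;

// Element-wise addition: one pass over the data per call.
Vec add(const Vec& a, const Vec& b) {
    Vec out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
    return out;
}

// Element-wise comparison against a scalar: 1.0 where a[i] <= s, else 0.0.
Vec less_equal_scalar(const Vec& a, double s) {
    Vec out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = (a[i] <= s) ? 1.0 : 0.0;
    return out;
}

// Select: takes if_true[i] where mask[i] is non-zero, else if_false[i].
Vec select_where(const Vec& mask, const Vec& if_true, const Vec& if_false) {
    Vec out(mask.size());
    for (std::size_t i = 0; i < mask.size(); ++i)
        out[i] = (mask[i] != 0.0) ? if_true[i] : if_false[i];
    return out;
}

// Value-dependent branching without an `if` in user code: max(x, 0).
Vec positive_part(const Vec& x) {
    Vec zeros(x.size(), 0.0);
    return select_where(less_equal_scalar(x, 0.0), zeros, x);
}
```

Each call here makes a full pass over its inputs, which is the behaviour the naïve implementation discussed later has to pay for on the GPU.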
REFERENCES
 [1] Mark Harris, "Compiling Parallel Languages with the NVIDIA Compiler SDK", Supercomputing 2012
 [2] Thibaut Lutz and Vinod Grover, "LambdaJIT: a dynamic compiler for heterogeneous optimizations of STL algorithms", Proceedings of the 3rd ACM SIGPLAN Workshop on Functional High-Performance Computing, ACM, 2014
 [3] NVVM IR documentation: http://docs.nvidia.com/cuda/nvvm-ir-spec/index.html
 [4] Yuan Lin, "Building GPU Compilers with libNVVM": http://on-demand.gputechconf.com/gtc/2013/presentations/S3185-Building-GPU-Compilers-libNVVM.pdf
 [5] ArrayFire documentation: http://arrayfire.org/docs/index.htm
DISCUSSION 
Whether due to kernel scheduling or to systematic cache misses caused by split kernel calls, executing small tasks on a GPU leads to significant performance penalties. As a result, the approach usually chosen is a global port of the application to the GPU, which is a tedious effort on long-lived software assets.
We presented here an alternative solution that results in efficient execution of a queue of small GPU tasks, leveraging runtime compilation to avoid the cost of repeated kernel launches and cache misses on the device. With a good caching strategy, the overall performance is 98% of the performance obtained with a hand-tuned version of the same algorithm.
The utilization of the arithmetic pipe is above 80% on a Kepler K40 GPU, entering the "compute-bound" side of the implementation class.
The resulting device binary is then scheduled for execution and results can be queried. The DAG is encoded into a signature in order to cache the compilation result (a CUDA binary module). As shown, the compilation time may be longer than the overall execution time.
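A signature-keyed cache of this kind could look like the sketch below. It is only an illustration under assumptions: the signature is taken to be a string, `Module` stands in for a CUDA binary module, and `ModuleCache` and its members are hypothetical names, not part of HVL.

```cpp
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical compiled module; in HVL this would be a CUDA binary module.
struct Module {
    std::string binary;
};

class ModuleCache {
public:
    // Returns the module for `signature`. `compile` is only invoked the first
    // time a given signature is seen; later requests reuse the stored module.
    const Module& get(const std::string& signature,
                      const std::function<Module(const std::string&)>& compile) {
        auto it = cache_.find(signature);
        if (it == cache_.end()) {
            ++misses_;
            it = cache_.emplace(signature, compile(signature)).first;
        }
        return it->second;
    }

    // Number of times the (expensive) compile step actually ran.
    int misses() const { return misses_; }

private:
    std::unordered_map<std::string, Module> cache_;
    int misses_ = 0;
};
```

With such a cache, the compilation cost is paid once per distinct sequence of operations rather than once per evaluation.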
When calling API methods, the operations are not scheduled immediately on the device. The different calls are gathered in a graph, which is by construction directed and acyclic (DAG), and no operation is executed until results are queried.
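The deferred behaviour described here can be sketched on the host as follows. This is a minimal illustration, not the HVL implementation: `Node`, `Expr`, `leaf`, `add`, `mul`, `sqrt_of`, `query` and `passes` are hypothetical names, and the graph is evaluated on the CPU rather than compiled for the GPU.

```cpp
#include <cmath>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical graph node: a leaf holding data, or an operation on other nodes.
struct Node {
    enum class Op { Leaf, Add, Mul, Sqrt };
    Op op;
    std::vector<double> data;          // only used by leaves
    std::shared_ptr<const Node> lhs;   // first operand
    std::shared_ptr<const Node> rhs;   // second operand (unused by Sqrt)
};
using Expr = std::shared_ptr<const Node>;

// Builders only record the operation; nothing is computed here.
Expr leaf(std::vector<double> v) {
    return std::make_shared<Node>(Node{Node::Op::Leaf, std::move(v), nullptr, nullptr});
}
Expr add(const Expr& a, const Expr& b) {
    return std::make_shared<Node>(Node{Node::Op::Add, {}, a, b});
}
Expr mul(const Expr& a, const Expr& b) {
    return std::make_shared<Node>(Node{Node::Op::Mul, {}, a, b});
}
Expr sqrt_of(const Expr& a) {
    return std::make_shared<Node>(Node{Node::Op::Sqrt, {}, a, nullptr});
}

// Counts how many passes over the data have been made (for illustration).
int& passes() {
    static int n = 0;
    return n;
}

// Evaluates element i of the expression by walking the graph.
double eval_at(const Expr& e, std::size_t i) {
    switch (e->op) {
        case Node::Op::Leaf: return e->data[i];
        case Node::Op::Add:  return eval_at(e->lhs, i) + eval_at(e->rhs, i);
        case Node::Op::Mul:  return eval_at(e->lhs, i) * eval_at(e->rhs, i);
        case Node::Op::Sqrt: return std::sqrt(eval_at(e->lhs, i));
    }
    return 0.0;  // not reached
}

// Vector length, taken from the left-most leaf.
std::size_t length(const Expr& e) {
    return e->op == Node::Op::Leaf ? e->data.size() : length(e->lhs);
}

// Querying the result triggers one fused pass over all elements.
std::vector<double> query(const Expr& e) {
    ++passes();
    std::vector<double> out(length(e));
    for (std::size_t i = 0; i < out.size(); ++i) out[i] = eval_at(e, i);
    return out;
}
```

Building `mul(add(a, b), sqrt_of(a))` performs no arithmetic; only `query` does, and it does so in a single pass without materializing intermediate vectors.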
RUNTIME COMPILATION 
Depending on the implementation of HVL, the calculation is executed at different stages. In the basic implementation, execution happens upon each API call on a vector of data. With the NVVM-backed version, intermediate results do not exist. Operations are done in four phases.
C++ Application Code (1) 
// scalar code
#include <cmath>  // sqrt, log, exp

double CND(double d);  // cumulative normal distribution, defined elsewhere

double BlackScholesBodyScalar(
    double Sf, // Stock price
    double Xf, // Option strike
    double Tf, // Option years
    double Rf, // Riskless rate
    double Vf  // Volatility rate
)
{
    double S = Sf, X = Xf, T = Tf, R = Rf, V = Vf;

    double sqrtT = sqrt(T);
    double    d1 = (log(S / X) + (R + 0.5 * V * V) * T) / (V * sqrtT);
    double    d2 = d1 - V * sqrtT;
    double CNDD1 = CND(d1);
    double CNDD2 = CND(d2);

    double expRT = exp(- R * T);
    return (S * CNDD1 - X * expRT * CNDD2);
}

// vector code
// hvlvect and hvl_invoke come from the HVL headers (not shown)
// earlier: mycnd has been declared extern "C" for the symbol to be retrieved
hvlvect BlackScholesBodyVect(const hvlvect& S, const hvlvect& X, const hvlvect& T,
    const hvlvect& R, const hvlvect& V)
{
    hvlvect VsqrtT = V * sqrt(T);
    hvlvect    d1 = (log(S / X) + (R + 0.5 * V * V) * T) / (VsqrtT);
    hvlvect    d2 = d1 - VsqrtT;
    hvl_invoke(d1, mycnd);  // user-defined device function applied to d1
    hvl_invoke(d2, mycnd);  // (2)
    hvlvect expRT = exp(-R * T);
    return (S * d1 - X * expRT * d2);
}
HVL API Calls  
hvl_create 
... 
hvl_assign_hybridvector 
hvl_apply_sqrt 
hvl_assign_hybridvector 
hvl_mul 
hvl_assign_hybridvector 
hvl_mul 
hvl_mul_scalar 
hvl_add 
hvl_mul 
hvl_assign_hybridvector 
hvl_div 
hvl_apply_log 
hvl_add 
hvl_div 
hvl_assign_hybridvector 
hvl_sub 
hvl_invoke
hvl_invoke
hvl_assign_hybridvector 
hvl_mul 
hvl_mul_scalar 
hvl_apply_exp 
hvl_assign_hybridvector 
hvl_mul 
hvl_assign_hybridvector 
hvl_mul 
hvl_mul 
hvl_sub 
hvl_destroy 
... 
At given milestones, the DAG is converted 
into NVVM source code: each node is an 
NVVM statement with a single output. The 
NVVM source code is compiled at runtime. The 
sequence of calls and the compilation result are 
cached for future usage.  
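The conversion of a DAG into single-output statements can be sketched as below. The flattened representation `Stmt`, the helper `value_name` and the function `emit` are hypothetical, and the emitted text only imitates the shape of the NVVM listing; it is not valid NVVM IR.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flattened DAG: nodes are stored in creation order, so every
// operand has a smaller index than the node that uses it.
struct Stmt {
    std::string op;  // "load", "fadd", "fmul", "fdiv", ...
    int a;           // first operand index, or parameter index for "load"
    int b;           // second operand index, or -1 for unary operations
};

// Name of the single value produced by node i.
std::string value_name(int i) {
    return "%v" + std::to_string(i);
}

// Emits one statement per node, each with exactly one output value.
std::string emit(const std::vector<Stmt>& dag) {
    std::string out;
    for (std::size_t i = 0; i < dag.size(); ++i) {
        const Stmt& s = dag[i];
        out += value_name(static_cast<int>(i)) + " = " + s.op;
        if (s.op == "load") {
            out += " param" + std::to_string(s.a);
        } else {
            out += " " + value_name(s.a);
            if (s.b >= 0) out += ", " + value_name(s.b);
        }
        out += "\n";
    }
    return out;
}
```

Because nodes are visited in creation order, every operand is defined before it is used, which is what a straight-line loop body requires.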
PERFORMANCE OF NAIVE IMPLEMENTATION 
The naïve implementation performs a kernel call for each vector operation. Beyond the lack of compiler optimization that would, for example, reconstruct FMA operations, this implementation suffers a significant performance penalty. Indeed, each kernel call needs to be scheduled and executed. As illustrated in the following profiling snapshots, the execution time of a launch is about 25 microseconds (10 µs configuration and 15 µs launch). Within this time, about 1 million vector entries can be processed (calculating exp or log of the values, for instance).
Moreover, kernel executions are memory bound. Indeed, current GPUs can execute more than 50 floating-point operations for each memory operation, making all simple math functions, including transcendentals such as exponential, memory bound. We can see that performance is driven by memory operations and not arithmetic complexity.
Figure: Kernel Execution Time (bar chart, axis from 0 to 3500; unit not stated). Values per kernel:
hvl_cuda_compare_scalar<le>; 741
hvl_cuda_unary<log>; 1196
hvl_cuda_unary<rcp>; 1256
hvl_cuda_binary_scalar<mul>; 1263
hvl_cuda_unary<exp>; 1267
hvl_cuda_unary<fabs>; 1267
hvl_cuda_binary_scalar<add>; 1270
hvl_cuda_unary<sqrt>; 1271
hvl_select; 1699
hvl_cuda_binary<sub>; 1782
hvl_cuda_binary<mul>; 1784
hvl_cuda_binary<add>; 1787
hvl_cuda_binary<div>; 1812
hvl_cuda_moments; 3248
Chart annotations: "Two reads and one write" and "One read and one write".
PERFORMANCE OF RUNTIME COMPILATION OF DAG 
As we can see in this table, runtime compilation requires significantly more CPU time than execution for sizes in the 100k range.
Compilation of NVVM code takes about 50 milliseconds, which is much higher than most execution times. A good caching strategy is needed.
As future work, we consider performing register allocation and PTX generation directly for DAG instances where the initial cost cannot be amortized by a caching strategy.
Task                        Execution time (microseconds)
Converting DAG to NVVM      97.34
NVVM compilation to PTX     49,912.08
PTX compilation to CUBIN    1,674.64
CUBIN load                  517.53
Execution of the same algorithm with the same launch settings (120 blocks, 256 threads, on a Tesla K40c with CUDA 8.0, for a variety of option counts)
User-defined device functions are identified in the call graph. For each of them, CUDA source is generated, as well as a cubin file. Each function/cubin pair is registered at application startup.
define void @hvl_nvvm_0 (i64 %n, double* %output, double* %param.load.37, double* %param.load.38,
                         double* %param.load.41, double* %param.load.52, double* %param.load.54) {
entry:
    ; indices management (elided on the poster)
for.body:
    %load.idx.37 = getelementptr inbounds double* %param.load.37, i64 %idxprom
    %hvl.37 = load double* %load.idx.37, align 8
    %load.idx.38 = getelementptr inbounds double* %param.load.38, i64 %idxprom
    %hvl.38 = load double* %load.idx.38, align 8
    %hvl.36 = fdiv double %hvl.37, %hvl.38
    %hvl.35 = call double @__nv_log ( double %hvl.36 )
    %load.idx.41 = getelementptr inbounds double* %param.load.41, i64 %idxprom
    %hvl.41 = load double* %load.idx.41, align 8
    %load.idx.52 = getelementptr inbounds double* %param.load.52, i64 %idxprom
    %hvl.52 = load double* %load.idx.52, align 8
    %hvl.43 = fmul double %hvl.52 , 5.0000000000000000e-001
    %hvl.42 = fmul double %hvl.43, %hvl.52
    %hvl.40 = fadd double %hvl.41, %hvl.42
    %load.idx.54 = getelementptr inbounds double* %param.load.54, i64 %idxprom
    %hvl.54 = load double* %load.idx.54, align 8
    %hvl.39 = fmul double %hvl.40, %hvl.54
    %hvl.34 = fadd double %hvl.35, %hvl.39
    %hvl.53 = call double @__nv_sqrt ( double %hvl.54 )
    %hvl.51 = fmul double %hvl.52, %hvl.53
    %hvl.33 = fdiv double %hvl.34, %hvl.51
    %hvl.4 = call double @mycnd (double %hvl.33)
    %hvl.2 = fmul double %hvl.37, %hvl.4
    %hvl.28 = fmul double %hvl.41 , -1.0000000000000000e+000
    %hvl.27 = fmul double %hvl.28, %hvl.54
    %hvl.26 = call double @__nv_exp ( double %hvl.27 )
    %hvl.24 = fmul double %hvl.38, %hvl.26
    %hvl.32 = fsub double %hvl.33, %hvl.51
    %hvl.31 = call double @mycnd (double %hvl.32)
    %hvl.23 = fmul double %hvl.24, %hvl.31
    %hvl.1 = fsub double %hvl.2, %hvl.23
    %output.idx = getelementptr inbounds double* %output, i64 %idxprom
    store double %hvl.1, double* %output.idx, align 8
    %idxprom.next = add i64 %idxprom, %stepprom
    br label %for.tail
for.tail:
    ; indices management (elided on the poster)
    ....
function.end:
//--------------------- .nv.info.mycnd            -------------------------- 
 .section .nv.info.mycnd,"",@"SHT_CUDA_INFO" 
 .align 4 
hvl_nvvm_0: 
.text.hvl_nvvm_0: 
        /* entry block */ 
.L_23: 
        /* some compute blocks */ 
.L_16: 
 
.L_21: 
        /*0788*/                   MOV R4, R24; 
        /*0790*/                   MOV R5, R25; 
        /*0798*/                   JCAL `(mycnd); 
        /*07a8*/         {         DMUL R6, R30, -R16; 
        /*07b0*/                   SSY `(.L_22);        } 
        /*07b8*/                   DMUL R8, R6, c[0x2][0x70]; 
        /*07c8*/                   DADD R8, R8, 6.75539944105574400000e+015; 
        /*07d0*/                   DADD R12, R8, -6.75539944105574400000e+015; 
        /*07d8*/                   DFMA R10, R12, c[0x2][0x78], R6; 
        /*07e8*/                   DFMA R10, R12, c[0x2][0x80], R10; 
        /*07f0*/                   MOV32I R12, 0xfca213ea; 
        /*07f8*/                   MOV32I R13, 0x3e928af3; 
        /*0808*/                   DFMA R12, R10, c[0x2][0x88], R12; 
        /*0810*/                   DFMA R12, R10, R12, c[0x2][0x90]; 
        /*0818*/                   DFMA R12, R10, R12, c[0x2][0x98]; 
        /*0828*/                   DFMA R12, R10, R12, c[0x2][0xa0]; 
        /*0830*/                   DFMA R12, R10, R12, c[0x2][0xa8]; 
        /*0838*/                   DFMA R12, R10, R12, c[0x2][0xb0]; 
        /*0848*/                   DFMA R12, R10, R12, c[0x2][0xb8]; 
        /*0850*/                   DFMA R12, R10, R12, c[0x2][0xc0]; 
        /*0858*/                   DFMA R12, R10.reuse, R12, c[0x2][0xc8]; 
        /*0868*/                   FSETP.LT.AND P0, PT, |R7|, c[0x2][0xd0], PT; 
        /*0870*/                   DFMA R12, R10, R12, c[0x2][0x0]; 
        /*0878*/                   DMUL R22, R22, R4; 
        /*0888*/                   DFMA R10, R10, R12, c[0x2][0x0]; 
        /*0890*/                   ISCADD R17, R8, R11, 0x14; 
        /*0898*/         {         MOV R16, R10; 
        /*08a8*/               @P0 SYNC                                                    (*"TARGET= .L_22 "*);        } 
        /*08b0*/         {         MOV R0, R7; 
        /*08b8*/                   DSETP.LT.AND P0, PT, R6, RZ, PT;        } 
        /*08c8*/                   DADD R4, R6, +INF ; 
        /*08d0*/                   FSETP.GEU.AND P1, PT, |R0|, c[0x2][0xd4], PT; 
        /*08d8*/                   SEL R16, RZ, R4, P0; 
        /*08e8*/                   SEL R17, RZ, R5, P0; 
        /*08f0*/         {         MOV R4, R10; 
        /*08f8*/               @P1 SYNC                                                    (*"TARGET= .L_22 "*);        } 
        /*0908*/                   LEA.HI R0, R8.reuse, R8, RZ, 0x1; 
        /*0910*/                   SHR R5, R0, 0x1; 
        /*0918*/                   IADD R0, R8, -R5; 
        /*0928*/                   ISCADD R5, R5, R11, 0x14; 
        /*0930*/                   MOV R6, RZ; 
        /*0938*/                   ISCADD32I R7, R0, 0x3ff00000, 0x14; 
        /*0948*/                   DMUL R16, R4, R6; 
        /*0950*/                   SYNC                                                    (*"TARGET= .L_22 "*); 
.L_22: 
        /* exit block */ 
.L_71: 
Listings above: NVVM Generated Code, then SASS Generated Code
(1) In this implementation, all calls are within the same function, but they can be spread across multiple source files or binary modules.
From HVL API calls, a DAG is generated and, upon request, transformed into NVVM source code.
Vector operators are overloaded in C++ to make use of the HVL library and include error checking using exceptions. User-defined device code is invoked through special API calls.
A cubin file is generated at runtime and linked with precompiled modules of custom device functions.
When operations are executed upon each library API call, performance is memory-bound and kernel execution time depends solely on the amount of memory read or written.
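A back-of-the-envelope comparison of memory traffic makes this concrete. The sketch below counts bytes moved per vector element for double-precision data; the function names are hypothetical, and the kernel counts used in the note that follows are an assumption read off the HVL API call trace (treating `hvl_invoke` and the scalar multiplies as one-read, one-write kernels and ignoring any traffic from `hvl_assign_hybridvector`).

```cpp
#include <cstddef>

// Size of one double-precision value, in bytes.
constexpr std::size_t kBytesPerDouble = 8;

// One kernel per operation: a unary kernel does one read and one write per
// element, a binary kernel does two reads and one write.
constexpr std::size_t naive_bytes_per_element(std::size_t unary_kernels,
                                              std::size_t binary_kernels) {
    return (unary_kernels * 2 + binary_kernels * 3) * kBytesPerDouble;
}

// One fused kernel: each input is read once and each output written once.
constexpr std::size_t fused_bytes_per_element(std::size_t inputs,
                                              std::size_t outputs) {
    return (inputs + outputs) * kBytesPerDouble;
}
```

With 7 unary-style and 13 binary kernels in the trace, the kernel-per-call path would move 424 bytes per element, whereas the fused NVVM kernel, with its five loads and one store, moves 48 bytes: close to a ninefold reduction under these assumptions.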
BENEFITS OF USER-DEFINED FUNCTIONS 
In the case of complex algorithms, for example when branching cannot be converted into functions like maximum, the set of methods exposed in the library is not necessarily sufficient for a single-source implementation. It is sometimes necessary either to implement kernels by hand (in which case one per architecture) or to retrieve data on the CPU, losing a significant part of the performance benefit of the approach.
With user-defined functions enabled, the user can write a function in a single version, and with a customized compilation toolchain that function can be invoked by all underlying implementations (host or device). Such functions are declared using supplemental attributes that let the toolchain connect the implementations.
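As a host-side illustration of a single-version user function, the sketch below defines a possible `mycnd` and uses it in the scalar Black-Scholes formula. The body of `mycnd` is an assumption (the poster does not show it): it computes the cumulative normal distribution with `std::erfc`. The helper `invoke_host` is a hypothetical host counterpart of `hvl_invoke`, and the supplemental toolchain attributes mentioned above are not shown because the poster does not specify them.

```cpp
#include <cmath>
#include <vector>

// User function written once. extern "C" keeps the symbol name unmangled so
// that it can be retrieved by name. The erfc-based body is an assumption.
extern "C" double mycnd(double d) {
    return 0.5 * std::erfc(-d / std::sqrt(2.0));
}

// Hypothetical host-side counterpart of hvl_invoke: applies f in place.
void invoke_host(std::vector<double>& v, double (*f)(double)) {
    for (double& x : v) x = f(x);
}

// Scalar Black-Scholes call price, following the poster's scalar code.
double black_scholes_call(double S, double X, double T, double R, double V) {
    const double sqrtT = std::sqrt(T);
    const double d1 = (std::log(S / X) + (R + 0.5 * V * V) * T) / (V * sqrtT);
    const double d2 = d1 - V * sqrtT;
    return S * mycnd(d1) - X * std::exp(-R * T) * mycnd(d2);
}
```

The same `mycnd` serves both the scalar formula and the element-wise `invoke_host` path, which is the single-source property the section argues for.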

