OpenMP is a shared memory programming model which supports the offloading of target regions to accelerators such as NVIDIA GPUs. e implementation in Clang/LLVM aims to deliver a generic GPU compilation toolchain that supports both the native CUDA C/C++ and the OpenMP device offloading models.
ere are situations where the semantics of OpenMP and those of CUDA diverge. One such example is the policy for implicitly handling local variables. In CUDA, local variables are implicitly mapped to thread local memory and thus become private to a CUDA thread. In OpenMP, due to semantics that allow the nesting of regions executed by different numbers of threads, variables need to be implicitly shared among the threads of a contention group.
In this paper we introduce a re-design of the OpenMP device data sharing infrastructure that is responsible for the implicit sharing of local variables in the Clang/LLVM toolchain. We introduce a new data sharing infrastructure that lowers implicitly shared variables to the shared memory of the GPU.
We measure the amount of shared memory used by our scheme in cases that involve scalar variables and statically allocated arrays.
e evaluation is carried out by offloading to K40 and P100 NVIDIA GPUs. For scalar variables the pressure on shared memory is relatively low, under 26% of shared memory utilization for the K40, and does not negatively impact occupancy. e limiting occupancy factor in that case is register pressure. e data sharing scheme offers the users a simple memory model for controlling the implicit allocation of device shared memory.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permi ed. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. LLVM-HPC'17: , Denver, CO, USA © 2017 ACM. 978-1-4503-5565-0/17/11. . . $15.00 DOI: h ps://doi.org/10.1145/3148173.3148189
CCS CONCEPTS
•LLVM OpenMP support → target region device offloading; shared memory usage; •LLVM backend → shared memory allocation;
INTRODUCTION
e increasingly wide adoption of the OpenMP 1 programming model in conjunction with heterogeneous architectures has led to support for OpenMP device offloading being integrated into all the major compilers with Clang/LLVM at the forefront of this development effort.
Although, in this paper, OpenMP device offloading only targets NVIDIA GPUs, the discussion aims to be extensible to any existing toolchain that can become a target of the OpenMP device offloading model. From an OpenMP perspective, this paper is focused on the redesign of the implicit sharing of variables across threads which is one of the the most challenging aspects of our OpenMP support with ramifications touching both correctness and performance.
e principle that has guided the development of the OpenMP device offloading support for GPUs was to reuse, as much as possible, the existing OpenMP support for the host as well as the existing device specific support in the LLVM backend. e resulting implementation is a generalization of the existing CUDA toolchain which has been extended to support the compilation of OpenMP target regions. In doing so, we aim to have our implementation automatically build on any future changes to NVIDIA devices and the OpenMP standard. e device specific backend, i.e. the NVPTX backend of LLVM, has, so far, not been involved in any of the changes to generalize the CUDA toolchain to support OpenMP target regions. e NVPTX backend contains several assumptions strongly aligned with the native CUDA C/C++ programming model and less so with OpenMP. In this paper we will address one such assumption that is strongly connected with the sharing of variables among OpenMP threads: the implicit lowering of local variables to thread-local memory.
Contributions
In this paper we make the following contributions:
• We introduce a new implicit allocation policy in the NVPTX backend of LLVM to lower implicitly shared OpenMP variables to the shared memory of the device.
• We introduce a redesign of the data sharing scheme on top of Clang trunk and libomptarget adapted to the changes to the LLVM NVPTX backend mentioned in the previous contribution.
• We evaluate the impact on the shared memory usage of the new data sharing scheme to stress test the limits of the applicability of the scheme on both NVIDIA K40 and P100 GPUs.
Background
e existing support for OpenMP target regions is built on top of the host implementation of OpenMP and is confined, almost exclusively, to the Clang frontend code generation module. e most recent code generation scheme for OpenMP target regions is detailed in [5] and is based on previous work [1, 3, 4] covering data-parallel cases [2, 6] as well as nested parallelism [5] .
Overview
In Section 2 we give an overview of previous work that involves the Clang/LLVM OpenMP device offloading toolchain. Section 3 contains a set of examples in which implicit data sharing is required by the OpenMP standard. Section 4 details the code generation practices used in the most recent version of OpenMP device offloading. In Section 5 we introduce changes to Clang, LLVM and libomptarget. e evaluation of the data sharing scheme is performed in Section 6.
RELATED WORK
Nested parallelism and data sharing across CUDA threads on NVIDIA GPUs are the main subjects of [9] . e paper introduces a CUDA language extension for nested parallelism, where CUDA kernels can contain OpenMP-like pragmas that mark when a loop can be executed in parallel. When executing, a set of master threads (e.g. one per warp) are active in the kernel in all regions outside of the ones marked as nested parallel; slave threads (e.g. remaining threads in each warp) are only activated when executing in the nested parallel regions. Under this scheme, a data sharing problem is defined when slave threads need to access data declared in the original CUDA program as local variables owned by their master thread. Unlike our contribution, this paper describes a source-tosource compiler.
e authors identify two scenarios under which data sharing is required and present compiler solutions to these. First, scalar variables declared as locals can be shared either using shi instructions, if available, or shared memory. To enable sharing from a sequential to a parallel region, the compiler inserts special sharing function calls that mask the actual implementation (shi ing or shared memory).
ree cases of sharing from a parallel to a sequential region are identified: reductions and scans, which can be implemented using shi s or shared memory depending on the parallelism scheme adopted; a special case that corresponds to the conditional modifier of lastprivate in OpenMP [7] .
is is implemented as a reduction and through pre-initialization of the base variable.
Second, static arrays declared as locals can be mapped using three strategies, namely using (i) global memory, (ii) shared memory, or by (iii) partitioning into multiple smaller local memory arrays. Local array partitioning (iii) is used when the compiler can prove that each slave will only access its assigned array partition. If that cannot be proved, then the compiler selects shared memory (ii) if the size of the local array is smaller than a pre-determined constant. If shared memory cannot be used, then the compiler fallbacks into a global memory mapping (i).
e solution presented in this paper maps all local variables onto shared memory and we describe future work in the direction of falling back into global memory. Falling back will be based on similar conclusion as those mentioned in [9] . is paper shows an implementation of data sharing within the LLVM NVPTX backend for the OpenMP language. It relies on high-level information from Clang for optimization purposes.
Our contribution and [9] rely on the presence of implicit and compiler-inserted barriers on GPUs to guarantee a consistent view of data stored in shared and global memory amongst cooperating threads. is is also the case of [8] , which describes an OpenMP implementation of a special-purpose DSP accelerator. e paper describes a so ware-based memory coherence mechanism for shared variables -no data sharing mechanism is required in this implementation for private data. is is based on introducing memory consistency operations (e.g. write-back) at appropriate OpenMP flush points, when necessary.
IMPLICIT SHARING OF VARIABLES BETWEEN OPENMP THREADS
OpenMP supports the nesting of code regions executed by different numbers of threads. ere are numerous use cases in practice where variables need to be implicitly shared among several or all the threads of a team.
OpenMP directives
roughout the paper we will use OpenMP code block to mean the user code in between two consecutive OpenMP directives which potentially alter the number of threads. Within the same execution unit launched on the device (i.e. the same kernel) the number of active threads may vary depending on which OpenMP code block is being executed: the code in between a target and a parallel may be executed by one thread while the code inside the parallel is executed by all threads. OpenMP semantics allows for an OpenMP code block to be single-threaded, fully-threaded or be executed by a user-defined number of threads.
3.1.1 target regions. A target region is a code block associated with a #pragma omp target directive. Target regions are compiled for the device specified via the -fopenmp-target=<target triple> compiler flag. In the case of NVIDIA K10 and P100 GPUs the target triple is given by nvptx64-nvidia-cuda. A target region that contains no other OpenMP directives is executed by a single thread.
A target region may contain a teams directive in which case, the two directives must be closely nested i.e. there exists no OpenMP code block in between the two directives. Apart from this case, OpenMP code blocks can occur between any other two consecutive directives that affect parallel execution: teams, distribute, parallel, for and simd.
parallel regions.
A code block associated with a #pragma omp parallel directive is executed by all available threads of a contention group unless the user specifies a custom number of threads via the thread limit clause. All threads execute the same parallel region body unless a worksharing construct such as a #pragma omp for is encountered. e team of threads executing the parallel region that encounter the worksharing construct cooperatively execute its associated code block. OpenMP allows for parallel regions to be nested further increasing the number of possible regions executed by different threads.
Implicitly shared variables
e target and parallel directives are sufficient for constructing an execution unit within which the number of active threads varies. Example is shown in Figure 1 . Figure 1a contains a single nonempty OpenMP code block where a local variable c is declared. Variable c is local to each thread and is implicitly private to each thread.
In Figure 1b the code block in between the target and the parallel directives is not empty. OpenMP semantics for target regions require that this code block be executed by a single thread. In a realworld use case the code in this code block may contain side effects which would lead to incorrect multithreaded execution.
In this la er example, variable c is local to the one thread executing the target-only region but has to be available to all threads executing the parallel construct. In the absence of clauses that would alter implicit sharing, it is essential to the correctness of the program that variable c be implicitly shared by the master thread with all the threads executing the parallel region.
In general, according to the OpenMP 4.5 specification, implicitly shared variables are defined as variables referenced inside a given construct and do not have predetermined data-sharing attributes, and are not listed in a data-sharing a ribute clause on the construct. ere are several cases:
• When these variables are inside a teams, parallel or task generating construct, the data-sharing a ributes are determined by the default clause if present. In a parallel construct, if no default clause is present, these variables are shared.
• For constructs other than task generating constructs or target constructs, if no default clause is present, these variables reference the variables with the same names that exist in the enclosing context. • In a target construct, variables that are not mapped a er applying data-mapping a ribute rules are firstprivate. In an orphaned task generating construct, if no default clause is present, formal arguments passed by reference are firstprivate.
• In a task generating construct, if no default clause is present, a variable for which the data-sharing a ribute is not determined by the rules above and that in the enclosing context is determined to be shared by all implicit tasks bound to the current team is shared.
• In a task generating construct, if no default clause is present, a variable for which the data-sharing a ribute is not determined by the rules above is firstprivate.
In this paper we will focus on the simplest example which contains a parallel construct nested inside a target region.
OPENMP CODE GENERATION IN CLANG
entry : %1 = icmp ult i32 % nvptx_tid , % thread_limit br i1 %1 , label % worker , label % mastercheck worker : call void @WORKER () br label % exit mastercheck : %2 = icmp eq i32 % nvptx_tid , % master_tid br i1 %2 , label % master , label % exit master :
call void @__kmpc_kernel_ in it ( i32 % thread_limit ) call void @__kmpc_ker ne l_ pr e pa re _ pa ra l le l () ; Code block 1 emitted here call void @llvm . nvvm . barrier0 () call void @llvm . nvvm . barrier0 () br label % exit exit : ret void In our latest publication on the Clang code generation [5] we introduce a so ware implementation of the fork-join model for the GPU. is new scheme is based on dynamic assignment of pointers to outlined functions (i.e. the actual workloads) to a pool of parallel threads. We also refer to this scheme as dynamic work allocation.
is scheme models the flexibility of the fork-join model on which the OpenMP programming language is based.
Due to the fact that we target NVIDIA GPUs, in this section and in the remainder of the paper we will use the OpenMP terminology of teams and threads as a one-to-one mapping to the CUDA model concepts of threadblocks and CUDA threads respectively. e code generation scheme manages the threads within a contention group and ensures that OpenMP semantics are respected. OpenMP supports the nesting of several parallel construct types. In this paper we focus on the simplest example for which the new code generation is employed: the nesting of a parallel region inside a sequential region. An example of this use case is shown in Figure 1b where code block 1 is executed sequentially whilst code block 2 is parallel.
e sequential region must be executed by only one thread, due to potential side effects in the user code. We also want to support cases in which functions called from within this region may contain other OpenMP parallel constructs. e la er aspect is handled by the dynamic work allocation scheme in which the master thread allocates any outstanding parallel workloads to be cooperatively executed by a group of worker threads. Work allocation entails the passing of a pointer to an outlined function containing an individual parallel workload. Target regions are compiled down to kernels launched on the GPU with a predetermined 2 number of teams and threads per team. Within a team, the threads are all uniformly launched. Within this uniform pool of threads, the code generation scheme assigns all threads in order of the thread identifier to the worker pool except for the last 32 threads. e last 32 threads are reserved for the master thread region.
e last 32 threads correspond to a full CUDA warp of threads 3 . From within the 32 threads we select the first as the master and deactivate the remaining 31. e LLVM-IR code generated for this scheme is shown in Figure 2 . Once the master thread is isolated from the rest of the threads, it executes any sequential region and assigns the workload for all the worker threads to execute.
Named barriers are used to control the execution of master and worker threads. Whilst the master executes, the workers wait at a barrier and vice-versa.
e code for the workers is shown in Figure 3 . e outlined parallel region is a function pointer passed by the master to the worker threads.
IMPLICIT SHARING OF VARIABLES IN CLANG/LLVM
In the previous section we outlined the code generation scheme employed by Clang. We consider it the baseline implementation on top of which we introduce the new data sharing infrastructure. Apart from any device specific challenges, the new data sharing infrastructure needs to deal with additional issues introduced by the previous code generation scheme: (1) the data sharing infrastructure requires the sharing of variables across different functions, i.e. from the master function to the worker function, (2) the worker function must be able to handle multiple outlined workloads and (3) each outlined workload may need to access a unique combination of implicitly shared variables.
On the device specific side the challenges are: (1) due to no communication between Clang and LLVM outside the code Clang generates, implicitly shared variables must be detected in the LLVM backend, (2) variables which need to be implicitly shared must be allocated in a shareable address space of the GPU device.
In this section we will cover changes to three different packages: Clang, LLVM and libomptarget. e main data sharing infrastructure is discussed in relation to the master-worker data sharing outlined in Section 4.
Clang code generation
To share a value across functions, we rely on the runtime to set up an array of references to the shared values. In the following section we will describe how the runtime manages this list. In this section we will focus on the changes to Clang code generation.
roughout this section we assume that the address of any shared variable can be shared among threads.
In Figure 1b we have a simple data sharing example which we will use to describe the changes to the Clang code generation scheme.
Both the master and any or all the worker threads may require read and write access to a shared variable. Making sure that the most up to date value is used, we require the sharing of a reference to this value instead. e master and worker threads can then follow this reference every time access to the variable is required for either reading or writing. Due to the way the code generation scheme in Section 4 works, there are no race conditions between master-and worker-thread accesses. Race conditions across workers are handled at user level. 3 For a detailed explanation of this choice please refer to [5] .
We first create a reference to variable c by invoking the appropriate alloca instruction. We also create the pointer to the list of shared references, shared args: define void @KERNEL ( i32 * dereferenceable (4) % c ){ entry : % c . addr = alloca i32 store i32 %c , i32 * % c . addr % shared_args = alloca i8 **
In the master only region, we invoke the runtime function kmpc kernel prepa augmented with the reference to the list of shared argument references along with the number of shared variables:
call void @__kmpc_kern e l_ pr e pa re _ pa ra l le l (... , i8 *** % shared_args , i32 1) e runtime will return a reference to a list of the desired length which the master begins to initialize with the references to the shared variables. In our example there exists only one such value that requires initialization and the following code is emi ed following the runtime call above: %17 = load i8 ** , i8 *** % shared_args %18 = getelementptr inbounds i8 *, i8 ** %17 , i64 0 %19 = bitcast i32 * % c . addr to i8 * store i8 * %19 , i8 ** %18
Once the value is set, the master-only region can access the shared variable via its reference. Any updates to the variable in the master region will therefore be visible to any worker thread that follows the same reference.
e worker function requires minor changes to handle the passing in of the shared args list. e worker function interacts with the runtime via the kmpc kernel parallel function. e function has been extended to support this. Following this call, each worker thread obtains a handle on the list of shared variables.
call i1 @__kmpc_kerne l_ pa ra l le l ( i8 ** % work_fn , i8 *** % shared_args ) e list of arguments is potentially unique to every outlined region and each worker needs to know the way in which the outlined function is called. We construct a special function called a wrapper function which passes the arguments to the parallel outlined region including any shared arguments. Each worker, instead of calling the outlined parallel region directly, will call the wrapper instead. e wrapper arguments include the list of shared arguments: %5 = load i8 ** , i8 *** % shared_args call void @WRAPPER (... , i8 ** %5) e wrapper function is shown in Figure 4 . e wrapper function controls the order of the parameters by passing them in the same order they appear in the list of shared arguments. define void @WRAPPER (... , i8 **){ entry : % c . addr = alloca i32 * %. addr2 = alloca i8 ** store i8 ** %2 , i8 *** %. addr2 , bra label % next next : %3 = load i8 ** , i8 *** %. addr2 %4 = getelementptr inbounds i8 *, i8 ** %3 , i64 0 %5 = bitcast i8 ** %4 to i32 ** %6 = load i32 *, i32 ** %5 call void @OUTLINE_PARALLEL ( i32 * null , i32 * null , i32 * %6) bra label % exit exit :
ret void } Figure 4 : Generated code for OpenMP wrapper function which passes any arguments which come from data sharing to the outlined parallel function.
libomptarget: support list of references to
shared variables e changes to the runtime include changes to the interface to accommodate the passing of the list of references to shared variables and its allocation.
e list of references to shared variables consists of a statically preallocated list in the shared memory of the device and is 20 entries in length. On the K40 and the P100 NVIDIA GPUs this leads to a shared memory footprint of 160 bytes per threadblock (or OpenMP team). Note that this list only needs to hold the references to the shared arguments so it only needs to handle the number of shared entities regardless of whether they are scalars or statically declared arrays.
e length of the list has been empirically chosen based on our limited application experience and is a conservative figure. With the increases in shared memory on newer GPU models such as the NVIDIA P100, the size of the preallocated list can be increased.
When the size of this list is insufficient, the back-up scheme is to dynamically allocate a list of variables using the malloc function. e list will therefore be allocated in the global memory of the device at the beginning of the parallel region and deallocated at the end. is back-up scheme is designed as a correctness safety-net.
e shared memory implementation on the other hand, is designed to deliver lower latency accesses. Experiments show that the different in performance between the two schemes can be as large as an order of magnitude.
Generalizing the LLVM NVPTX backend
e design principle guiding the development of the OpenMP device offloading toolchain for NVIDIA GPUs was to generalize the functionality already exposed in the CUDA toolchain of Clang/L-LVM. ere are several reasons for advocating for a more general toolchain: (1) code reuse of accelerator specific parts of the code base and (2) keeping up with any NVIDIA specific architectural changes. Tools like NVPTX will always be kept up to date with the latest CUDA releases so having that as part of the toolchain increases the long term maintainability of any OpenMP device offloading toolchain and reduces code duplication.
e new scheme in Clang code generation for implicitly shared variables described in Sections 5.1 and 5.2 relies on the addresses of the variables being shareable among threads. On NVIDIA GPUs only variables in shared or global memory can have their addresses shared across the threads of an OpenMP team.
e LLVM NVPTX backend lowers all the locally allocated variables via the alloca LLVM-IR instruction to a thread local memory stack which is emi ed at the level of the PTX code. Lowering variables to thread local memory is in line with the CUDA programming model and it is enough to satisfy its requirements. For OpenMP a more generic allocation policy of local variables is required. In the remainder of the section we discuss the lowering of local variables to the shared memory of the device for the cases required by OpenMP.
Shared memory stack.
To allow for the lowering of variables to shared memory the prologue of the output PTX kernel function is augmented with a stack allocated in the shared memory of the device:
. local . align 8 . b8 __local_depot [10] . shared . align 8 . b8 __shared_depot [10] e shared memory depot is of the same size as the local memory depot.
is allows us to reuse the local offsets within the shared stack as well. In general this is wasteful for shared memory and we aim to optimize this in future work. e evaluation of the amount of shared memory is shown in Section 6. e shared stack implementation requires a shared stack pointer. We create a special register similar to the local stack pointer which we add to the prologue of the function: mov . u64 % SPL , __local_depot mov . u64 % SPSH , __shared_depot cvta . local . u64 % SP , % SPL cvta . shared . u64 % SP , % SPSH e next step a er creating the shared memory stack is to use the shared stack pointer for those cases where a variable is shared under OpenMP semantics. e LLVM intermediate representation does not a ach any specific memory information to the allocation instruction alloca. An LLVM backend may choose to lower the alloca instruction to a any available memory types available on the device.
Detecting shared variables.
Until the addition of the shared memory stack, the NVPTX backend had only one option: implicitly mapping any alloca instruction to the thread's local memory. With the addition of the shared memory stack, the NVPTX backend needs to now choose which variables to lower to shared memory and which to lower to local memory. If we consider local memory allocation to be the default behavior, the NVPTX backend needs to contain a way to detect which variables should be lowered to the device shared memory.
In this initial implementation, the detection of shared variables is done on the basis of their address being taken. In the code generation strategy described in Section 5.1 shared variables have their address taken and stored in the array of references to shared variables. Every shared variable is therefore guaranteed to have its address taken at least once.
Checking whether the address is being taken is straightforward. We iterate through the uses of a given alloca. Whenever a reference to the allocated value is stored we assume that the variable is shared:
// Check if Ptr or an alias to it is // the destination of the store auto SI = dyn_cast < StoreInst >( Use ); if ( SI ) for ( auto Alias : PointerAliases ) if ( SI -> getValueOperand () == Alias ) return true ; // address is taken A list of aliases needs to be maintained in case the address of an alias of the original value is taken. For example, when the original value is put through a bit cast instruction.
e detection of shared variables and their lowering to shared memory needs to happen at the same time local variables are lowered to thread local memory in the LowerAlloca pass of NVPTX. We augment the pass with the means to detect whether the address of a particular variable is taken. is includes the taking of an address of any aliases of the original value returned by the alloca.
In the cases when the compiler is invoked with the -O0 flag, the LowerAlloca pass is not invoked. is means that an alternative pass needs to be create for this purpose. We call this new pass the FunctionDataSharing pass. is pass relies on the same detection strategy as the LowerAlloca pass.
Lowering alloca instructions to device shared memory.
Following the detection stage we need to lower the variables to the shared memory of the device. is can be achieved by inserting the address space cast instruction twice: to cast the variable from the generic to the shared address space immediately followed by an address space cast from the shared to the generic address space.
is will enable subsequent passes to link the original alloca with the usage of shared memory. is will also enable subsequent load and store instructions to bind to the shared version of the variable thus enabling the usage of specific instructions such as ld.shared and st.shared.
NVPTX contains several optimization passes over machine instructions. Address space casts are inserted just before the LLVM-IR code is translated into machine instructions. is ensures that the memory type information we inserted is preserved and passed down to the next abstraction level. Machine instructions are the first level of abstraction that contains frame indices. is is where a frame index is mapped to the generic, local or the newly added shared stack pointer register,
VRShared.
e VRShared register is to the shared memory stack what the VRFrameLocal register is to the local memory stack.
e lowering of frame indices to the shared index is performed in a new pass we call the LowerSharedFrameIndices pass. is pass traverses the kernel function and for each frame index we encounter, we check if the index has been translated to the shared register already. If it has not been translated, we check if the result of the operation on that frame index is converted to shared memory using one of the instructions we inserted before. An example of the machine instruction pa ern we need to identify is given by: % vreg25 < def > = LEA_ADDRi64 <fi >, 0; % vreg6 < def > = cvta_to_share d_ ye s_ 64 % vreg25 < kill >;
If the pa ern is detected, that frame index is replaced with the stack pointer of the shared frame:
e frame index is then added to a list of already translated indices. is ensures that the shared frame index is propagated appropriately to all instructions that use it.
is LowerSharedFrameIndices pass needs to occur before the StackColoring pass to ensure correctness of the stack slot coloring algorithm. If not, the algorithm may lead to the same local stack slot being used by both a local and a shared variable. is leads to the generation of incorrect code. Since the stack slot coloring algorithm works on frame indices, the earlier lowering of frame indices to the shared memory register excludes those frame indices from being considered by the algorithm.
EVALUATION
We exercise the generic code generation of OpenMP in the Clang compiler in combination with the implicit sharing of variables. In Section 6.1 we include the experimental setup, in Section 6.2 we showcase the results and in Section 6.3 we discuss the results.
Experimental setup
We test the new data sharing infrastructure on two NVIDIA GPUs, the K40 and the P100 GPU. ey each feature a shared memory area on a per-SM basis. On the K40 shared memory and L1 cache share the same 64 KB of physical memory and can be configured in three different ways: 48KB L1 + 16KB shared memory, 32KB L1 + 32KB shared memory and 16KB L1 + 48KB shared memory. In the default configuration the K40 uses 16KB/SM as shared memory.
e P100 GPU has 64 KB shared memory per SM, separate from the L1 cache.
e characteristics of the two GPUs that we are interested in are shown in Table 1 . In all experiments on the K40 we use the default split between shared memory and L1 cache: 16KB of shared memory and 48 KB of L1 cache.
We run two different test programs that exercise the implicit sharing of scalar and array variables. We evaluate the amount of shared memory in each case. We want to determine whether the shared memory is a limitation in these cases.
e two programs are shown in Figures 5 and 6 . e program in Figure 5 exercises the data sharing infrastructure in isolation.
e example only shows the case for sharing seven local variables and an array reference, requiring the sharing of eight values in total. To test the applicability of the scheme to more complex examples we test the scheme on up to 64 shared variables and note the impact on the amount of shared memory usage.
e program in Figure 6 uses arrays instead of variables. is keeps the register usage low while increasing the pressure on shared memory. We test up to four arrays of equal size to the number of worker threads.
Each OpenMP team is mapped to a CUDA threadblock. is means that anything we share across the team can be mapped to shared memory directly.
e number of teams covers the entire length of the output a array.
To be closer to setups used in practice, we fix the number of threads per team to 128. We use 96 worker threads in addition to the 32 threads used by the master warp. We use the thread limit clause to set the number of threads to 96 -the thread limit clause always sets the number of workers, the actual number of allocated threads is 32 higher than the thread limit to account for the master warp.
From our application experience, the most common choices for the number of threads per team is 128 or 256. Since shared memory is allocated on a team basis, we choose to test with 128 threads as this will lead to a higher number of teams and will be a stricter test for shared memory usage.
Experimental results
6.2.1 Kernel shared memory usage. We evaluate the shared memory usage of the programs for both K40 and P100 GPUs in Tables 2  and 3 respectively. e shared memory figures in these two tables are independent of the number of threads in a team.
e shared memory footprint is computed as the sum of: (1) the shared stack size allocated in the PTX kernel prologue, the size of which is impacted by the number of both local and shared variables (2) the statically pre-allocated shared memory list of references to shared variables which is allocated in libomptarget, (3) the thread private state which is maintained in shared memory, the size of the thread private state is affected by the combination of OpenMP directives contained by the target region.
In the case of the one implicitly shared variable there are two local variables which are used for holding two arguments passed to the kmpc kernel parallel function call from the worker function: the team number and the thread number the function is called from.
e number of pre-allocated bytes should cover the requirements for cases used in practice. We go beyond the preallocated number of variables to test the impact on the shared memory of the device. We also test the dynamic allocation employed for the list containing the references to shared variables. e dynamic allocation of this buffer is performed by libomptarget. Table 5 : Shared memory footprint of implicitly shared variables on the P100 GPU in the common use case where the target region can be executed by an arbitrary number of teams, each team comprising 128 threads.
6.2.2 Sharing scalar variables. e number of concurrent teams is affected by the number of registers required per thread and by the amount of shared memory required per team. We compute both these figures: the shared memory footprint as well as the number of registers allocated.
e results for the K40 and P100 GPUs are shown in Tables 4  and 5 respectively. In the second column we include the shared memory footprint computed in Section 6.2.1. In the third column we show the size of the dynamically allocated global memory to Table 8 : Maximum number of implicitly shared variables that can be allocated using all the available shared memory of a given SM of the K40 GPU if a certain number of concurrent teams per SM is enforced. For the number of registers per thread a thread limit of 128 threads was assumed.
hold the references to all the shared variables -note that variables are still held in shared memory. e number of registers required by each thread is shown in column four. We compute the maximum number of concurrent teams by assuming 65536 registers per SM and 128 threads per team. We multiply the number of teams with the shared memory footprint of every team in order to obtain the total shared memory volume used on an SM basis.
Sharing local arrays.
Similarly to the results in the previous section, Tables 6 and 7 measure the impact of array sharing on occupancy for the K40 and the P100 GPUs respectively.
Shared memory impact on occupancy.
We compute the number of variables that would be supported if register allocation was not a limiting factor. For each case we fix the number of concurrent teams.
is implies a maximum number of registers for each thread and also a number of variables. We consider the team size to be 128 threads. In Table 8 , for each fixed number of concurrent teams, we compute the number of variables required to use all available shared memory -16KB -on the K40 GPU.
Discussion
6.3.1 K40 GPU. In the results presented in this section we show that the footprint on shared memory for sharing scalar variables on a K40 GPU is relatively low. e number of registers allocated per thread is more of an occupancy limiter than shared memory is. e limits of our data sharing infrastructure are not reached even when a high number of concurrent teams is used since this is only possible for low number of shared variables.
e maximum shared memory footprint is achieved for 17 variables where the register usage of 42 limits the number of concurrent teams to 12 per SM, each team comprising 128 threads. e resulting shared memory footprint is 4332 bytes which is roughly 26% of the available 16KB of shared memory on the K40 GPU in its default memory configuration.
In the experiment in Section 6.2.4 we show that the maximum number of variables, either local or shared, that would reach the limit of the 16KB of shared memory on the K40 GPU, would need to lead to no more than 32 registers per thread.
e likelihood of such a large number of variables being allocated within such a small number of registers leads to a very small subset of possible kernels.
e impact on occupancy of shared memory increases in the case of array variables. Register allocation in this case is low and the maximum number of concurrent teams is therefore limited by the 16KB of shared memory on the K40 default configuration. Handling a larger volume of shared data is possible by increasing the portion of shared memory of the GPU from 16KB to 32KB. Under this configuration the memory is no longer a bo leneck and the maximum number of concurrent teams can be achieved for up to five arrays. e example in which only arrays are shared is an extreme example in the sense that no scalar variables are actually used. is means that the number of registers per thread is lower than in a practical hybrid example combining scalar and array variables.
6.3.2 P100 GPU. e P100 GPU benefits from an improved register allocation policy and a much larger shared memory area: 64 KB/SM. Our examples, due to their simplicity, do not benefit from the improved register allocation policy of the P100. e number of registers is roughly the same with that on the K40 GPU.
e analysis for sharing scalar variables is similar to that on the K40 GPU, this time the shared memory usage being even further from the device limit.
In the case of sharing array variables, the number of concurrent teams is not limited by the size of the shared memory despite the slightly lower register count.
For the P100 GPU, the maximum number of concurrent teams per SM is double that of the K40. Considering that the number of 32 bit registers per SM is the same, this leads to a maximum number of 16 registers per thread to achieve full concurrency. is leads to an even lower volume of shared memory per team.
LIMITATIONS AND FURTHER WORK
e scheme described in this paper has several limitations which we plan to address in future work.
One solution open to consideration is to dynamically allocate shared memory instead of preallocating it. is would require the compiler to be augmented with a memory model to estimate the shared memory requirements of a given kernel.
is would lead to a gain of an order of magnitude in performance for the cases in which we currently require more than the preallocated amount of shared variables.
ere are currently two main design decisions that increase the amount of shared memory being allocated per team.
ese decisions have been taken to simplify the design of the scheme and are open to optimizations:
• e shared stack being allocated in the prologue of the PTX kernel is currently of the same size as the local stack. Any offsets previously computed for the local stack can just be re-used for the shared stack. is is increasing the shared memory footprint even when shared memory does not need to be used. We therefore would like to optimize the number of shared memory slots required by a given target region by developing an appropriate offset computation.
• e detection of shared variables in the backend may end up including variables that do not need to be shared. Relying on Clang to mark the shared variables appropriately would be the more precise way of tracking which variables should be implicitly shared. is would also make it easier for the compiler to have a more precise estimate of the amount of shared memory required.
In this paper we discuss the sharing between master and worker threads which is a one-to-all pa ern. OpenMP o en requires an additional level of parallelism at worker level which may require the sharing of variables between all workers in an all-to-all pattern.
is will significantly increase the shared memory volume required by OpenMP. For such cases in which the shared memory of the device is not enough to allow for the allocation of a shared memory stack, a runtime managed global memory stack needs to be employed instead. We aim to address this in future work.
CONCLUSION
In this paper we introduce a new data sharing scheme for implicitly sharing variables in OpenMP. e new scheme involves changes to the Clang code generation and the libomptarget runtime library.
ese changes rely on the NVPTX backend of LLVM to perform the lowering of variables to the shared memory of the device for cases required by OpenMP semantics. ese changes are in line with the goal of generalizing the functionality of existing toolchains so that they can be used as targets by the device offloading capabilities of the OpenMP programming language in a maintainable way.
Despite the limitations of this scheme discussed in Section 7, we show that the shared memory volume that it requires is relatively low for the well-established NVIDIA K40 GPU and even more so for the newer NVIDIA P100 GPU.
When all we share are scalar variables, the shared memory usage is no more than 4.3 KB. e actual bo leneck in most of these cases is shown to be the high register usage. For cases where register usage is low and the volume to be shared is high (for example, when sharing statically allocated local arrays), we show that even when running on a K40, given an appropriate shared memory to L1 ratio, reaching the limit of the shared memory can be avoided.
ere are of course cases for which the shared memory of the device is not enough. For such cases a global memory implementation of data sharing will be included in future work. e experiments included in this paper, show that a model to estimate the shared memory needs of a kernel can be easily constructed for the benefit of both the users and the compiler. e relatively low shared memory footprint of this scheme ensures that, in practice, for user programs which contain a balanced number of shared array and scalar variables, the shared data can be fully contained by the shared memory of the device.
ACKNOWLEDGMENTS
e authors would like to thank scientists at U.S. DOE laboratories for their valuable input and feedback on the development process of the OpenMP compiler. We would like to thank the LLVM community of reviewers for their help in ge ing our patches upstremed.
is paper is partially supported by the CORAL project LLNS Subcontract No. B604142.
