As scientific frameworks become more sophisticated, so do their data structures. Data structures are no longer simple in design and have grown progressively more complicated. The prevailing trend in scientific applications is the nested data structure: a data structure that points to other data structures. Managing nested data structures on a modern heterogeneous system requires tremendous effort because the host and the devices have separate memory spaces.
Introduction
Energy efficiency has been at the forefront of high-performance computing (HPC) developments to tackle the energy and power consumption crisis of HPC systems [27]. A promising approach that fulfills DARPA's requirements [27] in designing the next generation of exascale supercomputers has been heterogeneity [9, 12, 15-17, 28]: from node-level heterogeneity, as in Titan [22] and Summit [21], to chip-level heterogeneity, as in System-on-a-Chip architectures [1, 5]. However, developing applications for heterogeneous systems is not an easy task and requires novel approaches (e.g., directive-based programming models [6, 8, 10, 19, 20]) to assist application developers in their efforts. Moreover, heterogeneous systems have multiple, separate memory spaces, and this design requires developers to explicitly issue data transfers from one memory space to another with a set of software APIs. For instance, in a system composed of a host processor and an accelerator, the host processor cannot directly access the data on the device and vice versa. For such systems, the data is copied back and forth between the host and the accelerator. This issue becomes particularly severe for scientific applications as their data structures become very complicated.

Figure 1. An example of a pointer chain. An illustration of a data structure and its children. In order to reach the position array, one must go through a chain of pointers to extract the effective address.
As a scientific framework becomes more sophisticated, so do its data structures. A data structure typically includes pointers (or dynamic arrays) that point to primitive data types or to other user-defined data types. As a result, transferring a data structure from the host to another device mandates the transfer of not only the main data structure but also its nested data structures. This transfer process is known as a deep copy. Keeping track of which pointers on the host correspond to which counterparts on the device further complicates the maintenance of the data structure. Although the deep copy operation avoids major changes to the source code, it imposes unnecessary data transfers on the application. In some cases, a selective deep copy is sufficient, when only a subset of the fields of the data structure is of interest on the device [3]. However, even though the data motion decreases proportionally, the burden of maintaining data consistency between the host and the other devices still exists.
Our contributions in this paper are as follows:
• We will discuss the challenges of transferring a nested data structure to the device and the available options to perform the transfer. We also discuss the semantics of deep copy, the required steps, and the available options to perform a deep copy operation (Section 2).
• We will introduce the pointerchain directive (Section 3) as an alternative approach to tackle the challenges imposed by the pointer management of a nested data structure. Our directive reduces the amount of code generated for the host and the device.
• We will design a set of benchmark applications to examine approaches that perform deep copy (Section 4). Our design includes two scenarios that benefit from performing deep copy: the Linear and Dense scenarios.
In the Linear scenario, our targeted array is placed deep within the nested hierarchy. In the Dense scenario, the intermediate pointers (e.g., atoms in Figure 1) are arrays themselves. This configuration puts considerable stress on the approach under examination.
• We will discuss the results of our proposed scenarios (Section 6) on three separate approaches to deep copy: Unified Virtual Memory (UVM) [13] , marshalling/unmarshalling the data structure tree, and pointerchain.
Semantics of Deep Copy
Memory spaces in modern HPC platforms are categorized into two separate spaces: the host memory space and the device memory space. A memory allocation in one space does not guarantee an allocation in the other. In order to guarantee data consistency, this design requires a complete replica of every data structure in both spaces. However, data structures get complicated as they preserve the complex states of an application. Figure 1 illustrates a common case in the design of a data structure for scientific applications. The arrows represent pointers. The number next to each structure shows the potential physical address of an object in memory. The main data structure is the simulation structure. Each object of this structure has embedded pointers to other structures, in this case, the atoms structure. The atoms structure also has a pointer to another structure, traits, and so on. Therefore, in order to access the elements of the positions array starting from the simulation object, we have to dereference the following chain of pointers: simulation->atoms->traits->positions. Every arrow in this chain goes through a dereference process to extract the effective address of the final pointer. We call this chain of accesses to reach the final pointer (in this case, positions) a pointer chain. Since every pointer chain eventually resolves to a memory address, we propose extracting that effective address and using it in place of its pointer chain in the parallel sections of the code.

There are two primary techniques to utilize pointer chains within the source code. The first technique is the deep copy, which requires excessive data transfer between the host and the device. The second technique is the utilization of Unified Virtual Memory (UVM) on NVIDIA devices [13]. UVM provides a single coherent memory image to all processors (CPUs and GPUs) in the system, which is accessible through a common address space. UVM eliminates the need for explicit data transfers by applications. Although it is an effortless approach for developers, it has several drawbacks: 1) it is only supported by NVIDIA devices; 2) it is not a performance-friendly approach, because its memory transfers are triggered implicitly and on demand. The consistency protocol in UVM depends on the underlying hardware and the device driver, which trace memory page faults on both host and device memories. Whenever a page fault occurs on the device, the CUDA [4, 18] driver fetches the most up-to-date version of the page from main memory and provides it to the GPU. A page fault on the host follows similar steps to fetch the updated page from the device.

Figure 2. Steps to perform a deep copy operation when the target device is a GPU. The horizontal line separates the memory spaces of the host and the GPU. (a) initialize the data structures; (b) copy the main structure to the GPU; (c) copy the other nested data structures to the device; (d) fix the corresponding pointers in every data structure.
As discussed above, scientific applications utilize nested data structures in their design. Any data structure (in C/C++) is composed of a set of simple or complex member variables. The simple member variables are those with primitive data types (e.g., int, float, double in C/C++). The complex member variables are those that are user-defined data structures themselves. The situation gets complicated when a complex member variable itself contains another complex data structure. The common approach in C/C++ for such cases is to define the complex member variables as pointers. Since the array size is not known at compile time, they have to be allocated at run time, and their addresses in memory become known only then. This is not ideal for heterogeneous platforms with separate memory address spaces.

Figure 2 illustrates the steps required to perform the deep copy. After initializing (Step a) and transferring (Steps b and c) the structures from the host to the device, the pointers in the transferred structures hold illegal addresses. They still point to the same memory addresses on the host, which are inaccessible on the device. We have to fix this issue by reassigning the pointers to their corresponding addresses on the device (Step d in Figure 2).

Deep copy, as described in [3], can be categorized into two groups: 1) full deep copy; 2) selective (partial) deep copy. A full deep copy operation copies a data structure with all of its nested data structures to the device. As a result, a replica of the whole structure is available on the device. The process shown in Figure 2 demonstrates a full deep copy operation. However, a full deep copy is not always appropriate, and we need mechanisms to perform a partial copy operation. In those cases, not all member variables of a data structure are accessed during a kernel execution on the device. As a result, there is no need to transfer them to the device. Consider the example in Figure 2. If our kernel only accesses array x->a, we should not copy array x->b to the device but keep it on the host. This can significantly reduce the cost of the copy operation. This is an example of a selective deep copy operation.
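The four steps in Figure 2 can be imitated on the host alone. The following is a minimal sketch under stated assumptions, not the OpenACC implementation: ordinary heap allocations stand in for device memory, and the names Traits, Atoms, Simulation, deepCopy, and freeTree are hypothetical, loosely following Figure 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical structures that loosely follow Figure 1.
struct Traits { double* positions; std::size_t n; };
struct Atoms { Traits* traits; };
struct Simulation { Atoms* atoms; };

// Full deep copy of the tree rooted at `host`.
// Assumption: heap allocations play the role of device memory.
Simulation* deepCopy(const Simulation* host) {
    // (b) Copy the main structure. Its `atoms` member still holds
    //     the address of the host object at this point.
    Simulation* dev = new Simulation(*host);

    // (c) Copy every nested structure and the payload array.
    Atoms* devAtoms = new Atoms(*host->atoms);
    Traits* devTraits = new Traits(*host->atoms->traits);
    double* devPositions = new double[devTraits->n];
    std::memcpy(devPositions, host->atoms->traits->positions,
                devTraits->n * sizeof(double));

    // (d) Fix the pointers so that each one refers to the copy.
    devTraits->positions = devPositions;
    devAtoms->traits = devTraits;
    dev->atoms = devAtoms;
    return dev;
}

// Release a tree that was created by deepCopy.
void freeTree(Simulation* s) {
    delete[] s->atoms->traits->positions;
    delete s->atoms->traits;
    delete s->atoms;
    delete s;
}
```

Without step (d), the copy would silently alias the host objects, which on a real accelerator corresponds to dereferencing an address that is not valid on the device. A selective deep copy would skip step (c) for members that the kernel never touches.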
Our proposed approach, which we call pointerchain [7] , is a directive-based approach that provides selective accesses to data fields of a nested data structure while minimizing error opportunities and changes to the source code. A brief description of pointerchain is provided in Section 3. For a detailed discussion, please refer to [7] .
The pointerchain directive
A chain of pointers, similar to the example shown in Figure 1, is translated into a set of machine instructions that extract the effective address of the chain on both the host and the device. However, dereferencing each intermediate pointer in the chain is the equivalent of two memory load instructions, which are high-cost operations. As the pointer chain lengthens with a growing number of intermediate pointers, the program has to perform excessive memory load operations to extract the effective address of the final member of the chain. This extraction process impedes performance, especially when it happens within a loop (e.g., a for-loop). In order to alleviate the cost of the extraction process, we propose to perform it once, before the computation region begins, and then reuse the extracted address within the region.
We demonstrate the idea behind the extracting process from a pointer chain using the example in Figure 1 . In this setup, we replace the pointer chain of simulation->atoms->traits->positions with its corresponding effective address, in this case, the memory address of positions array (0xB123) as shown in Figure 1 . We utilize this address for future data transfers to and from the device and also the computational regions. It prevents transferring redundant data structures (in this case, simulation, atoms, and traits) to the device, which, in any case, will remain intact on the device. The code executed on the device will modify none of these objects. Moreover, it keeps the device busy performing "useful" work rather than spending time on extracting effective addresses from the chain.
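The idea can be written out by hand in plain C++. The sketch below is illustrative only: the structure definitions are assumptions modeled on Figure 1, and the two function names are hypothetical. The first function walks the chain in every iteration, while the second resolves the effective address once and reuses it.

```cpp
#include <cassert>
#include <cstddef>

// Assumed layout, modeled on Figure 1.
struct Traits { double* positions; };
struct Atoms { Traits* traits; std::size_t N; };
struct Simulation { Atoms* atoms; };

// The chain is written out in the loop body, so each iteration may
// dereference simulation, atoms, and traits again.
void scaleViaChain(Simulation* simulation, double factor) {
    for (std::size_t i = 0; i < simulation->atoms->N; ++i)
        simulation->atoms->traits->positions[i] *= factor;
}

// The effective address is extracted once, before the loop, and the
// loop body only uses the local pointer.
void scaleViaEffectiveAddress(Simulation* simulation, double factor) {
    double* positions = simulation->atoms->traits->positions;
    const std::size_t n = simulation->atoms->N;
    for (std::size_t i = 0; i < n; ++i)
        positions[i] *= factor;
}
```

On the host, an optimizing compiler may perform this hoisting by itself when it can rule out aliasing. On a device with a separate memory space the second form has the additional benefit that only the array behind positions has to be present on the device.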
The effective address utilization, as a replacement to a pointer chain, however, demands code modifications on both the data transfer clauses and the kernel codes. To address these concerns, we propose a set of directives that leads to minimal code changes.
Expanded Version
In its simple form, the pointerchain directive accepts two constructs: declare and region. Developers use the declare construct to announce the pointer chains in their code. The syntax in C/C++ is as follows:
where variable is defined as below:
variable := name type[:qualifier]
where
• name: the pointer chain
• type: the data type of the effective address
• qualifier: an optional parameter that is either restrictconst or restrict. These decorate the underlying variable with __restrict const and __restrict in C/C++, respectively. The qualifiers give the compiler hints for optimizing the code with regard to the effects of pointer aliasing.
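The effect of the two qualifiers can be pictured with hand-written C++. This is a sketch of what the decorated local variables might look like, not the output of the actual tool: the structures and function names are hypothetical, and __restrict is a compiler extension (accepted by GCC, Clang, and MSVC) rather than standard C++.

```cpp
#include <cassert>
#include <cstddef>

// Assumed layout, modeled on Figure 1.
struct Traits { double* positions; };
struct Atoms { Traits* traits; };
struct Simulation { Atoms* atoms; };

// Qualifier restrictconst: the local pointer is decorated with
// __restrict const, so the pointer itself cannot be reassigned.
double sumPositions(Simulation* simulation, std::size_t n) {
    double* __restrict const positions = simulation->atoms->traits->positions;
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        total += positions[i];
    return total;
}

// Qualifier restrict: the local pointer is decorated with __restrict only.
void shiftPositions(Simulation* simulation, std::size_t n, double delta) {
    double* __restrict positions = simulation->atoms->traits->positions;
    for (std::size_t i = 0; i < n; ++i)
        positions[i] += delta;
}
```

In both cases the qualifier is a promise to the compiler that, within the scope of the local pointer, the array is not accessed through any other pointer.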
After declaring the pointer chains in our code, we have to determine the code region that is the target of the transformation. The following lines describe how to use the begin and end clauses with the region construct. The pointer chains that have been declared earlier in the current scope are subject to transformation in subsequent regions. Our proposed directive, pointerchain, is a language- and programming-model-agnostic directive. Although, in this paper, for implementation purposes, pointerchain targets C/C++ and the OpenACC [19] programming model, one can utilize it for the Fortran language or target the OpenMP [20] programming model as well.
Condensed Version
Our two proposed constructs (declare and region) provide developers with the flexibility of reusing multiple variables in multiple regions. However, there also exists a condensed version of pointerchain that performs the declaration and the replacement at the same time. The condensed version of pointerchain replaces the declared pointer chain with its effective address in the scope of the targeted region. It is placed on the region clauses.
<...>
#pragma pointerchain region end
The condensed version is a favorable choice in comparison to the declare/region pair when our kernels (regions) have few variables and we do not reuse the chains in the future. It leads to clean, high-quality code. Furthermore, utilizing the pair combination helps with code readability, reduces the complexity of the code, and expedites the porting process to the OpenACC and OpenMP programming models. Potentially, modern compilers will be able to incorporate the condensed version of pointerchain into the OpenACC or OpenMP directives directly. The following example shows how the condensed version could be incorporated into the OpenACC programming model.
Sample Code
Listing 1 shows an example of how to use pointerchain in source code. Lines 1-16 show the data structures for the configuration in Figure 1, including the main object variable (simulation). Our computational kernel, Lines 25-32, initializes the position of every atom in the system in 3D space. These lines represent an ordinary for-loop that has been parallelized with the OpenACC programming model. First, we declare our pointer chain (Line 18), then utilize the region clause to transfer the data to our target device (Lines 20-22), and finally utilize the region clause to parallelize the for-loop (Lines 25-32). Without pointerchain, parallelizing the for-loop requires transferring every member of the chain to the device separately while retaining their relationships during the transfer. This adversely impacts performance and makes the implementation challenging.

Pointerchain is capable of dealing with both pointers and scalar variables. Unlike pointers, scalar variables require more attention. The following example lays out the challenge we encounter in dealing with them. Suppose we want to change the number of atoms in the atoms structure (simulation->atoms->N). The declare clause extracts the value stored in this variable and records it in a temporary variable for reference in the upcoming regions. However, when the region is done, the temporary variable holds the most up-to-date value while its corresponding chain is unaware of the update. Therefore, at the end of the region, pointerchain writes the value of the temporary variable back to the corresponding pointer chain.
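The write-back of a scalar described above can be pictured as the following hand-written C++. It is a sketch of the behavior, not generated code: the structures and the function addAtoms are hypothetical, and the comments mark where the declare clause and the region boundaries would conceptually fall.

```cpp
#include <cassert>

// Assumed layout: only the scalar N of the atoms structure is shown.
struct Atoms { int N; };
struct Simulation { Atoms* atoms; };

void addAtoms(Simulation* simulation, int extra) {
    // declare: the value at the end of the chain is copied into a temporary.
    int N = simulation->atoms->N;

    // region begin: code inside the region works on the temporary only.
    N += extra;
    // region end: the chain still holds the old value at this point, so
    // the temporary is written back to keep both views consistent.
    simulation->atoms->N = N;
}
```

For a pointer no such step is needed, because the temporary and the chain refer to the same array; for a scalar the temporary is a copy, which is why the update has to be propagated explicitly.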
Implementation Strategy
We have developed a Python script that performs a source-to-source transformation of source code annotated with the pointerchain directives. Our transformation script searches for all source files in the current folder, finds those annotated with the pointerchain directives, and then transforms each pointerchain directive into its equivalent code.
Here is an overview of the transformation process. Upon encountering a declare clause, for each variable, a local variable is declared and initialized to the effective address of the corresponding pointer chain. If a qualifier is set for a chain, it is also appended to the declaration. Any occurrences of the pointer chains between the region begin and region end clauses are replaced with their counterpart local pointers that were declared in the same functional unit.
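The replacement step of this process can be sketched as a purely textual substitution. The code below is an illustration written in C++ to match the other examples; it is not the authors' Python script. The function names are hypothetical, the spelling of the region begin pragma is assumed by symmetry with region end, dropping the pragma lines from the output is an assumption, and handling of the declare clause is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Replace every occurrence of `from` in `text` with `to`.
std::string replaceAll(std::string text, const std::string& from,
                       const std::string& to) {
    if (from.empty()) return text;
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
    }
    return text;
}

// Rewrite `chain` to `local` on every line between the region markers.
// Lines outside a region are copied unchanged; marker lines are dropped.
std::vector<std::string> transformRegions(const std::vector<std::string>& lines,
                                          const std::string& chain,
                                          const std::string& local) {
    const std::string beginMark = "#pragma pointerchain region begin";
    const std::string endMark = "#pragma pointerchain region end";
    std::vector<std::string> out;
    bool inRegion = false;
    for (const std::string& line : lines) {
        if (line.find(beginMark) != std::string::npos) { inRegion = true; continue; }
        if (line.find(endMark) != std::string::npos) { inRegion = false; continue; }
        out.push_back(inRegion ? replaceAll(line, chain, local) : line);
    }
    return out;
}
```

A plain substring match is enough for a sketch, but it would also rewrite occurrences inside comments or string literals; a production tool would need to be more careful about tokens and scopes.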
Methodology
In this section, we discuss our methodology for benchmarking deep copy operations in two different scenarios: Linear and Dense. Each scenario is tested with various transfer and layout schemes. In the following, we describe each scenario and scheme in detail. All the source code of our microbenchmark is accessible on GitHub¹.
Linear Scenario
In the first case, we design a set of experiments to study the effect of nesting depth on the performance of applications. Figure 3 shows the data layout for the Linear scenario. All the data structures in this scenario have similar member variables. They consist of two integer variables (nA and nLnext), a floating-point array (A), and a pointer to the next nested data structure (Lnext). The main data structure is the data structure at level 0, which is designated L_0. Our design for this scenario has two parameters: k and n. The parameter k controls the depth of our data layout, and the parameter n controls the length of the extra payload that we assign to each nested data structure.
In order to perform these experiments, we developed a Python script that accepts an integer k as an input parameter and generates a total of k C++ source files with 1 to k nested data structures, similar to the configuration in Figure 3 . The parameter n is an input to the main program of each C++ source file.
Transfer Schemes
For Linear scenarios, we have three options to transfer the data structures to the device:
1. UVM: Targeting NVIDIA GPUs, we utilized UVM for memory allocations. UVM allows developers to allocate memory that is accessible by both the host and the device. With the -ta=tesla:managed flag at compile time, the PGI compiler serves every memory allocation request (malloc) of the application from UVM.
2. Marshalling data structures: We developed a method to enable the marshalling/demarshalling of structures at run time using the acc_attach/acc_detach API methods in OpenACC. Algorithm 1 shows the steps our implementation takes to perform the marshalling. At the beginning, developers determine how big the whole tree is (the main data structure with all of its nested data structures). Then, we allocate a buffer of that size. Afterwards, any subsequent memory allocation request from the program is served by returning the next available space from our allocated buffer. These steps compact all the allocated memory into a contiguous region. This is the ideal case for transferring a complicated data structure tree in one batch instead of one batch per structure. After transferring the whole buffer to the device, we have to call acc_attach on each pointer so that the pointers on the device point to the correct memory addresses. The demarshalling process is performed in exactly the reverse order of the marshalling algorithm. It is highly probable that the implementations of deep copy in different compilers follow a similar marshalling approach.
3. pointerchain: Finally, we investigate the effectiveness of our proposed directive as described in Section 3.
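The marshalling scheme in item 2 can be imitated on the host with a simple arena allocator. The sketch below makes several assumptions and is not the authors' implementation: a second heap buffer stands in for device memory, the call to acc_attach is replaced by rebasing each recorded pointer by its offset within the buffer, and the names Arena, allocate, recordPointer, and transfer are hypothetical. The Node structure follows the member names given for the Linear scenario.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>
#include <vector>

struct Node { int nA; int nLnext; double* A; Node* Lnext; };

class Arena {
public:
    // One buffer that is large enough for the whole tree.
    explicit Arena(std::size_t totalBytes) : buff_(totalBytes), used_(0) {}

    // Serve an allocation request of w bytes from the buffer.
    void* allocate(std::size_t w) {
        const std::size_t align = alignof(std::max_align_t);
        const std::size_t start = (used_ + align - 1) / align * align;
        assert(start + w <= buff_.size());
        used_ = start + w;
        return buff_.data() + start;
    }

    // Remember a pointer member that lives inside the buffer and that
    // points into the buffer, so it can be fixed after the transfer.
    void recordPointer(void* slot) {
        slots_.push_back(static_cast<std::size_t>(
            static_cast<unsigned char*>(slot) - buff_.data()));
    }

    // Stand-in for the transfer: copy the buffer in one batch, then make
    // every recorded pointer refer to the same offset in the copy.
    std::vector<unsigned char> transfer() const {
        std::vector<unsigned char> dev(buff_);
        for (std::size_t slotOffset : slots_) {
            unsigned char* hostTarget = nullptr;
            std::memcpy(&hostTarget, buff_.data() + slotOffset, sizeof hostTarget);
            if (hostTarget == nullptr) continue;
            unsigned char* devTarget = dev.data() + (hostTarget - buff_.data());
            std::memcpy(dev.data() + slotOffset, &devTarget, sizeof devTarget);
        }
        return dev;
    }

private:
    std::vector<unsigned char> buff_;
    std::size_t used_;
    std::vector<std::size_t> slots_;
};
```

Because the whole tree is contiguous, the transfer is a single copy, and the cost that remains is one pointer fix per recorded slot. The sketch assumes that all object pointers share one representation, which holds on mainstream platforms.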
Layout Schemes
Algorithm 1. Marshalling a data structure tree.
    n ← determineTotalBytes(struct)
    buff ← Allocate n bytes buffer on heap
    for memory allocation of size w do
        Append the allocation request to the requestList
        Return a pointer to w bytes from buff
    end for
    Transfer buff to the device
    for req in requestList do
        acc_attach(req)
    end for

Three separate layout schemes are introduced for our Linear scenario. The layout schemes differ in whether the A arrays in Figure 3 are allocated or not, and whether they are transferred to the device and utilized or not.
1. allinit-allused: In this scheme, the A arrays in all levels allocate n elements and are accessed on the GPU. Our kernel scales all elements of the A arrays by an arbitrary number. This layout scheme helps us understand the efficiency of each transfer scheme when a full deep copy is inevitable.
2. allinit-LLused: Similarly, we allocate n elements for all the A arrays; however, only the A array of the last level is utilized within a kernel on the device. This scheme helps us understand how selective deep copy improves performance when the kernels target only a subset of the data structures on the device.
3. LLinit-LLused: In this scheme, only the A array in the last level (L_k) allocates memory space. This scheme helps us understand which transfer scheme performs best for a long chain of pointers. This is a dominant scheme in scientific applications like molecular dynamics simulations [7].
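The three layout schemes can be made concrete with a small host-only sketch. The member names follow the description of the Linear scenario, but everything else is an assumption for illustration: the meaning given to nLnext, the initial value 1.0, and the function names buildChain, scaleAll, scaleLast, and freeChain are all hypothetical, and the loops stand in for the device kernel.

```cpp
#include <cassert>
#include <cstddef>

struct Level {
    int nA;        // number of elements in A
    int nLnext;    // number of structures behind Lnext (assumed 0 or 1 here)
    double* A;     // payload array
    Level* Lnext;  // next nested level, or nullptr at the last level
};

// Build L_0 ... L_k. With allInit every level gets n elements (allinit-*),
// otherwise only the last level does (LLinit-LLused).
Level* buildChain(int k, int n, bool allInit) {
    Level* head = nullptr;
    for (int level = k; level >= 0; --level) {
        const int count = (allInit || level == k) ? n : 0;
        Level* node = new Level{count, head ? 1 : 0,
                                count ? new double[count] : nullptr, head};
        for (int i = 0; i < count; ++i) node->A[i] = 1.0;
        head = node;
    }
    return head;
}

// allinit-allused: the kernel scales the A array of every level.
void scaleAll(Level* head, double factor) {
    for (Level* l = head; l != nullptr; l = l->Lnext)
        for (int i = 0; i < l->nA; ++i) l->A[i] *= factor;
}

// allinit-LLused and LLinit-LLused: only the last level is scaled.
void scaleLast(Level* head, double factor) {
    Level* last = head;
    while (last->Lnext != nullptr) last = last->Lnext;
    for (int i = 0; i < last->nA; ++i) last->A[i] *= factor;
}

void freeChain(Level* head) {
    while (head != nullptr) {
        Level* next = head->Lnext;
        delete[] head->A;
        delete head;
        head = next;
    }
}
```

On a typical 64-bit platform this structure occupies 24 bytes (two 4-byte integers and two 8-byte pointers), which agrees with the structure size used in the data size calculation.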
Data Size
The amount of data generated by our tree of data structures for each layout scheme, as shown in Figure 3, is as follows. For the allinit-allused and allinit-LLused cases, the size of our configuration, as a function of n and k, is:
where 24 is the size of the L_i structures and 8 is the size of an element of A in bytes (for double-precision floating-point numbers). For the LLinit-LLused case, the data size can be computed as follows:
Dense Scenario
In the Dense scenario, the intermediate pointers are arrays of objects instead of single objects. Figure 4 illustrates the Dense scenario. This configuration provides a dense tree of data whose size grows exponentially with small changes in either parameter of our design. The parameter q describes the number of elements in the intermediate arrays L_i, and the parameter n determines the number of elements in the A arrays.
Transfer Scheme
In comparison to the Linear scenario, transferring the data structure tree represented in Figure 4 is more complicated. For the marshalling and pointerchain approaches, extra work is required to make the intermediate pointers legal on the device so that they can be dereferenced correctly. In cases similar to Dense, utilizing the pointerchain directive to perform a full deep copy operation is not a viable option due to the number of intermediate pointers, which grows exponentially in this case. We utilize UVM, marshalling, and pointerchain to transfer the data structure tree to the device, as in the Linear scenario. Each scheme is described in detail in Section 4.1.1.
Layout Scheme
In the Dense scenario, we choose an arbitrary index of each intermediate array L_i (in our case, the last element of the array) and transfer the associated A array to the device to perform our computational kernel. For instance, for the configuration shown in Figure 4, the kernel that we target to parallelize will look like Listing 2, where q is the number of elements in the intermediate arrays L_i, and a0 is the main structure at the first level.
Data Size
The amount of data generated by the data structure tree in the Dense scenario, as shown in Figure 4, is very sensitive to the input parameters q and n. Small changes in these parameters lead to significant increases in the data size. Equation 3 shows the amount of data generated in bytes for our configuration in recursive form:

DataSize(q, n, 0) = 12 + 8n    (3)

where 24 is the size of the L_i structures, 8 is the size of each element in array A, q is the length of the intermediate arrays, and D is the depth of our nested data structure. DataSize(q, n, 0) refers to the size of our last-level data structures (the L_3 structures in Figure 4). For our experiments in this paper, we set the maximum value of D to 3. Please note that the last-level data structure is half the size of the original structure.

Algorithm 2. Steps taken by each benchmark application.
2: 1-Allocate memory for whole tree structure
3: 2-Initialize the tree
4: 3-Transfer the tree to the device with a transfer scheme
5: 4-Run the kernel once
6: 5-Transfer the tree back to the host
7: 6-Check the results
8: 7-Measure the wall-clock time
Experimental Setup
We performed our experiments on two GPU architectures and collected the results. The Sabine [25] cluster, located at the University of Houston, hosts HPE compute nodes. Each system is equipped with two Intel Xeon E5-2680v4 CPUs, with 28 logical cores, and 256GB of host RAM. Sabine has both NVIDIA P100 and V100 GPUs. The P100 GPUs have 16GB of global memory with 4MB L2 caches. The V100 GPUs also have 16GB of global memory, while their L2 caches are 6MB. Our software environment, for both systems, includes the PGI compiler 18.4.
For the Linear scenario, we developed a Python script that accepts an integer number, count, as input and generates a set of C++ source files for k ∈ [2, count]. Each source file is a stand-alone application. The data structure tree depicted in Figure 3 is generated statically for each k to allow the compilers to apply optimizations to the source code efficiently. For each k, our script generates nine files: three transfer schemes by three layout schemes. As an example, suppose we pass 10 to our Python script. Then, the total number of files generated by our script is 81 ((count − 2 + 1) × 3 × 3 = 81).
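The file count in the example above is simple arithmetic and can be checked with a few lines of code. The function name generatedFiles and its default arguments are hypothetical and only restate the formula from the text.

```cpp
#include <cassert>

// Number of generated source files for k in [2, count]:
// one file per combination of k, transfer scheme, and layout scheme.
constexpr int generatedFiles(int count, int transferSchemes = 3,
                             int layoutSchemes = 3) {
    return (count - 2 + 1) * transferSchemes * layoutSchemes;
}

// The example from the text: count = 10 yields 9 values of k and 81 files.
static_assert(generatedFiles(10) == 81, "(10 - 2 + 1) * 3 * 3 == 81");
```

The smallest valid input, count = 2, yields a single value of k and therefore nine files.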
For the Dense scenario, we developed three different transfer schemes (UVM, marshalling, and pointerchain) to perform the selective deep copy. Each scheme accepts two inputs, n and q, which were described in Section 4.
Algorithm 2 displays the steps that each benchmark application takes. At the beginning of the application, we allocate the memory for our data structure tree. We then initialize it with arbitrary values. Then, we transfer the whole data structure to the device using one of the transfer schemes explained in detail in Section 4. We run a kernel on our tree. The kernel scales every element of the A arrays by a constant value. Based on the chosen layout scheme, whether it is allused or LLused, all or only the last-level A arrays are scaled, respectively. After running the kernel, we transfer the tree back to the host and check the results.
For both the Linear and Dense scenarios, we measure two different metrics: (a) the wall-clock time of the whole application, and (b) the kernel execution time. The wall-clock time is measured to investigate the effect of each transfer scheme on each scenario. The kernel execution time is measured to give us insight into how different data layouts affect the kernel's performance. Not only the execution time, but also the total number of instructions generated by the compiler is affected by the choice of transfer scheme.
We used Google Benchmark [11] to measure the execution time (i.e., the kernel and the wall-clock time). It is a lightweight, powerful framework to benchmark functions. Through a set of preliminary tests, the framework learns how many iterations are required to obtain a consistent result within a low error margin. Each test case is implemented as a function, and then the whole function is benchmarked with Google Benchmark. For the kernel time, we benchmarked only the kernel computations in Step 4 (line 5) of Algorithm 2.
Results
We performed our experiments that were designed in Section 4 on the Sabine systems (P100 and V100). Results are provided in this section.
Linear Scenario
We measured the wall-clock and kernel times of the experiments designed for the Linear scenario. Figure 5 shows the wall-clock time for different numbers of levels and different layout schemes. Results are normalized with respect to the UVM approach.
Wall-clock time
Results for the allinit-allused layout scheme reveal how increasing the parameter n leads to performance loss for all values of k. As we increase the total size of the tree (increasing both n and k), there is no performance loss when UVM is utilized, and it becomes a viable option in comparison to the other methods. Furthermore, UVM is a feasible approach to transfer data between the host and the device when applications are dealing with huge amounts of data. It offers developers more productivity at the same level of performance when we are targeting huge data. However, when n is moderate (n < 10^5) and the chain length (k) is small, marshalling and pointerchain outperform UVM. Furthermore, there is no notable difference between the architectures (P100 and V100) for the allinit-allused scheme.
On the other hand, the allinit-LLused scheme is more sensitive to the transfer scheme than to the underlying architecture. As n increases, the gap between marshalling and pointerchain widens. For larger k values, pointerchain outperforms both marshalling and UVM. Thus, pointerchain is the better option for a deep copy operation in comparison to the other two options when we are dealing with huge data sets. As k increases, the marshalling scheme performs worse, while the performance of pointerchain is not affected and remains constant. There is no notable difference between the architectures, and the transfer scheme determines the performance. It is the underlying data transfer medium, in our case the PCI-E bus, that determines the upper bound of the performance.
Finally, for the LLinit-LLused scheme, UVM has the worst performance results. The results show that when our kernel targets an array in the last-level data structure, utilizing either marshalling or pointerchain leads to better performance. The pointerchain scheme shows promising results when n < 10^5. However, for n > 10^5, the architecture determines the winner. The V100 architecture shows a 2X improvement in performance for the marshalling and pointerchain schemes, whereas the P100 shows a 1.25X improvement. For all values of k and n, pointerchain performed better than marshalling.
Kernel execution time
Figure 6 shows the kernel time, normalized with respect to UVM, for different level counts and different layout schemes. There is no notable difference among the transfer schemes, layout schemes, and architectures. For nearly all values of n and k, the results follow the same trend. However, we observe the best performance when n ∈ [10^4, 10^6]. Table 1 shows the total size of our data structure tree as we change k and n. For all values of k, while n < 10^5 the whole data set fits in the L2 cache of the P100 and V100 GPUs. As we increase n, the L2 cache is no longer big enough, which forces cache evictions and, subsequently, costs performance. This is the reason that we observe an increasing trend in the execution time in Figure 6. This confirms our finding: when we are dealing with data structures of huge sizes, there is no notable difference in performance between UVM and the other transfer schemes for complex data structures.
Dense Scenario
We measured the wall-clock and kernel time of the experiments designed for the Dense scenario. Figure 7 shows the normalized wall-clock time and kernel time with respect to UVM for different level count and different layout schemes.
Wall-clock time
The key factor that determines the performance of the whole application is the transfer scheme. The pointerchain scheme performs consistently better in comparison to the marshalling. In cases like n = 10 and n = 100, pointerchain basically shows two orders of magnitude performance improvements in comparison to marshalling. In such cases, UVM shows close to 10X improvement over marshalling.
However, as q increases, the performance gap between pointerchain and marshalling shrinks. Moreover, Figure 7 shows that in the Dense scenario the underlying architecture does not contribute to the performance; it is the transfer scheme that determines it. The reason for the performance deficiency of the marshalling scheme is the extra work required to ensure pointer consistency on the device: for each pointer, we must fix the address stored in the structure so that it points to the correct location in the memory space of the device.
Table 1. Total data size of our data structure tree as defined in the Linear scenario for the allinit-allused scheme. We used Equation 1 to calculate these numbers. The data size increases as n and k increase. The first row is in kilobytes, while the rest of the numbers are in megabytes.
Kernel execution time
Figure 7 also demonstrates the performance of the kernel with respect to the different transfer schemes introduced in Section 4. Although the differences are small, the marshalling scheme leads to a more performance-friendly data layout than pointerchain on both architectures. While kernels executed on the marshalled data perform better than their UVM counterparts, the pointerchain scheme suffers some performance loss. Consequently, when a kernel is executed multiple times on the same data, the data layout of the marshalling scheme results in better performance. This effect is due to the cache-friendly layout of our marshalling implementation. The marshalling scheme places the arrays as close as possible to the pointers that point to them; this is not necessarily the case for the pointerchain scheme. In pointerchain, the arrays are scattered around the global memory of the GPU and do not necessarily reside in the same memory page as the pointer itself.
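The pointer fix-up step that marshalling requires, and the resulting layout in which an array sits right next to the pointer that refers to it, can be illustrated with a minimal host-only sketch in C. This is not the implementation used in our benchmark: the Node structure and the functions marshal_node and fix_pointers are hypothetical names introduced here, and the "device" copy is assumed to be an ordinary second buffer rather than real accelerator memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical one-level structure: a header that points to one array. */
typedef struct {
    size_t n;      /* number of elements in data */
    double *data;  /* points to the nested array */
} Node;

/*
 * Pack the header and its array into one contiguous buffer.
 * The header sits at offset 0 and the array directly after it, so the
 * pointer and the data it refers to end up adjacent in memory.
 * Returns the buffer (the caller frees it) and stores its size in *bytes.
 */
unsigned char *marshal_node(const Node *src, size_t *bytes)
{
    assert(src != NULL && bytes != NULL);
    size_t total = sizeof(Node) + src->n * sizeof(double);
    unsigned char *buf = malloc(total);
    if (buf == NULL)
        return NULL;

    Node header = *src;
    header.data = NULL; /* the host address is meaningless at the destination */
    memcpy(buf, &header, sizeof(Node));
    memcpy(buf + sizeof(Node), src->data, src->n * sizeof(double));

    *bytes = total;
    return buf;
}

/*
 * Pointer fix-up: once the buffer has been copied to its destination,
 * make the embedded pointer refer to the array inside that copy.
 */
Node *fix_pointers(unsigned char *dst)
{
    assert(dst != NULL);
    Node *node = (Node *)dst;
    node->data = (double *)(dst + sizeof(Node));
    return node;
}
```

In this sketch, a caller would pack a Node with marshal_node, copy the returned bytes to the destination buffer, and then call fix_pointers on that copy. With deeper nesting, one such fix-up is needed for every pointer in the tree, which is consistent with the extra wall-clock cost observed for marshalling.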
Instruction Count
Dereferencing a pointer generates a set of instructions that retrieve the effective address of the pointer. For the Tesla V100, the PGI compiler generates two instructions per dereference operation: 1) an instruction to load the address from global memory into a register (ld.global.nc.u64); 2) an instruction to convert the loaded generic address to an address in the global state space of the device (cvta.to.global.u64). For every chain, the processor has to execute these instructions to extract the effective address. Tables 3 and 4 show the total number of instructions generated by the PGI compiler for the Linear and Dense scenarios, respectively. To count the number of instructions, we generated the PTX files by enabling the keep flag at compile time (-ta=tesla:cc70,keep). Then, we counted the number of lines (LOC) in the generated PTX file.
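The counting procedure above can be sketched as a small C helper. This is only an illustration under stated assumptions: count_lines and count_mnemonic are hypothetical helpers, not the tooling used for Tables 3 and 4, and SAMPLE_PTX is a hand-written fragment in PTX syntax, not actual PGI compiler output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hand-written fragment (an assumption): two dereferences, then one load. */
const char SAMPLE_PTX[] =
    "ld.global.nc.u64 %rd2, [%rd1];\n"
    "cvta.to.global.u64 %rd3, %rd2;\n"
    "ld.global.nc.u64 %rd4, [%rd3];\n"
    "cvta.to.global.u64 %rd5, %rd4;\n"
    "ld.global.f64 %fd1, [%rd5];\n";

/* Count the lines in a text buffer; a final line without '\n' also counts. */
size_t count_lines(const char *text)
{
    assert(text != NULL);
    size_t lines = 0;
    const char *p = text;
    while (*p != '\0') {
        const char *nl = strchr(p, '\n');
        lines++;
        if (nl == NULL)
            break;
        p = nl + 1;
    }
    return lines;
}

/* Count non-overlapping occurrences of a mnemonic in the text. */
size_t count_mnemonic(const char *text, const char *mnemonic)
{
    assert(text != NULL && mnemonic != NULL);
    size_t len = strlen(mnemonic);
    size_t hits = 0;
    if (len == 0)
        return 0;
    for (const char *p = strstr(text, mnemonic); p != NULL;
         p = strstr(p + len, mnemonic))
        hits++;
    return hits;
}
```

Counting the two mnemonics separately, in addition to the total line count, would make it possible to check that the load and conversion instructions appear in pairs, one pair per dereference.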
The results for the Linear scenario, as shown in Table 3, reveal up to a 31% reduction in the generated code for GPUs. The LOC for pointerchain in the LLused schemes remains constant, since we reduce every pointer chain in our application to a single pointer. For the UVM and marshalling schemes, however, the total generated code grows as k increases, since the whole chain of pointers has to be dereferenced. For the allinit-LLused and LLinit-LLused schemes, the LOC increases by two lines between two consecutive values of k. For the allinit-allused scheme, since we are dealing with multiple pointer chains, the trend is not linear; however, we save more instructions in this case. Table 4 shows similar results for the Dense scenario. We make two observations: 1) the marshalling scheme did not increase the number of instructions with respect to UVM; 2) pointerchain led to a 25% reduction in generated instructions.
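The effect of reducing a pointer chain to a single pointer can be illustrated in plain C. The sketch below does not show the pointerchain directive itself; the structures Simulation, Box, and Atoms and all function names are hypothetical, and dereference_instructions simply encodes the two-instructions-per-dereference figure reported above for the PGI compiler on the V100.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-level nesting; the names are illustrative only. */
typedef struct { double *position; size_t n; } Atoms;
typedef struct { Atoms *atoms; } Box;
typedef struct { Box *box; } Simulation;

/* Every access walks the whole chain: sim -> box -> atoms -> position. */
double sum_via_chain(const Simulation *sim)
{
    assert(sim != NULL);
    double total = 0.0;
    for (size_t i = 0; i < sim->box->atoms->n; i++)
        total += sim->box->atoms->position[i];
    return total;
}

/* The chain is resolved once, before the loop; the loop sees one pointer. */
double sum_via_extracted_pointer(const Simulation *sim)
{
    assert(sim != NULL);
    const double *position = sim->box->atoms->position; /* effective address */
    const size_t n = sim->box->atoms->n;
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += position[i];
    return total;
}

/* Assumed cost model: one load plus one address conversion per dereference. */
size_t dereference_instructions(size_t chain_length)
{
    return 2 * chain_length;
}
```

Under this cost model, adding one level to the chain adds two instructions, which matches the two-line growth in LOC between consecutive values of k, while the extracted-pointer version stays constant regardless of the depth of the chain.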
Related work
Modern HPC applications and simulation frameworks make extensive use of deeply nested data structures in their design and source code [14, 23, 24, 26]. In such cases, the patterns depicted in Figure 1 occur frequently in the source code, and extensive care is required to ensure data consistency in a heterogeneous environment with different memory spaces. Such environments need a deep copy of the data structures between the different memory spaces. Deep copy has been a challenging task for HPC developers for the past couple of years. Technical report (TR-16-1) [3] was the first attempt to formulate and propose a solution to the deep copy problem based on a real HPC application (ICON [26]). Cray [2] proposes the use of policy and shape within the definition of the data structures to support both selective and full deep copy. However, it was only supported by their compiler. PGI has recently started to support deep copy in its latest compiler as part of a draft implementation of OpenACC 3.0. However, at the time of writing this paper, we did not have access to the latest version of the PGI compiler. The solutions proposed by the above-mentioned vendors differ in their approach.
NVIDIA, on the other hand, introduced its UVM technology [13] to eliminate the need for manually updating the different memories of a heterogeneous system. The underlying CUDA library tracks the dirty pages on the memory subsystems and provides the most up-to-date version of a memory page to the device requesting it. However, unlike deep copy approaches, UVM requests happen at arbitrary times during the execution of an application and slow down a running kernel.
Conclusion
In this paper, we designed and implemented a benchmark suite for deep copy operations on a heterogeneous platform. We introduced two sets of scenarios that reflect different ways of developing an application: the Linear and Dense scenarios. In short, Linear helps investigate the effect of a sparse data structure tree, while Dense helps study a very dense data structure. Each scenario has a set of transfer and layout schemes. The transfer schemes determine how the whole tree of the data structure (the main structure with its nested ones) is transferred to and from the devices. The layout schemes determine how allocations are performed for each array within the data structures.
In addition to the benchmark suite, we proposed pointerchain as a simple, low-overhead directive to address selective deep copy for nested data structures. Our results reveal that pointerchain outperforms current state-of-the-art approaches. In the Dense scenario, pointerchain performs orders of magnitude better than NVIDIA's UVM.
