In this paper, we propose a runtime, called HUM, which hides host-to-device memory copy time without any code modification. It overlaps the host-to-device memory copy with host computation or CUDA kernel computation by exploiting Unified Memory and fault mechanisms. HUM provides wrapper functions of CUDA commands and executes host-to-device memory copy commands in an asynchronous manner. We also propose two runtime techniques. One checks if it is correct to make the synchronous host-to-device memory copy command asynchronous. If not, HUM makes the host computation or the kernel computation wait until the memory copy completes. The other subdivides consecutive host-to-device memory copy commands into smaller memory copy requests and schedules the requests from different commands in a round-robin manner. As a result, the kernel execution can be scheduled as early as possible to maximize the overlap. We evaluate HUM using 51 applications from Parboil, Rodinia, and CUDA Code Samples and compare their performance under HUM with that of hand-optimized implementations. The evaluation result shows that executing the applications under HUM is, on average, 1.21 times faster than executing them under original CUDA. The speedup is comparable to
Introduction
Heterogeneous computing uses different types of processors together to gain performance and energy efficiency. The processors include CPUs, GPUs, FPGAs, DSPs and accelerators of other types. GPU is one of the most popular accelerators and many programming models have been proposed to use it efficiently. CUDA [26] is one of the popular programming models for GPUs.
CUDA Unified Memory (UM) is a memory pool that has a single address space and can be accessed by both the host and the GPU [31] . A UM object is allocated by invoking cudaMallocManaged() in a CUDA program. When UM is used, a CUDA program does not need to explicitly move data between the host and the device. In other words, there is no need to use cudaMemcpy() or cudaMemcpyAsync() in the CUDA program. The UM system exploits the page fault engine in the GPU [29] , and it automatically migrates accessed pages between the host and the GPU. UM significantly lessens the burden of a programmer to manage data distribution across the host and the GPU. However, using UM solely does not guarantee good performance. To fully exploit UM and improve performance, the programmer needs to add user hints to the source code to prefetch pages that are going to be accessed during the kernel execution. For example, to give the user hint, the programmer manually inserts cudaMemPrefetchAsync() before the kernel is executed to prefetch memory to a specified destination GPU.
By exploiting CUDA UM and fault mechanisms in both the CPU and the GPU, overlapping data transfers and computation can be well controlled. This can be a solution to one of the major challenges in heterogeneous computing: hiding the memory transfer time between the host and the device as much as possible. In this paper, we propose a runtime, called HUM (Hidden Unified Memory), as a solution of this problem. It automatically hides the host-to-device memory copy (in short, H2Dmemcpy hereafter) time by overlapping it with host computation or kernel computation.
Here, the host computation is the execution of the host code that does not depend on H2Dmemcpy commands. It includes CPU computation, host memory allocation/deallocation, file I/O, etc. To copy data from the host memory to the device memory, CUDA provides both synchronous (blocking) commands (e.g., cudaMemcpy()) and asynchronous (non-blocking) commands (e.g.,cudaMemcpyAsync()). The asynchronous function call is asynchronous with respect to the host, hence the call may return before the copy completes while the synchronous function call returns after the copy has completed.
Overlapping H2Dmemcpy and host computation. For the best application performance, the programmer is recommended to use asynchronous memory copy commands to perform useful CPU tasks in parallel with the memory copy. However, it is difficult for the programmer to safely replace a synchronous memory copy command with an asynchronous one. By exploiting UM and the page fault engine, when a H2Dmemcpy command is synchronous, HUM makes it non-blocking (asynchronous). As a result, the H2Dmemcpy and some CPU computation are overlapped. To guarantee safety for the overlap, HUM exploits the segmentation fault mechanism in the host side at run time. When the host tries to access a page in the source host memory object of the H2Dmemcpy command that has not been copied to the device memory space yet, a segmentation fault occurs. HUM catches it and makes the host wait until the H2Dmemcpy operation on that page completes. Then, the host side computation continues.
Overlapping H2Dmemcpy and kernel computation.
While UM supports automatic overlapping of kernel execution and data transfers between the host and the device, it requires a programmer to explicitly use UM allocation commands (e.g., cudaMallocManaged()) in the source code. In addition, to fine-tune the transfers, user hints (e.g., cudaMemPrefetchAsync()) are also required. However, HUM does not require any explicit UM command and user hint. By exploiting the GPU page fault mechanism and UM, HUM automatically overlaps the H2Dmemcpy and the kernel computation without regards to if the copy command is synchronous or asynchronous. Even if the copy command is asynchronous, it is still beneficial to use HUM for performance.
To the best of our knowledge, HUM is the first work that automatically hides the H2Dmemcpy time by overlapping it with host computation or kernel computation without any explicit UM command and any modification of the source code. HUM improves the performance of traditional CUDA programs whose memory transfers are not optimized. The programmer is just required to explicitly write data transfer commands in the traditional way without any optimization. HUM automatically overlaps the memory transfers with the host computation or CUDA kernel computation. Arguably, most of the CUDA programs are still being developed in the traditional way without using UM because of performance. For new CUDA program developers, it is easier to write a traditional CUDA program without data transfer optimizations than using UM with prefetches and user hints to maximize performance.
Even though HUM targets traditional CUDA programs, it also correctly works with the CUDA programs that use UM. However, HUM does not optimize the memory transfers for the memory regions that are allocated through explicit UM allocation commands in the host program.
Major contributions of this paper are summarized as follows:
• We propose a runtime, called HUM, which exploits CUDA UM and fault mechanisms of both the host and the GPU. It automatically hides the H2Dmemcpy time by overlapping it with the host or kernel computation. We describe its design and implementation. • We propose a runtime technique that exploits the host side page protection mechanism and checks if it is correct to make a synchronous H2Dmemcpy command asynchronous. • We propose a runtime technique that subdivides consecutive H2Dmemcpy commands into smaller memory copy requests and executes the requests from different commands in a round-robin manner. As a result, the kernel execution can be scheduled as early as possible to maximize its overlap with the H2Dmemcpy commands. • We evaluate HUM using 51 CUDA benchmark applications from Parboil [36] , Rodinia [4] , and CUDA Code Samples [25] . The evaluation result shows that executing the applications under HUM is, on average, 1.21x faster than executing them under original CUDA. The speedup is comparable to the average speedup of 1.22 that is obtained by manually porting and optimizing the applications with Unified Memory.
the GPU memory without the need to manually copy data from one to the other. UM behaves as if the programmer had a single address space between the host and the GPU [13] . It allows a CUDA application to allocate memory objects that can be read or written from both the host and the GPU. As shown in Figure 1 (a), physical memory spaces are allocated to UM in both the host side and the GPU side. Pages in the host side space are pinned. UM page tables in the host side and the GPU side are managed by the CUDA runtime. To allocate a UM object, the CUDA program invokes cudaMallocManaged(), an allocation function that returns a pointer to the memory object. The pointer is accessible from both the host and the GPU. However, the memory object may not be physically allocated when the call to cudaMallocManaged() returns. In other words, the pages and page table entries of the memory object may not be created until it is accessed by the GPU or the CPU.
Pages in a UM object are automatically migrated between the host and the GPU on demand. This automatic page migration exploits page faults. The host reads and writes pages in the host memory and the GPU reads and writes pages in the device memory. The CUDA runtime takes care of the page migration, hence there is no need to call cudaMemcpy() or cudaMemcpyAsync() at all.
For example, suppose that a UM object has been allocated by cudaMallocManaged() and that the host has accessed two pages of the object, page 1 (at virtual address 0x3900d0000) and page 2 (at virtual address 0x3900d1000). Figure 1 (a) shows the current status of page tables and physical memory spaces of UM. Now, suppose that the GPU accesses page 2 at virtual address 0x3900d1000. Since page 2 is not residing in the GPU side, a page fault occurs and a page fault interrupt signal is raised. The page fault is handled by the NVIDIA display driver. It catches the signal and migrates the faulted page, page 2, between the host UM space and the GPU UM space as shown in Figure 1 (b). Then, it makes the GPU replay the access. To avoid excessive page faults, the NVIDIA driver uses some heuristics for the page migration [13] .
HUM exploits the same page fault mechanism to detect the case when a CUDA kernel accesses a UM page that has not been transferred from the host UM space to the GPU UM space.
Representative CUDA commands [27] used in this paper are summarized in Table 1 . The way of handling other CUDA memory management commands by HUM is similar to the way of handling those listed above.
Design and Implementation
In this section, we present the design and implementation of HUM. HUM exploits the page fault mechanism of UM to automatically overlaps host-to-device memory copy (H2Dmemcpy) (a) After the host has accessed page 1 and page 2.
(b) After the GPU has accessed page 2. cudaError_t cudaMalloc(void** devPtr, size_t size) allocates size bytes on the device and then returns in *devPtr a pointer to the allocated memory. It is a synchronous function. cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind) copies count bytes from the memory area pointed to by src to the memory area pointed to by dst, where kind specifies the direction of the copy. We are interested in cudaMemcpyHostToDevice as the value of kind in this paper. cudaError_t cudaMemcpyAsync(void* dst, const void* src, size_t count, cudaMemcpyKind kind, ...) behaves the same as cudaMemcpy() except that it is asynchronous with respect to the host. cudaError_t cudaMallocManaged(void** devPtr, size_t size, ...) allocates size bytes on the device and returns in *devPtr a pointer to the allocated memory that is automatically managed by the UM system. cudaError_t cudaMemPrefetchAsync(const void* devPtr, size_t count, int dstDevice, ...) prefetches UM memory to the specified destination device. devPtr is the base pointer of the UM memory space to be prefetched and dstDevice is the destination device. count specifies the number of bytes to prefetch. It is asynchronous with respect to the host. and host computation or H2Dmemcpy and kernel computation without any code modification.
As shown in Figure 2 , HUM consists of two components: HUM runtime and HUM driver. The NVIDIA driver[28] is a part of the CUDA framework that bridges the CUDA runtime and NVIDIA GPUs. It resides in the kernel address space. Similar to the NVIDIA driver, the HUM driver resides in the kernel address space. It intercepts signals going into the NVIDIA driver and takes some actions. Then, it calls appropriate NVIDIA driver functions for the signals if needed. The HUM runtime is a thin layer that schedules CUDA commands and offloads command execution to the original CUDA runtime. It provides wrapper functions of CUDA API functions and interacts with the HUM driver and the CUDA runtime.
In CUDA, a stream is a sequence of commands that execute in issue-order on the GPU [12] . Commands in different streams may execute out of order with respect to one another or concurrently. When a CUDA program generates a request to create a new stream, the HUM runtime creates a stream object that is a wrapper of a new CUDA stream and provides it to the CUDA program. The HUM stream object is managed by HUM, and a host thread in the HUM runtime, called the command scheduler, periodically visits all existing streams in a round-robin manner. The HUM runtime also has several worker threads. When the command at the front of each stream is ready to execute, the command scheduler takes it from the stream and dispatches it to a worker thread. The worker thread executes the command (note that the command is actually the wrapper function of a CUDA command) and enqueues the CUDA command to the CUDA stream managed by the CUDA runtime. Finally, the CUDA runtime executes the command.
Basically, a wrapper in the HUM runtime for a CUDA API function calls the original CUDA API function. For example, when the host program calls cudaGetDeviceCount(), which is a wrapper in the HUM runtime and returns the number of available devices, the wrapper calls original CUDA cudaGetDeviceCount(). Exceptions are the cases when the host program calls cudaMalloc() or cudaMemcpy(). When the host program calls cudaMalloc(), the HUM runtime allocates a memory space in the UM region by invoking CUDA cudaMallocManaged() to exploit the page fault mechanism of UM. This is why our framework is called HUM (Hidden Unified Memory). When the host program calls cudaMemcpy() for host-to-device memory copy, the wrapper in the HUM runtime invokes a custom memory copy function implemented in the HUM driver. Details are discussed in Section 3.4. 
Overlapping H2Dmemcpy and Computation
Synchronous H2Dmemcpy. Figure 3 shows some examples of the memory copy commands. In Figure 3 (a), the CUDA program allocates a host memory space, say hA, pointed to by hostA using malloc() in line 2 and a device memory space, say dA, pointed to by devA in line 3. It writes some data to hA in line 4. Then, it copies the contents of hA to dA by invoking synchronous cudaMemcpy() in line 5. After the memory copy has completed and some host computation has been performed in line 6, a kernel MyKernel that accesses dA is launched in line 7. Figure 3 (b) shows the timeline of executing the code in Figure 3 (a) under CUDA.
When the same code is executed under HUM, cudaMemcpy() returns immediately after initiating the memory copy even though the copy has not completed. This enables overlapping the memory copy in line 5 and the host computation in line 6. It may further overlaps the memory copy in line 5 and the kernel execution in line 7. Figure 3 (c) shows the timeline of executing the code in Figure 3 (a) under HUM. Compared to the timeline under CUDA in Figure 3 (b), cudaMemcpy() in line 5 is fully overlapped with the CPU computation in line 6 and partially overlapped with the kernel computation in line 7. As a result, the total execution time is significantly reduced.
Even though the kernel starts its execution before the memory copy in line 5 completes, the kernel correctly executes under HUM. The reason is that a page fault is raised at the device side when the kernel accesses a page that has not been copied yet to the device side. The page fault is handled by the HUM driver and it makes the kernel waits until the faulted page is copied to the device side. Then the page access request from the kernel is replayed.
However, if the host computation in line 6 modifies hA, the memory copy in line 5 and the host computation in line 6 may not be overlapped to guarantee data consistency and correctness. In this case, the timeline of executing the code in Figure 3 (a) under HUM is the same as that under CUDA in Figure 3 (b). The HUM runtime detects such a case using a simple runtime technique. The technique will be described later in Section 3.2.
Asynchronous H2Dmemcpy. In Figure 3 (d), the CUDA program calls asynchronous cudaMemcpyAsync() in line 5, hence the memory copy is performed in the background. As a result, the host side computation in line 6 can be overlapped with the memory copy in line 5. However, the kernel launched at line 7 cannot be overlapped with the memory copy in line 5 because all tasks placed in one stream are executed sequentially (the default behavior of CUDA). When the same code is executed under HUM, even if the asynchronous memory copy in line 5 has not finished yet, the GPU may start executing the kernel in line 7. This enables overlapping the H2Dmemcpy and the kernel computation for the same reason as the case of overlapping the synchronous H2Dmemcpy and computation mentioned above. Figure 3 (f) shows the timeline of executing the code in Figure 3 (d) under HUM. As a result, we see that the total execution time is significantly reduced.
Note that even though the HUM runtime overlaps the H2Dmemcpy and the host or kernel computation, it preserves the CUDA semantics of synchronization commands, such as cudaDeviceSynchronize(). cudaDeviceSynchronize() in the HUM runtime is also a wrapper function and invokes the original CUDA command. 01: ... 02: host_A = malloc(size); 03: cudaMalloc(&dev_A, size); 04: ... // write to host_A 05: cudaMemcpy(dev_A, host_A, size, cudaMemcpyHostToDevice); 06: ... 07: ... // write to host_A or free host_A 08: ... 09: MyKernel<<<...>>>(dev_A); 10: ... 
Data Consistency and Correctness
Consider the CUDA program in Figure 4 . After performing cudaMemcpy() to copy the contents of the memory object, say hA, pointed to by host_A to the device memory object, say dA, pointed to by dev_A in line 5, the program modifies the contents of hA or frees hA in line 7. Under the CUDA semantics, this program has no problem at all. However, it may cause a problem under HUM. The data transfer caused by cudaMemcpy() to the device may still continue when the contents of hA is modified in line 7. Thus, the device may receive some pages that contain the modified contents. As a result, the kernel may access inconsistent and incorrect data.
To solve this problem, the HUM runtime exploits the access protection of pages using a POSIX function mprotect() [9] that changes the access protection of the memory pages of the calling process. When the H2Dmemcpy caused by cudaMemcpy() or cudaMemcpyAsync() is initiated, the HUM runtime changes the protection of pages in the source host memory object to read-only. For example, the protection of the pages in the object pointed to by host_A in Figure 4 is changed to read-only when the H2Dmemcpy of cudaMemcpy() is initiated.
After copying all pages in the source host memory object finishes, the HUM runtime tries to restore the protection. To do this, the HUM runtime first looks up all scheduled host-to-device memory copy commands and checks if there are commands that have an overlapping range of source addresses. If not, the HUM runtime restores the protection. Otherwise, the HUM runtime restores the protection of nonoverlapping source memory regions only. The protection of overlapping source memory regions will be restored later when the following scheduled memory copy command finishes.
When the CUDA program in Figure 4 modifies a page in hA in line 7 in a manner (e.g., write) that violates the protection, the linux kernel generates a SIGSEGV signal. The signal handler installed by the HUM runtime handles the signal. When it receives the signal, it waits until all H2Dmemcpy commands for the page complete and the protection is restored. This method allows the HUM runtime to execute H2Dmemcpy commands in an asynchronous manner without any data consistency violation or any segmentation fault.
HUM Driver
Intercepting interrupts. To overlap H2Dmemcpy and kernel execution, HUM makes the GPU pend when the page accessed by the GPU has not been transferred to the GPU yet. In this case, a GPU page fault occurs in HUM. The HUM driver handles the page fault. The HUM driver hooks the interrupt handler of the NVIDIA display driver and intercepts the page fault signal. In Linux for the x86 architecture, the interrupt descriptor table (IDT) contains all information about interrupts, such as interrupt number, interrupt name, address of the interrupt handler, interrupt flags, etc. When the HUM driver is installed, HUM looks up the existing IDT entries and finds the entry for the NVIDIA interrupt handler. HUM replaces the entry with the information of its own interrupt handler. Handling page faults. Figure 5 shows the actions occurring when the HUM interrupt handler handles a page fault in the GPU side. When the HUM interrupt handler receives an interrupt signal, it checks the fault buffer in the GPU if there is a pending GPU page fault. The fault buffer is a circular queue implemented in the GPU by NVIDIA. It stores page faults information from the GPU. If there is no pending fault in the fault buffer, the HUM interrupt handler invokes the original NVIDIA interrupt handler because the interrupt is not a page fault and there is nothing to do for the HUM interrupt handler. Otherwise, it checks whether faulted pages are allocated through cudaMalloc() call or cudaMallocManaged() call in the host program.
For pages that are allocated through cudaMalloc() call, the HUM driver handles the fault. The HUM interrupt handler waits until all the faulted pages arrive and are mapped to the GPU. Then, the HUM driver sends a replay signal to the GPU so that the GPU replays the faulted memory accesses. For pages that are allocated through cudaMallocManaged() call, the HUM driver calls the fault handling routine in the CUDA display driver. Thus, the programmer can use both cudaMalloc() and cudaMallocManaged() in their host program without causing any problem. However, HUM does not optimize the memory transfers for the memory regions that are allocated through cudaMallocManaged() calls in the host program.
HUM H2Dmemcpy Mechanism
When the GPU accesses a page that has not been copied from the host side to the GPU side, the HUM runtime makes the GPU waits until the page arrives. As a result, a kernel can be executed even the transfer of the data to be accessed by the kernel is still ongoing. However, to implement the H2Dmemcpy in HUM, we may not use cudaMemcpy() and cudaMemcpyAsync() because they cause a serious interrupt handling problem.
Problems of CUDA memory copy commands. For example, suppose that the HUM runtime uses cudaMemcpy() to copy data from the host to the device and that the GPU is trying to read a page that has not been yet copied to the GPU side. Then, a read page fault is raised and the HUM driver catches it. The HUM driver waits until the page comes to the GPU side. Calling cudaMemcpy() triggers write page fault when the page arrives at the GPU because the page has not been mapped to the GPU yet. The HUM driver catches the write page fault and maps a blank page to the GPU UM space. Then, it sends a replay signal to the GPU. This makes the GPU reads stale data in the blank page. In turn, the page arrived updates the GPU UM space. Since interrupts caused by memory requests are processed sequentially one by one in the GPU, the kernel reads the stale data in the blank page first, and the page update by the memory copy follows this read. To get the correct result, the memory copy should have completed before the kernel reads the stale page. However, changing the order of interrupt processing is not supported by the current NVIDIA driver.
HUM H2Dmemcpy functions.
To solve this problem, the HUM driver has its own H2Dmemcpy function. Figure 6 shows how the HUM H2Dmemcpy function works. A CUDA program first writes data to the host memory space that is generally allocated through malloc() ( 1 ) . Suppose that the program uses cudaMemcpy() or cudaMemcpyAsync() to perform the H2Dmemcpy. As mentioned before, the HUM runtime implements wrappers of cudaMemcpy() and cudaMemcpyAsync(). In the wrappers, the HUM runtime calls the HUM driver rather than calling the original CUDA cudaMemcpy() or cudaMemcpyAsync().
The HUM driver first copies the data from the host memory space to the host UM space ( 2 in Figure 6 ). This makes pages and page table entries of the memory object to be created on the host side, effectively increasing the required memory space on the host side as much as the size of the host UM space. Then, it invokes the page migration function provided by the NVIDIA driver to migrate the pages in the host UM space to the GPU UM space ( 3 ). To use the migration function, source pages of the migration must reside in the host UM space. The page migration function is synchronous and migrates maximum 512 pages at a time, i.e., maximum 2 MB at a time. When the migration completes, the pages are mapped to the GPU, and the GPU can access the pages without any page fault ( 4 ) .
When there is a H2Dmemcpy request of size M MB (M > 2), the HUM driver divides the request into multiple requests of size 2 MB. We take the maximum size because frequent memory-copy requests cause heavy copy initiation overhead. 01: ... 02: host_A = malloc(size); 03: host_B = malloc(size); 04: host_C = malloc(size); 05: ... // write to host_A and host_B 06: cudaMalloc(&dev_A, size); 07: cudaMalloc(&dev_B, size); 08: cudaMalloc(&dev_C, size); 09: ... 10 
Parallelizing Memory Copy Commands
Consider a vector addition CUDA program in Figure 7 . It adds two vectors A and B, and the result is stored in vector C. Figure 8 shows timelines of executing the program. Since the memory copy command and the kernel execution command are issued in the same stream to guarantee correctness, they are sequentially executed as shown in Figure 8 (a) under CUDA semantics.
The timeline of executing the vector addition program under the HUM design discussed so far is shown in Figure 8(b) . HUM may execute the kernel as early as possible when the memory copy for vector B has initiated. As a result, the time when the kernel completes under HUM maybe much earlier than that under normal CUDA. Since the page migration function provided by NVIDIA driver used in the HUM H2Dmemcpy function is synchronous, the memory copy for vector B has to be initiated after the memory copy for vector A has completed.
Using the HUM H2Dmemcpy function, the time spent on memory copying is much larger than using cudaMemcpy() or cudaMemcpyAsync(). This is because HUM copies the data twice: from the host memory space to the host UM space, and then to the GPU UM space.
To reduce the copy time from the host memory to the host UM space ( 2 in Figure 6 ), HUM exploits multiple host threads for the memory copy. The multiple threads simultaneously copy different parts of the source host memory to the host UM space. HUM divides the source host memory object into multiple 2MB chunks and each thread takes care of copying a 2MB memory chunk to the host UM space at a time. While this approach reduces the copy time from the host memory to the host UM space, it may result in interference with other CPU threads depending on the application.
Scheduling Memory Copy Commands
When more than one CUDA H2Dmemcpy commands are issued consecutively from a CUDA program, the HUM runtime copies their divided 2MB chunks from the host UM space to the device UM space in a round-robin manner. In the HUM runtime, there is a pool of page migration queues (PMQs) to queue the page migration requests of 2MB chunks. Moreover, there exists a different PMQ for each CUDA H2Dmemcpy command issued.
For a H2Dmemcpy command from the CUDA program, after dividing the source host memory object into 2MB chunks and copying them to the host UM space with multiple threads, the page migration request of each chunk from the host UM space to the GPU UM space is inserted in the associated PMQ. A host thread called the page migration thread (PMT) is taking care of visiting non-empty PMQs in the pool in a round-robin manner. The PMT processes the page migration request at the head of each PMQ by calling the page migration function provided by the NVIDIA driver.
In this case, there must not exist any dependence between destination locations of the consecutively issued CUDA H2Dmemcpy commands. Since the HUM runtime has all information about the CUDA H2Dmemcpy commands issued from a CUDA program, it performs a simple and conservative address range overlapping check between the destinations of memory copy commands. Note that at run time, the real addresses are known. When CUDA H2Dmemcpy is enqueued to the HUM runtime, the runtime checks if there is an in-flight memory copy command that has an overlapping range of destination addresses with the enqueued command. If there is no overlapping, the HUM runtime schedules the memory copy command normally. Otherwise, it pends scheduling the memory copy command until the in-flight memory copy command finishes. However, such dependences are hardly found in real applications (none in our benchmark applications).
By doing so, we can schedule the kernel launch as early as possible. As a result, the kernel may access required pages sooner and its execution may finish earlier. This case is illustrated in Figure 8 (c). The kernel execution can be initiated after the execution of the H2Dmemcpy command of the vector B has been initiated. In general, with regards to H2Dmemcpy commands, the execution of a kernel command K under HUM can be initiated as early as possible at the time point that satisfies all of the following conditions:
• The last command preceding K in the same stream is a CUDA H2Dmemcpy command, say C, on which K's arguments depend. • The execution of C has been initiated.
• All target pages of C in the device UM space have been unmapped once to the GPU after the initiation of executing C. 
Evaluation
In this section, we evaluate HUM with various GPU applications and analyze the results. We compare the performance of HUM with that of manual optimizations.
Methodology
System configuration. We use NVIDIA Tesla V100 (Volta architecture [24] ), Tesla P100 (Pascal architecture [29] ), and GeForce RTX 2080 (Turing architecture [30] ) GPUs for our experiment. Detailed system configuration is summarized in Table 2 .
Benchmark applications. We use 51 applications from various sources: 11 applications from Parboil [36] , 22 applications from Rodinia [4] , and 18 applications from CUDA Code Samples [25] . While we use all the applications from Parboil and Rodinia, we choose only 18 out of 170 applications in CUDA Code Samples. We exclude 152 applications in CUDA Code Samples because of the following reasons:
• They use CUDA graphics or driver API, • They have neither CUDA kernel execution nor H2Dmemcpy, • They use additional CUDA libraries (cuBLAS, cuFFT, cuSPARSE, cuSOLVER, nvGRAPH), • They appear in Parboil or Rodinia, or • Their kernel execution times are too small (less than 1ms) to see the effect of overlapping H2Dmemcpy and CUDA kernel computation.
We exclude applications that use additional CUDA libraries because the HUM runtime has to provide wrapper functions for all library functions used in the applications. These applications mainly focus on showing the functionality of the libraries. While there is no technical difficulty to implement the wrapper functions, we decide not to include those applications because it is too time-consuming. Figure 9 shows the characteristics of applications in each benchmark suites. The column Sync or async shows the type of H2Dmemcpy commands each application uses. The column CPU/H2D overlap shows if the application is designed to overlap CPU computation and H2Dmemcpy. The column CPU/H2D overlap (HUM) shows if HUM can overlap the CPU computation and the H2Dmemcpy.
Most of the applications use synchronous H2Dmemcpy and hence, they are unable to overlap CPU computation and H2Dmemcpy when running under normal CUDA. On the other hand, HUM can overlap the CPU computation and the H2Dmemcpy in most of the cases except some applications that modify the contents of the source host memory object of the H2Dmemcpy or frees it after the H2Dmemcpy (lbm and tpacf in Parboil, cfd and leukocyte in Rodinia).
Suite
No. Name Sync or async CPU/H2D overlap CPU/H2D overlap (HUM) Figure 9 . Characteristics of applications.
We use the largest dataset that fits in the GPU memory for each application, hence, most of the datasets used for the experiment are hundreds of megabytes to a few gigabytes. As the goal of HUM is performance improvement without any code modification, no source code of the applications is modified.
Results
Speedup on V100. Figure 10 shows the speedup of each application with various optimization schemes on a single V100 GPU. The speedup is obtained over running the original version of each application (this setup is called CUDA hereafter). The optimization schemes are described as follows:
• CUDA-async is a manually optimized version where synchronous memory copy functions in the original application is transformed to corresponding asynchronous ones when the transformation is safe. • CUDA-UM is a naive UM implementation of each application. We change all cudaMalloc() functions to cudaMallocManaged(). Then, we remove all CUDA memory copy functions, such as cudaMemcpy() and cudaMemcpyAsync(), because data will be automatically transferred between the host and the device by CUDA UM. • CUDA-UM-opt is a manually optimized version of CUDA-UM using user hints (e.g., cudaMemPrefetchAsync() and cudaMemAdvise()).
We add cudaMemPrefetchAsync() as early as possible before the CUDA kernel launch so that memory copy and kernel computation can be overlapped. cudaMemPrefetchAsync() is also used to map blank pages to the GPU if the pages are first accessed for write by the GPU. This prevents excessive write page faults in the GPU side. We add cudaMemAdvise() to avoid page migration if the pages are read by both the CPU and the GPU without any write (i.e., read-only accesses). • HUM-no-sched runs the applications under HUM without any H2Dmemcpy command scheduling described in Section 3.6. • HUM runs the applications under HUM with all the HUM techniques described in Section 3.
The number of memory-copy threads mentioned in Section 3.6 is set to eight in both HUM-no-sched and HUM. On the other hand, some applications show very good speedup under HUM and CUDA-UM-opt. The applications sgemm and spmv in Parboil, b+tree, hybridsort, and leukocyte in Rodinia have enough kernel computation time to hide the H2Dmemcpy time. The application huffman in Rodinia Figure 10 . Speedup of each application with a single V100 GPU.
mainly benefits from overlapping the H2Dmemcpy and the host computation. CUDA-UM-opt is much better than HUM in BlackScholes, vectorAdd, and warpAggergatedgAtomicsCG in CUDA Code Samples. This is due to the prefetching heuristics used in the NVIDIA driver for page migration. When a GPU page fault occurs, the NVIDIA driver actively prefetches some pages around the faulted page from the host UM space to the GPU UM space according to the prefetching heuristics (note that the heuristics are not publicly known).
CUDA-async is a little bit better than CUDA for Parboil on average, but there is no difference between CUDA-async and CUDA for Rodinia and CUDA Code Samples on average. This is because few applications in Rodinia and CUDA Code Samples have some host computation to hide between the H2Dmemcpy command and the kernel launch command.
CUDA-UM is a little bit better, on average, than CUDAasync for Parboil and Rodinia because of the prefetching heuristics used in the NVIDIA driver for the Unified Memory. CUDA-UM is much worse than CUDA-async for CUDA Code Samples on average because of SobolQRNG. In SobolQRNG, CUDA-UM is 88 times slower than CUDA-async. The 4GB write-only data accessed by the kernel in SobolQRNG incur a lot of page faults in the GPU side for CUDA-UM. This does not happen for CUDA-UM-opt because to avoid the write page faults, CUDA-UM-opt maps the data pages to the GPU using cudaMemPrefetchAsync() before the kernel execution is initiated.
While CUDA-UM-opt achieves the average speedup of 1.22x for all applications, the average speedup of HUM is 1.21x (1.20x for Parboil, 1.26x for Rodinia, and 1.13x for CUDA Code Samples). Thus, the speedup under HUM is comparable to that of CUDA-UM-opt.
Speedup on P100 and RTX 2080. We also evaluate HUM using an NVIDIA P100 GPU and an NVIDIA RTX 2080 GPU. Figure 11 and Figure 12 show the speedup of each application with various optimization schemes on P100 and RTX 2080, respectively. The trend in these results are similar to the result of V100. On P100, CUDA-UM-opt achieves an average speedup of 1.12x and HUM achieves an average speedup of 1.11x over CUDA. On RTX 2080, CUDA-UM-opt achieves an average speedup of 1.13x and HUM achieves an average speedup of 1.12x over CUDA. This indicates that HUM also achieves comparable performance to CUDA-UMopt on both P100 and RTX 2080.
Effect of H2Dmemcpy command scheduling. On V100, HUM-no-sched is slower than HUM consistently. Even HUM-no-sched is slower than CUDA for some applications. One such a case is when the memory-copy time dominates the execution time. When the kernel computation time is not large enough, overlapping the H2Dmemcpy and the kernel computation cannot fully amortize the slowdown in H2Dmemcpy due to copying the memory object twice from the source host memory space to the host UM space, and then from the host UM space to the device UM space. tpacf in Parboil, nn, pathfinder, and srad in Rodinia, BlackScholes, mergeSort, scalarProd, and threadFenceReduction in CUDA Code Samples fall in this category. Another case is when the CUDA kernel launch is not scheduled as early as possible. spmv in Parboil, b+tree in Rodinia, matrixMul and vectorAdd in CUDA Code Samples fall in this category. For example, as mentioned in Section 3.6, in vectorAdd, there are two memory objects to transfer from the host to the device (vector A and vector B) . Without the memory-copy command scheduling, the kernel execution cannot be scheduled until the entire vector A has been copied to the device. The number of memory-copy threads. As mentioned in Section 3.5, HUM uses multiple threads to copy the source host memory object to the host UM space to execute a H2Dmemcpy command. To find the optimal number of threads, we vary the number of memory-copy threads from 1 to 16 and measure the overall performance. Here, we divide the source host memory object in N chunks when there are N memory-copy threads. Each thread takes care of a chunk of the source host memory object and copies the chunk simultaneously. Figure 13 shows the average speedup obtained over one thread for each benchmark suite on V100. We see that, on average, eight is the optimal number of memorycopy threads.
Multi-GPU environments. To show that HUM works well with multi-GPU environments, we choose the applications whose speedup under HUM with a single GPU is greater than 1.10 and whose workload can be easily distributed across multiple GPUs. These applications include sgemm in Parboil, and matrixMul, MC_SingleAsianOptionP, and vectorAdd in CUDA Code Samples. We implement the multi-GPU version of them. In addition, we choose MonteCarloMultiGPU in CUDA Code Samples because it is originally designed to support multiple GPUs. Figure 14 shows the speedup obtained by varying the number of GPUs for these applications. We do not vary the workload for multiple GPUs, hence Figure 14 shows the result of strong scaling for both CUDA and HUM. The speedup is obtained over the case of a single GPU for each of CUDA and HUM. The result indicates that HUM achieves scalable performance in the multi-GPU environment. The major reason for this strong scaling is that page faults occurred in different GPUs are handled by different host threads.
Related Work
There are some previous studies related to CUDA Unified Memory (UM) [1, 3, [20] [21] [22] . Landaverde et al. [20] and Li et al. [21] evaluate the performance of UM using Parboil [36] and Rodinia [4] benchmark suites. As Parboil and Rodinia do not provide UM version, they make the UM version of the benchmark applications on their own and compare their performance with the existing non-UM version. The results from both studies show that the UM version is slower than the non-UM version. One of the reasons of the slowness is that both studies use GPUs of the NVIDIA Kepler architecture [23] that does not fully support UM. Kepler architecture does not support GPU page faults. Moreover, in the Kepler architecture, all pages in the host UM space have to be migrated to the GPU UM space before a kernel is executed even if some pages are not actually accessed by the kernel.
Awan et al. propose OC-DNN [1] that exploits UM on GPUs of NVIDIA Pascal and Volta architectures [24, 29] . They port one of the well-known DNN frameworks, Caffe [34] , to UM and optimize it manually by adding various CUDA user-hint API functions. OC-DNN provides comparable performance to Caffe for popular Deep Neural Networks (DNNs), such as AlexNet [19] , GoogLeNet [37] , VGG-19 [35] , and ResNet-50 [14] . HUM exploits CUDA UM to automatically overlap the H2Dmemcpy and the computation of the host or the device without exposing UM to the programmer and without hurting performance.
Markthub et al. propose DRAGON[22] that allows NVM storage to be directly mapped to GPUs by exploiting UM. As the amount of virtual memory that is available for UM is limited by the physical host memory size, DRAGON supports a larger memory size for UM backed by NVM storage. It eliminates the need for manual buffer management for GPU kernels that access data larger than the GPU or host memory. Brokhman et al. propose GAIA [3] that is similar to DRAGON. While DRAGON exclusively focuses on allowing NVM storage to be directly accessible from the GPU, GAIA adds more optimizations, such as lazy release consistency protocol (LRC) and optimized OS read-ahead prefetcher.
Both DRAGON and GAIA are similar to HUM in that they exploit UM and modify the NVIDIA GPU driver to handle page faults. They also use prefetching techniques to overlap data transfers and kernel computation. However, their major goal and techniques are significantly different from those of HUM. First, the goal of DRAGON and GAIA is to support a large memory size that is backed by storage with minimum I/O overhead. Their GPU driver captures a page fault and manages page migration between the GPU and the storage device. On the other hand, the goal of HUM is to hide H2Dmemcpy time by overlapping the copy with the host computation or the kernel computation. The role of the GPU driver in HUM is to make the GPU pending to wait for the H2Dmemcpy when a page fault occurs during the kernel computation.
Second, while both DRAGON and GAIA require sourcecode modification to map the storage space to the GPU, HUM does not need nay source-code modification. GAIA requires more source-code modifications than DRAGON to control memory consistency using the LRC protocol.
Third, both DRAGON and GAIA use the read-ahead prefetcher provided by the operating system. This allows reading consecutive blocks ahead on a page fault. As their prefetcher starts prefetching on a page fault, they require a user to explicitly add prefetching hints if the user wants to eagerly copy the pages. On the contrary, HUM does not need any user hints. HUM proposes prefetching all pages specified in the memory copy command and their scheduling techniques to maximize the performance.
Finally, target applications are different. Both DRAGON and GAIA target CUDA programs that are designed to use UM. However, HUM targets traditional CUDA programs that use explicit device memory allocation (e.g., cudaMalloc()) and explicit device memory copy (e.g., cudaMemcpy()). Even though HUM targets traditional CUDA programs, it also correctly works with the CUDA programs that use UM.
Many studies have been performed to detect memory reuse [5, 15, 16, 39] . All these previous studies focus on data-reuse analysis at compile time. Cong et al. [5] and Issenin et al. [15, 16] statically analyze data reuse and try to hide memory latency by placing frequently reused data in scratchpad memory. HUM is different from those previous approaches in that it detects modifications to previously defined data. Moreover, it performs the detection at run time and exploits the segmentation fault mechanism.
Many techniques for overlapping host-GPU data transfers and GPU kernel computation have been proposed [2, 10, 17, 18, 32, 33] . While they require a user to manually overlap the data transfers and the kernel computations, our framework automatically does it without any code modification.
Overlapping communication and CPU/GPU computation in a cluster has also been widely studied [6-8, 11, 38] .
White III and Dongarra [38] show the effect of overlapping CPU/GPU computation, inter-node communication, and CPU-GPU communication. Danalis et al. [6] , Fishgold et al. [8] , and Danalis et al. [7] introduce compiler techniques that transform MPI code to overlap inter-node communication and CPU computation. Gysi et al. [11] propose a framework that automatically overlaps inter-node communication and GPU computation. HUM focuses on automatic overlapping of data transfers and GPU computation in a node by exploiting Unified Memory.
Conclusions
HUM hides the host-to-device memory copy time by automatically overlapping it with the host computation or the kernel computation. It exploits CUDA Unified Memory and fault mechanisms of both the host and the GPU. HUM's Unified Memory is hidden to the programmer and there is no need to modify the source code.
Since the CUDA runtime and some part of the CUDA display driver are proprietary, we implement the proposed techniques in the HUM runtime and driver that exploit the CUDA runtime and driver. As the HUM driver just changes the address of the interrupt handler in the interrupt descriptor table, it does not cause any overhead when intercepting the interrupt signals. This allows the user to easily turn on and turn off the functionality of HUM.
With 51 applications from Parboil, Rodinia, and CUDA Code Samples benchmark suites, we evaluate HUM. We compare their performance under HUM with that of their handoptimized implementations. The evaluation result shows that HUM is quite effective and practical. On average, HUM achieves 1.20x for applications in Parboil, 1.26x for Rodinia, and 1.13x for CUDA Code Samples. The average speedup of all applications under HUM is 1.21, which is comparable to the average speedup 1.22 of the hand-optimized implementations for Unified Memory.
