Abstract. With the advent of the big data era, highly efficient and scalable join algorithms are becoming increasingly essential for database operations. As a result, recent years have witnessed strong momentum in accelerating join algorithms with multi- and many-core processors. Among the various acceleration platforms, GPUs have the advantage in raw computing power and scalability. The hash join problem, however, poses unique challenges for effective GPU implementations. In particular, a complete treatment of the problem that systematically considers GPU architectural details and input characteristics is still missing. In this work, we built a GPU-based testbed to systematically study the performance tradeoffs in developing highly efficient GPU implementations of hash join. On this basis, we investigated a set of essential building blocks, including data transfer mechanisms between host (CPU) and device (GPU) that take advantage of the PCI-E bandwidth, a streaming scheme that overlaps data transfer with kernel execution, and an atomic-free transformation that minimizes costly synchronization overhead. By integrating these blocks, we raise hash join performance to a new level. The experimental results show that our GPU implementation of hash join outperforms the state-of-the-art results by up to 111%. We also propose a framework to guide the selection of optimization strategies.
Introduction
The join operation is a fundamental part of modern database systems. Owing to its performance advantages over other join algorithms such as nested-loops join and sort-merge join, hash join [1] has become one of the most popular join algorithms. Today the hash join operation is even more critical because of its wide deployment in main-memory databases [2]. As a result, significant effort has been dedicated to improving the performance of hash join. The existing work can be categorized into two orthogonal directions: algorithmic improvements and parallel execution. The former family of techniques, such as the grace hash join [3] and the hybrid hash join [4], overcomes the limitations of the original hash join, while the latter category of approaches draws on the computing power made available by multi-core CPUs [5], many-core platforms exemplified by graphics processing units (GPUs) [6], [7], and heterogeneous platforms such as FPGAs [8] and accelerated processing units (APUs) [9]. In this work, we focus on the parallel computing approach on GPUs.
Among the various parallel computing platforms, GPUs offer the unique advantage of raw computing power and scalability [10]. In addition, GPUs are equipped with fast-maturing programming environments such as CUDA and OpenCL, which dramatically reduce the coding effort. He et al. [6] pioneered the application of the CUDA programming infrastructure to the hash join problem and demonstrated its significant performance potential. Since that study, many new features that turned out to be essential for hash join performance have been introduced to GPU hardware, and researchers have developed various solutions to keep up with these innovations in GPU functionality. Kaldewey et al. [7] showed that UVA (Unified Virtual Addressing) can make optimal use of the available PCI-E bandwidth and enhance the overall performance. However, a complete treatment of the problem that systematically considers GPU architectural details and input characteristics is still missing. In particular, the overall performance depends on the synergy of the algorithmic components, so the different points of improvement have to be evaluated in a systematic framework.
To address the above problem, we revisit the GPU-accelerated main-memory hash join problem in this paper by constructing a systematic testbed in which different algorithmic enhancements can be analyzed in an integrated manner. We analyze the major components of the GPU-based main-memory hash join algorithm along multiple dimensions and devise state-of-the-art optimization solutions. The strategies are integrated into a GPU-based testbed for performance evaluation. The experimental results show that our GPU implementation of hash join outperforms the state-of-the-art GPU results by at least 65% and up to 111%.
A Testbed for GPU Accelerated Hash Join

Architecture
In this work, we use the simple hash join (SHJ) listed in Algorithm 1, as it is the fundamental form of the hash join algorithms. In practice, the GPU implementation of SHJ outperforms its counterparts for the other variants [7]. As shown in Algorithm 1, SHJ operates on two input tables, R and S. The symbols used in this paper are listed in Table 1.
Table 1. Symbols used in this paper.
Our testbed is based on the typical setup of CPU-GPU heterogeneous platforms, in which the CPU (host) is responsible for data I/O and flow control and the GPU (device) works as a co-processing engine for compute-intensive tasks. The architecture of the underlying platform is shown in Figure 1. The testbed is a high-performance workstation equipped with an Intel(R) Core(TM) i7-5960X CPU and 64GB of DDR3 main memory. An Nvidia GTX 980Ti GPU is attached to the motherboard as a co-processor and accessed via a 16-lane PCI-E 3.0 bus with a theoretical peak bandwidth of 15.75GB/s [11]. We run the experiments on Ubuntu 16.04 with Nsight Eclipse and CUDA 8. The configuration of the hardware and software platform is outlined in Table 2.
Implementation
We developed a GPU-accelerated SHJ algorithm as the baseline implementation, following the techniques proposed by Kaldewey et al. [7]. The processing flow follows the typical host-device computing pattern:
1) Initialize two input tables, R and S
2) Transfer R and S from host (CPU) to device (GPU)
3) Build hash
In our implementation, all input tables (i.e., R and S) are created in main memory. Entries in the tables have the form <id, key>, where both the id and the key are 32-bit unsigned integers, so each entry occupies 8 bytes. We use an open-addressing hash table H and initially set its size to |H| = 2 * |R|, which corresponds to a load factor of 0.5. We use the least significant bits (LSB) of the input key of tables R and S as the hash function and a linear probing strategy to resolve hash collisions.
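The table layout described above can be illustrated with a minimal sequential sketch in Python. It is a CPU-side reference under stated assumptions, not the GPU kernels used in the testbed: the function names `build_hash_table` and `probe` are hypothetical, and the table size is rounded up to a power of two (an assumption made here so that the LSB hash reduces to a bit mask), which keeps the load factor at or below 0.5.

```python
def build_hash_table(r_table):
    """Build an open-addressing table from R, a list of (id, key) tuples."""
    # Assumption: round 2 * |R| up to a power of two so that taking the
    # least significant bits is a single mask operation.
    size = 1
    while size < 2 * len(r_table):
        size *= 2
    mask = size - 1
    slots = [None] * size
    for rid, key in r_table:
        pos = key & mask                 # LSB hash
        while slots[pos] is not None:    # linear probing on collision
            pos = (pos + 1) & mask
        slots[pos] = (rid, key)
    return slots, mask


def probe(slots, mask, s_table):
    """Probe the table with S and return the matching (rid, sid) pairs."""
    result = []
    for sid, key in s_table:
        pos = key & mask
        # Walk the probe sequence until the first empty slot; the load
        # factor of at most 0.5 guarantees that an empty slot exists.
        while slots[pos] is not None:
            rid, rkey = slots[pos]
            if rkey == key:
                result.append((rid, sid))
            pos = (pos + 1) & mask
    return result


if __name__ == "__main__":
    r = [(0, 5), (1, 13), (2, 7)]        # keys 5 and 13 collide in 8 slots
    s = [(10, 13), (11, 5), (12, 99), (13, 7)]
    table, m = build_hash_table(r)
    print(probe(table, m, s))
```

In a parallel version, the slot insertion in the build phase would need an atomic operation or an equivalent conflict-free scheme; the sequential loop here sidesteps that question.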
We use a dataset with the same characteristics as that used by Kaldewey et al. [7] for performance profiling. To profile the performance of hash join operations, we set the initial match rate to 0.03: with probability 0.03 we select a random tuple of R and assign sKey = rKey, while with probability 0.97 we select a random number outside the range of rKey. Without loss of generality, we set |R| ≤ |S|, because the hash table is generally built from the smaller table. After the join, we obtain a result table T in main memory. Given the match rate, it can be estimated that |T| ≈ 0.03 × |S|. We then use NVIDIA's Nsight tool to derive a detailed profile of the GPU-based SHJ algorithm.
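The data generation procedure can be sketched as follows. This is an illustrative Python sketch rather than the generator used in the experiments; the function names, the seed, and the choice of [0, |R|) as the rKey range with unique keys are assumptions made for the example.

```python
import random


def make_tables(r_size, s_size, match_rate=0.03, seed=0):
    """Generate R and S as lists of (id, key) tuples with 32-bit keys."""
    rng = random.Random(seed)
    # Assumption: rKey values are the unique integers 0 .. r_size - 1,
    # shuffled so that their order in R is random. Requires r_size >= 1.
    r_keys = list(range(r_size))
    rng.shuffle(r_keys)
    r_table = list(enumerate(r_keys))
    s_table = []
    for sid in range(s_size):
        if rng.random() < match_rate:
            # Matching tuple: copy the key of a random tuple of R.
            key = r_table[rng.randrange(r_size)][1]
        else:
            # Non-matching tuple: a key outside the rKey range.
            key = rng.randrange(r_size, 2**32)
        s_table.append((sid, key))
    return r_table, s_table


def count_matches(r_table, s_table):
    """Return |T| for unique rKey values: the number of matching S tuples."""
    r_keys = {key for _, key in r_table}
    return sum(1 for _, key in s_table if key in r_keys)


if __name__ == "__main__":
    r, s = make_tables(1000, 100000)
    print(count_matches(r, s) / len(s))  # should be close to the match rate
```

Because the rKey values are unique in this sketch, each matching tuple of S contributes exactly one row to T, which is what makes the estimate |T| ≈ 0.03 × |S| hold.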
Algorithmic Engineering
The above profiling results suggest potential directions of optimizations. Among the four components having an impact on the overall performance, the construction of the hash table can be fully parallelized and thus has a limited optimization space.
GPU Memory Management
The memory for storing the input, output, and hash tables has to be allocated and freed with the standard functions during the join process. Figure 2 shows that such memory management can be expensive in terms of execution time. Figure 3 reports the execution time of the two GPU memory functions with regard to data size. It can be seen that the time of each is approximately proportional to the size of the data. Although memory management can be rather expensive, its overhead can be amortized by allocating enough device memory when the database system starts and reusing the space for multiple join operations. For each reuse, only a memory-clearing operation is required, and its execution time is around 6% of that of the memory allocation.
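The amortization argument can be made concrete with a simple cost model. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, deallocation time is ignored, and the only measured quantity taken from the text is the clearing cost of roughly 6% of the allocation time.

```python
def cost_allocate_every_join(alloc_time, num_joins):
    """Total memory management time when every join allocates afresh."""
    # Assumption: deallocation time is ignored in this model.
    return alloc_time * num_joins


def cost_reuse_buffer(alloc_time, num_joins, clear_fraction=0.06):
    """Total time when memory is allocated once and cleared before reuse."""
    if num_joins == 0:
        return 0.0
    # One allocation up front, then one clearing operation per reuse.
    return alloc_time + (num_joins - 1) * clear_fraction * alloc_time


if __name__ == "__main__":
    alloc_time = 1.0  # normalized allocation time
    for n in (1, 10, 100):
        every = cost_allocate_every_join(alloc_time, n)
        reuse = cost_reuse_buffer(alloc_time, n)
        print(n, every, reuse)
```

Under this model, the per-join management cost approaches 6% of the allocation time as the number of joins grows.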
Data Transfer
An unavoidable problem of GPU computing is that the datasets have to be transferred between CPU and GPU. For large datasets, the respective overhead can be significant. The optimization space of data transfer is elaborated in this subsection.
Pageable Memory vs. Pinned Memory.
Here we compare the transfer time of pageable memory with that of pinned memory. From Figure 4, it can be seen that pinned memory offers considerably shorter transfer times. The experimental results show that the effective bandwidth of pinned memory can be as high as 10.1 to 11.5GB/s, compared with 4.9 to 5.3GB/s for pageable memory. In fact, pinned memory enables up to 98% utilization of the effective bandwidth.
The efficiency of pinned memory comes at a cost, because access to it incurs a fixed overhead, namely the registration and unregistration processes. Figure 5 shows the measured cost of these two operations. Frequent registration and unregistration operations may easily neutralize the performance advantage of pinned memory. Fortunately, this overhead can be amortized in the hash join operation by reserving a large region of pinned memory for the main-memory database during initialization and then reusing that region in subsequent operations.
Pinned Memory vs. UVA.
UVA is based on pinned memory and does not need explicit data transfer requests. The CUDA UVA model provides a shared virtual memory space in which the host memory is mapped into the address space of the device. Data transfer occurs implicitly, and only when a kernel needs to read or write the data. Therefore, UVA can reduce the data transfer management effort and speed up application development.
Since the UVA mechanism allows in-kernel access to the CPU main memory, we measured the total time of the table R transfer and the building kernel to evaluate the performance of pinned memory and UVA. The results are shown in Figure 6. They suggest that pinned memory is about 37% faster than UVA. UVA does not deliver better performance than pinned memory when executing the building kernel, partly because of the random memory access patterns. In fact, Negrut et al. [12] suggest that UVA is preferable when the memory accesses have a high degree of spatial and temporal coherence.
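To give a sense of what these bandwidth figures mean in practice, the following Python sketch converts a bandwidth into a transfer time. It is a back-of-the-envelope estimate, not a measurement: the function name is hypothetical, 1GB is taken as 10^9 bytes, fixed per-transfer latency is ignored, and the 64MB size is simply the largest data size mentioned in the evaluation.

```python
def transfer_time_ms(size_bytes, bandwidth_gb_per_s):
    """Estimate the host-to-device transfer time in milliseconds."""
    # Assumption: 1 GB = 1e9 bytes and no fixed per-transfer latency.
    return size_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    size = 64 * 2**20  # 64MB table
    for label, bandwidth in (("pinned, low", 10.1), ("pinned, high", 11.5),
                             ("pageable, low", 4.9), ("pageable, high", 5.3)):
        print(label, round(transfer_time_ms(size, bandwidth), 2), "ms")
```

With the reported ranges, a transfer from pageable memory takes roughly twice as long as the same transfer from pinned memory.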
Overall Evaluation
In the previous section, we studied the major tradeoffs involved in the design of efficient GPU implementations of hash joins. Based on this analysis, we propose design guidance for selecting optimization strategies for GPU-accelerated hash joins. Figure 7 illustrates the optimization selection process. It allows developers to choose a series of optimized design decisions according to the characteristics of the input data. The baseline GPU implementation has the same parameters and algorithmic flow as those proposed by Kaldewey et al. [7]. The results in Figure 8 show that our implementation outperforms the baseline GPU implementation by at least 65% and up to 111% for data sizes from 2MB to 64MB.
Conclusion
In this work, we performed a systematic analysis of GPU-accelerated hash join operations. By identifying the performance-critical modules and the potential room for optimization, we developed a set of optimization strategies, including data transfer acceleration, streaming to overlap data transfer and kernel execution, and prefix-sum-based atomic-free probing. These strategies are integrated into a GPU-based testbed for performance evaluation. The experimental results show that our GPU implementation of hash join outperforms the baseline GPU implementation by up to 111% for input data sizes from 2MB to 64MB.
