Hash tables are a fundamental data structure for effectively storing and accessing sparse data, with widespread usage in domains ranging from computer graphics to machine learning. This study surveys the state-of-the-art research on data-parallel hashing techniques for emerging massively-parallel, many-core GPU architectures. This survey identifies key factors affecting the performance of different techniques and suggests directions for further research.
INTRODUCTION
The task of searching for elements in a set is a well-studied problem in computer science. Canonical methods for this task are primarily based on sorting, spatial partitioning, and hashing [1]. In searching via hashing, an indexable hash table data structure is used for efficient random access and storage of sparse data, enabling fast lookups on average. For many years, numerous theoretical and practical hashing approaches have been introduced and applied to problems in areas such as computer graphics, database processing, machine learning, and scientific visualization, to name a few [1], [2], [3], [4], [5], [6], [7]. With the emergence of multi-processor CPU systems and thread-based programming, significant research was focused on the design of concurrent, lock-free hashing techniques for single-node, CPU shared-memory systems [8], [9], [10], [11], [12]. Moreover, studies began to investigate external-memory (off-chip) and multi-node, distributed-memory parallel techniques that could accommodate the oncoming shift towards large-scale data processing [13], [14]. These methods, however, do not demonstrate node-level scalability for the massive number of concurrent threads and parallelism offered by current and emerging many-core architectures, particularly graphical processing units (GPUs). GPUs are specifically designed for data-parallel computation, in which the same operation is performed on different data elements in parallel.
CPU-based hashing designs face several notable challenges when ported to GPU architectures:
Sufficient parallelism: Extra instruction- and thread-level parallelism must be exploited to cover GPU global memory latencies and utilize the thousands of smaller GPU compute cores. Data-parallel design is key to exposing this necessary parallel throughput.
Memory accesses: Traditional pointer-based hash tables induce many random memory accesses that may not be aligned within the same cache line, leading to multiple global memory loads that limit throughput on the GPU.
Control flow: Lock-free hash tables that can be both queried and updated induce heavy thread contention for atomic read-write memory accesses. This effectively serializes the control flow of threads and limits the thread-level parallelism on the GPU.
Limited memory: CPU-based hashing leverages large on-chip caching and shared memory to support random-access memory requests quickly. On the GPU, this fast memory is limited in size and can result in more cache misses and expensive global memory loads.
In this study, we survey the state-of-the-art data-parallel hashing techniques that specifically address the above-mentioned challenges in order to meet the requirements of emerging massively-parallel, many-core GPU architectures. These hashing techniques can be broadly categorized into four groups: perfect hashing, open-addressing, spatial hashing, and separate chaining. Each technique is distinguished by the manner in which it resolves collisions during the hashing procedure.
The remainder of this survey is organized as follows. Section 2 reviews the necessary background material to motivate GPU-based data-parallel hashing. Section 3 surveys the four categories of hashing techniques in detail, with some categories consisting of multiple sub-techniques. Section 4 categorizes and summarizes real-world applications of these hashing techniques at a high level. Section 5 synthesizes and presents the findings of this survey in terms of best practices and opportunities for further research. Section 6 concludes the work.
BACKGROUND
The following section reviews concepts that are related to GPU-based data-parallel hashing.
Scalable Parallelism
The increase in available parallelism provided by emerging architectures enables larger workloads and data to be processed in parallel [15] , [16] . Gustafson [17] noted that as a problem size grows, the amount of parallel work increases much faster than the amount of serial work. Thus, a speedup can be achieved by decreasing the serial fraction of the total work. By explicitly parallelizing fine-grained computations that operate on this data, scalable data-parallelism can be attained, whereby a single instruction is performed over multiple data elements (SIMD) in parallel (e.g., via a vector instruction), as opposed to over individual scalar data values (SISD). This differs from task-parallelism, in which multiple tasks of a program conduct multiple instructions in parallel over the same data elements (MIMD) [18] .
Graphical Processing Unit (GPU)
A graphical processing unit is a special-purpose architecture that is designed specifically for high-throughput, data-parallel computations that possess a high arithmetic intensity, that is, a high ratio of arithmetic operations to memory operations [18]. Traditionally used and hard-wired for accelerating computer graphics and image processing calculations, modern GPUs contain many times more execution cores and available thread-level parallelism (TLP) than a CPU of comparable size [19]. This inherent TLP is provided by a group of processors, each of which performs SIMD-like instructions over thousands of independent, parallel threads.
The CUDA C/C++ parallel programming library, provided by NVIDIA, offers an interface for designing algorithms that execute on NVIDIA GPUs and for configuring the underlying hardware [19]. For the remainder of this survey, all references to a GPU will be with respect to an NVIDIA CUDA-enabled GPU, as it is the most common platform of execution among the GPU hashing studies. The following subsection reviews important features of a modern (pre-CUDA 9 and pre-Volta series) GPU architecture; although CUDA 9 introduced new thread scheduling mechanics to accompany the Volta GPU, the following GPU concepts still apply.
On the host CPU, a program, or kernel function, is written in CUDA C/C++ and invoked for execution on the GPU. The kernel is executed N times in parallel by N different CUDA threads, which are dispatched as equally-sized thread blocks. The total number of threads is equal to the number of thread blocks times the number of threads per block, both of which are specified by the user when the kernel is launched. Thread blocks are required to be independent and can be scheduled in any order to be executed in parallel on one of several independent streaming multi-processors (SMs). The number of blocks is typically based on the number of data elements being processed by the kernel or the number of available SMs [19]. Since each SM has limited memory resources available for resident thread blocks, there is a limit to the number of threads per block, typically 1024 threads. Given these memory constraints, all SMs may be occupied at once, leaving some thread blocks inactive. As thread blocks terminate, a dedicated GPU scheduling unit launches new thread blocks onto the vacant SMs.
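The relationship between the number of data elements, thread blocks, and threads can be illustrated with a short calculation. The following is a minimal sketch, written in Python rather than CUDA, that assumes one thread per data element; the function name `launch_config` and the default of 256 threads per block are hypothetical choices made for illustration only.

```python
def launch_config(num_elements, threads_per_block=256, max_threads_per_block=1024):
    """Return (num_blocks, threads_per_block, total_threads) for one thread per element.

    The default block size and the 1024-thread cap are illustrative assumptions.
    """
    if not 0 < threads_per_block <= max_threads_per_block:
        raise ValueError("threads_per_block outside the per-block limit")
    # Ceiling division: enough blocks to cover every element.
    num_blocks = (num_elements + threads_per_block - 1) // threads_per_block
    total_threads = num_blocks * threads_per_block
    return num_blocks, threads_per_block, total_threads


blocks, tpb, total = launch_config(1_000_000)
idle_threads = total - 1_000_000  # threads in the last block that have no element
```

Because blocks are equally sized, the last block usually contains a few surplus threads, which a kernel would typically mask out with a bounds check.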
Each SM chip contains hundreds of ALU (arithmetic logic unit) and SFU (special function unit) compute cores and an interconnection network that provides k-way access to any of the k partitions of off-chip, high-bandwidth global DRAM memory. Memory requests first query a global L2 cache and then only proceed to global memory upon a cache miss. On-chip thread management and scheduling units pack each thread block on the SM into one or more smaller logical processing groups known as warps, typically with 32 threads per warp. The thread manager ensures that each warp is allocated sufficient shared memory space and per-thread registers (user-specified in the kernel program). This on-chip shared memory is designed to be low-latency near the compute cores and can be programmed to serve as L1 cache or different ratios thereof (newer generations now include these as separate memory spaces) [20].
Introduced by NVIDIA in 2006 as part of the Tesla microarchitecture series [21] , Single-Instruction, Multiple Threads (SIMT) is a combination of SIMD and simultaneous multithreading (SMT) execution. SIMT execution is similar to SIMD, but differs in that SIMT applies one instruction to multiple independent warp threads in parallel, instead of to multiple data lanes. In SIMT, scalar instructions control individual threads, whereas in SIMD, vector instructions control the entire set of data lanes. This detachment from the vector-based processing enables threads of a warp to conduct a form of SMT execution, where each thread behaves more like a heavier-weight CPU thread [18] . Each thread has its own set of registers, addressable memory requests, and control flow. Warp threads may take divergent paths to complete an instruction (e.g., via conditional statements) and contribute to starvation as faster-completing threads wait for the slower threads to finish.
The two-level GPU hierarchy of warps within SMs offers massive nested parallelism over data [18]. At the outer, SM level of granularity, coarse-grained parallelism is attained by distributing thread blocks onto independent, parallel SMs for execution. Then at the inner, warp level of granularity, fine-grained data and thread parallelism is achieved via the SIMT execution of an instruction among parallel warp threads, each of which operates on one or more individual data elements. The massive data-parallelism and available compute cores are provided specifically for high-throughput, arithmetically-intense tasks with large amounts of data to be independently processed. If a high-latency memory load is made, then it is expected that the remaining warps and processors will simultaneously perform sufficient work to hide this latency; otherwise, hardware resources remain unused and yield a lower aggregate throughput [22]. The GPU design trades off the lower memory latency and larger cache sizes of a CPU for increased instruction throughput via massive parallel multi-threading [18].
The CUDA C Programming Guide [19] and NVIDIA PTX ISA documentation [21] contain further details on the GPU architecture, execution and memory models, and CUDA programming.
The CUDA Data Parallel Primitives Library (CUDPP) [23] is a library of fundamental data-parallel primitives (DPPs) and algorithms written in NVIDIA CUDA C [19] and designed for high-performance execution on CUDA-compatible GPUs. Each DPP and algorithm incorporated into the library is considered best-in-class and typically published in peer-reviewed literature (e.g., radix sort [24], [25], mergesort [26], [27], and cuckoo hash table [28], [29]). Thus, its data-parallel implementations are constantly updated to reflect the state-of-the-art.
Packaged within each release of CUDA, the Thrust library contains a large collection of DPPs (e.g., sort, scan, reduce, binary search, copy) and data structures [30] for platform-portable usage across both CPUs and NVIDIA GPUs. Each DPP is back-end implemented in both the Intel Thread Building Blocks (TBB) library [31] and OpenMP API [32] for CPU execution, and in CUDA for GPU execution.
Searching via Hashing
For searching an unordered array of elements on the GPU, two canonical data structures exist: the sorted array and the hash table. Both of these data structures are known to be relatively fast to construct on the GPU and are amenable to data-parallel design patterns [33] .
Instead of maintaining elements in sorted order and performing a logarithmic number of lookups per query (e.g., via binary or p-ary search [34]), hash tables compactly reorganize the elements such that only a constant number of direct, random-access lookups are needed on average [35]. More formally, given an unordered array $S$ of $n$ keys (not necessarily distinct), a hash function, $h : S \to H$, maps the keys from $S$ to the range $H = \{j \mid 0 \le j < m\}$ for some arbitrary positive integer $m \ge n$. Defining a memory space over this range of size $m$ specifies a hash table, into which keys are inserted and queried. Thus, the hash table is addressable by the hash function. During an insertion or query operation for a key $q$, the hash function computes an address $h(q) = r$ into $H$. If the location $H[r]$ is empty, then $q$ is either inserted into $H[r]$ (for an insertion) or does not exist in $H$ (for a query). If $H[r]$ contains the key $q$ (for a query), then either $q$ or an associated value of $q$ is returned,¹ indicating success. Otherwise, if multiple distinct keys $q' \ne q$ are hashed to the same address $h(q') = r$, then a situation known as a hash collision occurs. These collisions are typically resolved via separate chaining (i.e., employing linked lists to store multiple keys at a single address) or open-addressing (e.g., when an address is occupied, then store the key at the next empty address).
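The insertion and query procedure above, with collisions resolved by separate chaining, can be sketched as follows. This is a minimal serial illustration and not a GPU implementation; the class name `ChainedHashTable` and the modulo hash function are assumptions made for the example, and Python lists stand in for linked lists.

```python
class ChainedHashTable:
    """Serial separate-chaining table with m addresses (illustrative sketch)."""

    def __init__(self, m):
        self.m = m
        self.buckets = [[] for _ in range(m)]  # one chain per address

    def _h(self, key):
        return key % self.m  # placeholder hash function (assumption)

    def insert(self, key, value):
        chain = self.buckets[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                 # key already present: overwrite its value
                chain[i] = (key, value)
                return
        chain.append((key, value))       # empty address or collision: extend the chain

    def query(self, key):
        # Every key chained at the address must be compared against the query key.
        for k, v in self.buckets[self._h(key)]:
            if k == key:
                return v
        return None


table = ChainedHashTable(m=8)
for key in (3, 11, 19, 4):   # 3, 11, and 19 all hash to address 3
    table.insert(key, key * 10)
```

The chain at address 3 holds three keys here, so a query for any of them may require up to three comparisons, which is the source of the work imbalance discussed later for parallel threads.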
The occurrence of collisions deteriorates the query performance, as each of the collided keys must be iteratively inspected and compared against the query key. According to the birthday paradox, with a discrete uniform distribution hash function that outputs a value between 1 and 365 for any key, the probability that at least two keys in a set of only 23 random keys hash to the same address is approximately 50 percent [36]. Thus, for a large number of keys (n) and small hash table (m), hash collisions are inevitable.
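The birthday-paradox figure can be checked numerically. The sketch below assumes an ideal, uniformly distributed hash function and computes the probability that at least two of n keys share one of m addresses; the function name is chosen for illustration.

```python
def collision_probability(n, m):
    """Probability that at least two of n uniformly hashed keys share one of m addresses."""
    if n > m:
        return 1.0  # pigeonhole principle: a collision is certain
    p_no_collision = 1.0
    for i in range(n):
        # The (i+1)-th key must avoid the i addresses already taken.
        p_no_collision *= (m - i) / m
    return 1.0 - p_no_collision


p23 = collision_probability(23, 365)  # roughly 0.507
```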
In order to minimize collisions, an initial approach is to use a good quality hash function that is both efficient to compute and distributes keys as evenly as possible throughout the hash table [35]. One such family of functions consists of randomly-generated, parameterized functions of the form $h(k) = ((a \cdot k + b) \bmod p) \bmod |H|$, where $p$ is a large prime number and $a$ and $b$ are randomly-generated constants that bias $h$ from outputting duplicate values [29]. However, $h$ is a function of the table size, $|H|$. If $|H|$ is too small, then not even the best of hash functions can avoid an increase in collisions. Given the table size, the load factor $\alpha$ of the table is defined as $\alpha = n/|H|$, that is, the fraction of occupied addresses in the hash table, where $|H|$ is typically larger than $n$. If new keys are inserted into the table and $\alpha$ reaches a maximum threshold, then typically the table is reallocated to a larger size and all the keys are rehashed into the table.
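As an illustration of this family of hash functions and of the load factor, consider the following sketch. The prime $p = 2^{31} - 1$, the helper names `make_hash` and `load_factor`, and the seeded random generator are assumptions made for the example and are not taken from the cited work.

```python
import random


def make_hash(table_size, p=2_147_483_647, rng=None):
    """Draw one function h(k) = ((a*k + b) mod p) mod |H| from the family.

    p = 2**31 - 1 is a prime chosen here for illustration.
    """
    rng = rng or random.Random()
    a = rng.randrange(1, p)   # a != 0, otherwise every key maps to the same address
    b = rng.randrange(0, p)

    def h(k):
        return ((a * k + b) % p) % table_size

    return h


def load_factor(n, table_size):
    """alpha = n / |H|: the fraction of occupied addresses."""
    return n / table_size


h = make_hash(table_size=16, rng=random.Random(0))
addresses = [h(k) for k in range(100)]
alpha = load_factor(n=12, table_size=16)  # 0.75
```

Drawing fresh values of a and b yields a new function from the same family, which is what a hashing restart relies on when a table must be rebuilt.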
Finally, a hash table is static if it does not support modification after being constructed; that is, the table is only constructed to handle query operations. Thus, a static hash table also does not support mixed operations and the initial batch of insertions used to construct the table (bulk build) must be completed before the batch of query operations. A hash table that can be updated, or mutated, via insertion and deletion operations post-construction is considered dynamic.
While collision resolution is straightforward to implement in a serial CPU setting, it does not easily translate to a parallel setting, particularly on massively-threaded, data-parallel GPU architectures. GPU-based hashing presents several notable challenges:
The random-access nature of hashing can lead to disparate writes and reads by parallel-cooperating threads on the GPU, which performs best when memory accesses are coalesced or spatially coherent.
The limited memory available on a GPU puts restrictions on the maximum hash table size and number of tables that can reside on device.
Collision resolution schemes handle varying numbers of keys that are hashed and chained to the same address (separate chaining), or varying numbers of attempts to place a new, collided key into an empty table location (open-addressing). This variance causes some insert and query operations to require more work than others. Thus, a performance bottleneck arises when faster, non-colliding threads wait for slower, colliding threads to finish.
Searching via the construction and usage of a hash table on the GPU has recently received a breadth of new research, with a variety of different parallel designs and applications, ranging from collision detection to surface rendering to nearest neighbor approximation. The following section covers these GPU-based parallel hashing approaches.
HASHING TECHNIQUES
We consider four different categories of hashing techniques: perfect hashing, open-addressing, spatial hashing, and separate chaining. Each category is discussed in a separate subsection and distinguished by its method of handling hash collisions or placement of elements within the hash table. At the end of each subsection of a technique, we summarize the most suitable use cases for the technique. Later, in Section 5, we analyze and synthesize these techniques in more detail in terms of GPU performance characteristics and usage.
Perfect Hashing
Perfect hashing maps each key to a unique address in the hash table, resulting in no collisions and enabling single-probe queries. If the length of the hash table $m$ is equal to the number of keys $n$, then a perfect hash function over the keys is minimal and effectively scatters, or permutes, the keys within the table.
¹ In practice, the values should be easily stored and accessible within an auxiliary array or via a custom arrangement within the hash table.
In theory, obtaining a perfect hash function, especially for large sets of keys, is a difficult, low-probability task. As a reinterpretation of the classical birthday paradox, only one in ten million hash functions $h \in H$ is a perfect hash function for $n = 31$ keys mapped into $m = 41$ locations. When $m = n$, $P(n, m) = n!/n^n$, which is the probability of achieving a minimal perfect hash [37]. For larger key set sizes $n$, such as those seen in practical applications, the minimal perfect hash probability decreases very rapidly and is approximated as $e^{-n}$.
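These probabilities can be reproduced with a short calculation. The sketch below assumes that a hash function is drawn uniformly at random from all functions mapping n keys into m locations, so that the probability of no collision is $m!/((m-n)!\,m^n)$, which reduces to $n!/n^n$ when $m = n$; the function name is chosen for illustration.

```python
import math


def perfect_hash_probability(n, m):
    """Probability that a uniformly random function sends n keys to n distinct addresses out of m."""
    if n > m:
        return 0.0
    # math.perm(m, n) = m! / (m - n)! counts the injective assignments.
    return math.perm(m, n) / m ** n


p = perfect_hash_probability(31, 41)      # on the order of one in ten million
p_min = perfect_hash_probability(10, 10)  # minimal case: 10! / 10**10
```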
In practice, a perfect hash function can be described as an imperfect hash function that is then iteratively corrected into a perfect form. Lefebvre and Hoppe [38] introduce a perfect spatial hashing (PSH) approach that is the first GPU-specific perfect hashing approach. In this approach, a minimal perfect hash function and table are constructed over a sparse set of multi-dimensional spatial data (pixel keys with RGB color values), while simultaneously ensuring locality of reference and coherency among hashed points. Thus, spatially-close points are queried coherently, in parallel, by threads within the same warp. In order to maximize memory coalescing among these threads on the GPU, points are also coherently accessed within the hash table, as opposed to via a random access pattern. The minimal perfect hash function $h$ is composed of two imperfect hash functions, $h_0$ and $h_1$, and an offset table F that "jitters" the imperfect functions into a perfect form via an iterative process.
Note that the construction of F is an inherently sequential process, since the assignment of offset values depends on earlier offset values or hashed points in H. Moreover, the construction of H and F is performed on the CPU, due to large memory requirements and presumed usage as a pre-processing step; thus H must be copied into GPU device memory over the PCIe bus. In addition, H is designed to be static, since insertions or deletions of points after the initial construction will likely destroy the perfect hash and require F to be reconstructed.
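To make the sequential dependence concrete, the following sketch builds an offset table greedily for one-dimensional integer keys. It is a heavily simplified illustration and not the construction of Lefebvre and Hoppe [38]: the modulo hash functions, the combined form (h0(k) + F[h1(k)]) mod m, the largest-group-first ordering, and all names are assumptions, and the resulting table is perfect but not minimal.

```python
def build_offset_table(keys, m, r):
    """Greedy 1D sketch: choose offsets F so that (k % m + F[k % r]) % m is collision-free."""
    groups = {}
    for k in keys:
        groups.setdefault(k % r, []).append(k)   # keys that share offset entry h1(k) = k % r
    table = [None] * m
    offsets = [0] * r
    # Sequential step: each group's offset depends on the slots taken by earlier groups.
    for j, group in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        for offset in range(m):
            slots = [(k % m + offset) % m for k in group]
            if len(set(slots)) == len(slots) and all(table[s] is None for s in slots):
                for k, s in zip(group, slots):
                    table[s] = k
                offsets[j] = offset
                break
        else:
            raise RuntimeError("no valid offset; retry with a larger m or r")
    return table, offsets


def perfect_hash(k, m, offsets):
    return (k % m + offsets[k % len(offsets)]) % m


keys = [3, 17, 20, 42, 57, 65, 81, 99]
table, offsets = build_offset_table(keys, m=11, r=4)
```

Once the table is built, each query needs one lookup into the offset table and one into the hash table, with no collision resolution.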
In summary, perfect hashing is most suitable for hashing tasks that have the following properties:
Need to avoid collision resolution.
Do not anticipate dynamic updates and insertions.
Can afford constructing the perfect hash table on the CPU host.
Open-Addressing
In open-addressing, a key is inserted into the hash table by probing through alternate table locations (the probe sequence) until a location is found to place the element [35]. The determination of where to place the element varies by probing scheme: some schemes probe for the first unused location (empty slot), whereas others evict the currently-residing key at the probe location (i.e., a collision) and swap in the new key. Each probe location is specified by a hash function unique to the probing scheme. Thus, some probe sequences may be more compact or greater in length than others, depending on the probing method. For a query operation, the locations of the probe sequence are computed and followed to search for the queried key in the table.
Each probing method trades off different measures of performance with respect to GPU-based hashing. A critical influence on performance is the load factor, which is the percentage of occupied locations in the hash table (Section 2.3). As the load factor increases towards 100 percent, the number of probes needed to insert or query a key increases greatly. Once the table becomes full, probing sequences may continue indefinitely, unless bounded, and lead to insertion failure and possibly a hashing restart, whereby the hash table is reconstructed with different hash functions and parameters. Moreover, for threads within a warp on the GPU, variability in the number of probes per thread can induce branch divergence and inefficient SIMD parallelism, as all the threads will need to wait for the worst-case number of probes to execute the next instruction.
The following subsections review research on open-addressing probing for GPU-based hashing, distinguishing each study by its general probing scheme: linear probing, cuckoo hashing, double hashing, and robin hood hashing. Fig. 1 illustrates these four open-addressing probing schemes for the case of a single thread inserting a single key into a hash table.
Linear Probing Hashing
Linear probing is the most basic method of open-addressing. In this method, a key $k$ first hashes to location $h(k)$ in the hash table. Then, if the location is already occupied, $k$ linearly searches locations $h(k) + 1, h(k) + 2, \ldots$ until an empty slot (insertion) or $k$ itself (query) is found. If $h(k)$ is empty, then $k$ is inserted immediately, without probing; otherwise, in the worst case, $O(n)$ probes will need to be made to locate $k$ or an empty slot, where $n$ is the size of the hash table. An improved variant of linear probing is quadratic probing, which replaces the linear probe sequence starting at $h(k)$ with successive values of an arbitrary quadratic polynomial: $h(k) + 1^2, h(k) + 2^2, \ldots$. Both of these probing methods can incur a long probe sequence to find an empty slot, possibly resulting in failure during an insert.
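A serial sketch of linear probing illustrates how colliding keys lengthen the probe sequence. The modulo hash function, the wrap-around at the end of the table, and the function names are assumptions made for this example.

```python
EMPTY = None


def lp_insert(table, key):
    """Probe h(k), h(k)+1, ... (wrapping around); return the number of probes, or -1 if full."""
    m = len(table)
    for i in range(m):
        slot = (key % m + i) % m          # placeholder hash h(k) = k mod m
        if table[slot] is EMPTY or table[slot] == key:
            table[slot] = key
            return i + 1
    return -1


def lp_query(table, key):
    m = len(table)
    for i in range(m):
        slot = (key % m + i) % m
        if table[slot] is EMPTY:          # an empty slot ends the probe sequence
            return False
        if table[slot] == key:
            return True
    return False


table = [EMPTY] * 8
probes = [lp_insert(table, k) for k in (5, 13, 21, 6)]  # 5, 13, and 21 all hash to slot 5
```

The key 6 does not collide with any key at its own address, yet it still needs three probes because the earlier keys spilled into its neighborhood; this clustering is what quadratic probing and double hashing aim to reduce.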
Bordawekar [39] develops an open-addressing approach based on multi-level bounded linear probing, where the hash table has multiple levels to reduce the number of lookups during linear probing. In the first level hash table, each key hashes to a location $h_1(k)$ and then looks for an empty location, via linear probing, within a bounded probe region $P_1 = [h_1(k), h_1(k) + (j - 1)]$, where $j$ is the size of the region. If an empty location is not found, then the key must be inserted into the second-level hash table, which is accomplished by hashing to location $h_2(k)$ and linear probing within another, yet larger, probe region $P_2$. This procedure continues for each level, until an empty location is found. In this work, only 2-level and 3-level hash tables are considered; thus, a thread must perform bounded probing on a key for at most three rounds, before declaring failure. To query a key, a thread completes the same hashing and probing procedure. In a data-parallel fashion, each thread within a warp is assigned a key from the bounded probe region and compares this key with the query key, using warp-level voting to communicate success or failure. This continues across warps, for each hash table level.
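The multi-level bounded probing described above can be sketched serially as follows. This is an illustrative simplification rather than the implementation of [39]: the per-level hash functions, the wrap-around within each table, the region sizes, and the function names are all assumptions.

```python
def ml_insert(levels, key, bounds):
    """levels: list of tables; bounds[l]: probe-region size j at level l.

    Returns (level, slot) on success, or None if every bounded region is full.
    """
    for l, (table, j) in enumerate(zip(levels, bounds)):
        m = len(table)
        start = (key * (2 * l + 3)) % m       # hypothetical per-level hash h_l(k)
        for i in range(j):                     # bounded region [h_l(k), h_l(k) + (j - 1)]
            slot = (start + i) % m
            if table[slot] is None or table[slot] == key:
                table[slot] = key
                return l, slot
    return None


def ml_query(levels, key, bounds):
    for l, (table, j) in enumerate(zip(levels, bounds)):
        m = len(table)
        start = (key * (2 * l + 3)) % m
        # On a GPU, the j slots of a region would be compared by warp threads in parallel.
        if any(table[(start + i) % m] == key for i in range(j)):
            return True
    return False


levels = [[None] * 8, [None] * 8]
bounds = [2, 4]                                # the second-level region is larger
placements = [ml_insert(levels, k, bounds) for k in (0, 8, 16)]
```

Bounding each region caps the number of probes per level, so the worst-case work per thread is fixed by the sum of the region sizes rather than by the table occupancy.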
Experimental results reveal that this approach, with both two and three levels (and hash functions), does not perform as fast as the cuckoo hashing of Alcantara et al. [28] (Section 3.2.2) for the largest batches of key-value pairs (hundreds of millions); for smaller batches, the multi-level approaches are the best performers.
Jünger et al. [40] introduce a WarpDrive hashing technique that performs integer key-value pair insertions and queries using coalesced groups (CG) of threads, each consisting of $g \in \{1, 2, 4, 8, 16, 32\}$ consecutive threads in the same thread block ($g = 32$ corresponds to a traditional warp). A CG is initially assigned a probing window of $g$ consecutive locations in the hash table, corresponding to the hash index $h(k)$ of a pair $k$ to be inserted. Then, within this window, each of the $g$ threads in the CG linear probes $32/g$ evenly-spaced locations for an empty entry. After each probe, the threads communicate their result via a warp-wide ballot intrinsic instruction, which broadcasts a packed $g$-bit integer to all threads of the CG. If at least one thread discovered an unoccupied slot, then the pair is inserted in the leftmost unoccupied slot. Otherwise, the threads continue this process in lockstep for the remaining probes. If, after $32/g$ probes, an unoccupied slot has not been found, then the CG is assigned a different probing window of size $g$ within the hash table, and the threads perform the linear probing again. This outer window-based probing continues for a user-specified maximum number of probes, and a similar routine is used to query key-value pairs.
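The ballot-based selection of the leftmost unoccupied slot can be illustrated with a serial simulation. The sketch below is a simplification and not the WarpDrive implementation: it assumes that each of the g threads inspects exactly one slot per window, that successive windows are adjacent, and that the hash function is a plain modulo; a real GPU version would also need an atomic compare-and-swap to claim the slot.

```python
def group_insert(table, key, g, max_windows):
    """Simulate a group of g threads inspecting one window of g consecutive slots per step."""
    m = len(table)
    start = key % m                                    # placeholder hash h(k) (assumption)
    for w in range(max_windows):
        base = (start + w * g) % m                     # assumed: next window is adjacent
        # Thread t checks slot base + t; the ballot packs the g results into a bitmask.
        ballot = 0
        for t in range(g):
            if table[(base + t) % m] is None:
                ballot |= 1 << t
        if ballot:
            # Lowest set bit = lowest thread index = leftmost unoccupied slot in the window.
            leftmost = (ballot & -ballot).bit_length() - 1
            slot = (base + leftmost) % m
            table[slot] = key
            return slot
    return -1                                          # probing budget exhausted


table = [None] * 16
slots = [group_insert(table, k, g=4, max_windows=2) for k in (2, 18, 34, 50, 66)]
```

All five keys hash to slot 2, so the first window fills up after four insertions and the fifth key is placed in the second window.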
The authors compare the insertion and query throughput performance of WarpDrive to that of the CUDPP cuckoo hash table from Alcantara et al. [29] (Section 3.2.2). Randomly-generated sets of $2^{27}$ unique key-value pairs were inserted into and queried from the WarpDrive hash table using several configurations of CG size $g$ and table load factor $f$. The experimental findings reveal that WarpDrive demonstrates speedups over CUDPP of up to 2.84x for insertions and 1.34x for queries at higher load factors $f \in \{0.85, 0.9, 0.95\}$ and lower CG sizes $g \in \{4, 8\}$. While larger CG sizes increase the probability of finding an unoccupied slot within a given probing window, they may induce a lower occupancy rate on the GPU SMs, as fewer windows are available to probe in parallel. Also, as the load factor increases, larger CG sizes become more favorable but still achieve lower throughput performance than smaller sizes ($g \in \{4, 8\}$). All single-device experiments were conducted on an NVIDIA Tesla P100 GPU (16 GB global memory) using the cooperative groups and independent thread scheduling features of CUDA 9, which enable thread groups of a variable, non-warp size. However, WarpDrive is designed to also support the traditional warp synchronization primitives of pre-CUDA 9 (i.e., for $g = 32$).
Cuckoo Hashing
In cuckoo hashing, each key is assigned two locations in the hash table, as specified by primary and secondary hash functions [41] . When inserting a new key, its first location is probed with the primary function and the contents of the location are inspected. If the slot is empty, then the key is inserted and the probe sequence ends. Otherwise, a collided key already occupies the slot and the cuckoo eviction procedure begins. First, the occupying key is evicted and hashed to the location specified by its secondary function, where its contents are probed as before. This eviction chain continues until either the evicted key is successfully inserted or a maximum chain length is reached. If the eviction is successful, then the new key is finally inserted at its primary location (first probe). Numerous follow-up studies to this canonical approach have introduced cuckoo hashing approaches with more than two hash functions (probes) per key, a separate hash table for each hash function, and other optimizations for concurrent, mixed operations (e.g., simultaneous inserts and queries) on the GPU. These studies are surveyed as follows.
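The canonical two-choice eviction procedure can be sketched serially as follows. The two placeholder hash functions, the single shared table, the chain bound, and the function names are assumptions made for illustration; the GPU variants surveyed below differ in the number of hash functions and in how concurrent evictions are handled.

```python
def cuckoo_insert(table, key, max_chain=16):
    """Two-choice cuckoo insertion into a single table; returns True on success."""
    m = len(table)
    h1 = lambda k: k % m                  # primary hash (placeholder)
    h2 = lambda k: (k // m) % m           # secondary hash (placeholder)
    slot = h1(key)
    for _ in range(max_chain):
        # Place the key in hand and pick up whatever occupied the slot.
        table[slot], key = key, table[slot]
        if key is None:
            return True                   # the slot was empty: the eviction chain ends
        # The evicted key moves to whichever of its two locations it was not in.
        slot = h2(key) if slot == h1(key) else h1(key)
    # Chain too long: the key still in hand is left out of the table, and a
    # real implementation would rebuild the table with new hash functions.
    return False


def cuckoo_query(table, key):
    m = len(table)
    return table[key % m] == key or table[(key // m) % m] == key


table = [None] * 8
results = [cuckoo_insert(table, k) for k in (3, 11, 19)]   # all share primary slot 3
```

A query inspects at most two locations regardless of how long the eviction chains were during insertion, which is why cuckoo hashing keeps the per-thread query work uniform.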
Alcantara et al. [28] introduce a two-phase hashing technique based on perfect hashing and cuckoo hashing that seeks to maximize shared-memory usage during cuckoo hashing. First, elements are hashed into bucket regions within the hash table, following the perfect hashing approach of Fredman et al. [42]. The maximum occupancy of each bucket is the number of threads in a thread block (e.g., 512), such that the entire bucket can fit within shared memory. The hash function aims to coherently map elements into buckets such that each bucket, on average, maintains an occupancy of 80 percent and contains spatially-nearby elements, enabling coalescing of memory among threads during queries. Then, within each bucket in shared memory, cuckoo hashing is performed to insert or query an element, using three different hash functions $h_i$ (i.e., the multiple choices), each corresponding to a sub-table $T_i$.
The querying throughput performance of this technique is compared against that of the perfect hashing technique of Lefebvre and Hoppe [38] and a data-parallel binary search of the elements after being sorted with the radix sort of Satish et al. [26] . Experimental results reveal the following:
For querying elements (voxels in a 3D grid) in a randomized order, the two-phase cuckoo hashing outperforms both the perfect hashing and the binary search, particularly above 5 million elements.
For querying in a sequential order, the binary search demonstrates better throughput than two-phase cuckoo hashing, due to more favorable thread branch divergence and memory coalescing among the sorted elements.
Constructing the hash table of elements with two-phase cuckoo hashing is comparably fast to radix-sorting the elements, with noticeable slowdowns due to more non-coalesced write operations. Moreover, for large numbers of insertions, both approaches are orders of magnitude faster than constructing the perfect spatial hash table, which is initially built on the CPU and copied onto the GPU for subsequent querying.
Ashkiani et al. [25] design a set of multisplit DPPs for the GPU that efficiently permute elements into contiguous buckets. While this study is not focused on hashing, it recommends that the multisplit can be used to map elements into the first level of buckets in a multi-level hash table, such as the two-phase hash table of Alcantara et al. [28]. Moreover, this work contributes a reduced-bit radix sort that converges to and exceeds the performance of state-of-the-art radix sort [24] as the number of buckets is increased. Thus, if the order of insertions and queries into a bucket-based hash table is non-random and ordered, then this sorting primitive may offer an effective substitution for a bucketing procedure. These primitives have since been incorporated into the CUDPP library [23].
Alcantara et al. [29] improve upon their original work [28] by introducing a parallel variant of cuckoo hashing that can vary in the number of hash functions, hash table size, and maximum length of a probe-and-eviction sequence. In their original work, the authors hypothesized that cuckoo hashing within GPU global memory would encounter performance bottlenecks due to the shuffling of elements each iteration and the use of global synchronization primitives; thus, shared memory was used extensively in the two-level cuckoo hashing scheme. However, in this follow-up work, a single-level hash table is constructed entirely in global memory and addressed directly with the cuckoo hash functions, without the first-level bucket hash. The cuckoo hashing dynamics remain the same, except that the probing and evicting of elements occurs over the entire global memory hash table, as opposed to the shared-memory buckets of the two-level approach.
The insertion and query throughput performance of this single-level cuckoo hash table is compared against that of Merrill's radix sort plus binary search [24] and the authors' previous two-level cuckoo hash table. Experimental results reveal the following:
Insertions. For large numbers of insertions (millions of key-value pairs), the radix sort [24] becomes increasingly faster than both hashing methods, with a much higher throughput. For the same size hash table, the single-level hash table is constructed significantly faster than the two-level table, due to shorter eviction chains on average, over all insertion input sizes.
Queries: Binary Search versus Hashing. For random, unordered queries, binary search probing of the radix-sorted elements is much slower than cuckoo hash probing of the elements. This arises from uncoalesced global memory reads and branch divergence for many of the threads, which use the maximum $O(\log N)$ probes.
Queries: Two-Level versus Single-Level. When all queried elements exist in the hash table, the single-level cuckoo hashing makes a smaller average number of probes per query than the two-level approach, leading to faster completion times. However, when a large percentage of the queried elements do not exist in the hash table, the two-level hashing needs fewer worst-case probes before declaring the element as not found. This is because the single-level hashing uses four hash functions, or probes, to query an element, whereas the two-level hashing only uses three functions. By setting the number of hash functions to three in the single-level hashing, the authors observe comparable querying throughput between the two approaches.
This work has since been incorporated into the CUDPP library [23].
Breslow et al. [43] introduce a bucketized variant of cuckoo hashing that allows for higher load factors, improved bucket load balancing, and a lower expected number of bucket lookups (less than 2) for both positive and negative queries. In this Horton table, a row is maintained for each bucket, which is denoted as either Type A or Type B. Each key is hashed by its primary hash function into the primary bucket. If the primary bucket is full, then the key either hashes, via one of its secondary hash functions, to a secondary bucket (after which the key is denoted a secondary item) or replaces a secondary item in the primary bucket. If the key is a secondary item, then it is placed in the secondary bucket that is least full; note that several secondary hash functions (and buckets/rows) can be specified. Then, the filled primary bucket is promoted (if not already) to Type B and its last stored key is evicted (moved to a secondary bucket) to make room for a compact remap entry array that stores an index, or remap entry, to the secondary bucket of each secondary item. This important feature permits all secondary items to be efficiently tracked, allowing no more than two probes and hash function evaluations per query.
Experimental results of large query sets reveal that most successful lookups occur within the primary buckets, allowing a high load factor with only one hashing probe. For all successful queries, the Horton table increases the query throughput over the baseline by 17 to 35 percent. For a set of all unsuccessful queries, the Horton table increases throughput by 73 to 89 percent over the baseline, needing only one hash probe to detect failure. In this approach, only query operations are conducted in data-parallel fashion on the GPU. The detailed insertion and construction phase is performed on the CPU, which is sufficient for query-heavy usage.
Double Hashing
Double hashing first hashes a key $k$ to location $h(k)$ in the hash table and then, if the location is already occupied, computes another independent hash $h'(k)$ that defines the step size to the next probing location [35]. Thus, each subsequent probe location is $h(k) + i \cdot h'(k)$, taken modulo the table size, where $i$ is the index of the current probe in the probe sequence. This hashing and probing continues until an empty slot (insertion) or $k$ itself (query) is found. Similar to linear and quadratic probing, if $h(k)$ is empty, then $k$ is inserted immediately, without probing. The choice of the second hash function has a large impact on performance, as it dictates the locality of consecutive probes and, thus, the opportunity for memory coalescing among threads on the GPU.
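The probe sequence can be illustrated with a short sequential sketch. The two hash functions below (`key % table_size` and `1 + key % (table_size - 1)`) are assumptions chosen for readability, and the table size is assumed to be prime so that every step size visits all slots; the names `probe_sequence`, `dh_insert`, and `dh_query` are likewise hypothetical.

```python
def probe_sequence(key, table_size, max_probes):
    """Slots visited by double hashing: (h(k) + i * h'(k)) mod table_size."""
    h = key % table_size                   # primary hash h(k)
    step = 1 + key % (table_size - 1)      # secondary hash h'(k), never zero
    return [(h + i * step) % table_size for i in range(max_probes)]


def dh_insert(table, key):
    """Place key in the first empty slot along its probe sequence."""
    for slot in probe_sequence(key, len(table), len(table)):
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")


def dh_query(table, key):
    """Return the slot holding key, or -1 once an empty slot is reached."""
    for slot in probe_sequence(key, len(table), len(table)):
        if table[slot] is None:
            return -1
        if table[slot] == key:
            return slot
    return -1


table = [None] * 7
placed = [dh_insert(table, k) for k in (10, 17, 24)]   # all three have h(k) = 3
```

In this example the three keys collide at slot 3 but take different step sizes (5, 6, and 1), so they end up in slots 3, 2, and 4. The scattered second probes also show why consecutive probes of neighboring threads are unlikely to coalesce unless the second hash function is chosen with locality in mind.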
Khorasani et al. [44] introduce a stadium hashing (Stash) technique that builds and stores the hash table in out-of-core host memory, and resolves insert collisions via double hashing until an empty slot is found. In GPU global memory, a compact auxiliary ticket-board data structure is maintained to grant read and write accesses to the hash table. For each hash table location, the ticket board maintains a ticket, which consists of a single availability bit and a small number of optional info bits. The availability bit indicates whether the location is empty (set to 1) or occupied by a key (set to 0), while the info bits are a small generated signature of the key that helps identify the key prior to accessing its value. A larger ticket size (more info bits per key) helps improve the number of operations per second by reducing the number of expensive host memory accesses over the PCIe bus. This improvement is especially significant for unnecessary queries of elements that do not actually reside in the host hash table. Finally, within individual thread warps, a shared-memory, collaborative lanes (clStash) load-balancing scheme is used to ensure that, during insertions, all threads are kept busy, preventing starvation by unsuccessful threads.
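The filtering role of the ticket board can be sketched as follows. This is a simplified, sequential model rather than the Stash implementation: the class name `TicketedTable`, the `signature` function, the choice of three info bits, and the double-hashing functions are all assumptions of the example, and host memory is stood in for by ordinary lists with a counter for simulated host reads.

```python
INFO_BITS = 3  # assumed ticket width: 1 availability bit plus 3 info bits


def signature(key):
    """Hypothetical key signature: a few bits of a multiplicative hash."""
    return ((key * 2654435761) >> 16) & ((1 << INFO_BITS) - 1)


class TicketedTable:
    def __init__(self, size):
        self.size = size                  # assumed prime for double hashing
        self.host_keys = [None] * size    # stands in for the host-memory table
        self.host_vals = [None] * size
        self.available = [1] * size       # availability bit: 1 = empty, 0 = occupied
        self.info = [0] * size            # info bits: small key signature
        self.host_reads = 0               # simulated accesses over the PCIe bus

    def _probes(self, key):
        h, step = key % self.size, 1 + key % (self.size - 1)
        return ((h + i * step) % self.size for i in range(self.size))

    def insert(self, key, value):
        for slot in self._probes(key):
            if self.available[slot]:
                self.available[slot] = 0
                self.info[slot] = signature(key)
                self.host_keys[slot], self.host_vals[slot] = key, value
                return slot
        raise RuntimeError("table is full")

    def query(self, key):
        sig = signature(key)
        for slot in self._probes(key):
            if self.available[slot]:
                return None               # an empty slot ends the probe sequence
            if self.info[slot] != sig:
                continue                  # signature mismatch: skip the host read
            self.host_reads += 1          # only now touch the host table
            if self.host_keys[slot] == key:
                return self.host_vals[slot]
        return None
```

With more info bits, fewer non-matching keys share a signature with the query key, so fewer host reads are wasted. Queries for absent keys benefit most, since every host read they trigger is a false positive.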
Robin Hood Hashing
Robin Hood hashing [45] is an open-addressing technique that resolves hash collisions based on the age of the collided keys. The age of a key is the length of the probe sequence, $h_1(k), h_2(k), \ldots$, needed to insert the key into an empty slot in the hash table. During a collision at a probe location, the key with the youngest age is evicted and the older key inserted into that location. The evicted key is then robin hood hashed again until it is placed in a new empty location, incrementing its age along the new probe sequence. The idea of this approach is to prevent long probe sequences by favoring keys that are difficult to place. Even in a full table with high load factor, this eviction policy guarantees an expected maximum age of $\Theta(\log n)$ for an insert or query key. However, the worst-case maximum age $M$ may still be higher and worse than the maximum probe sequence length of cuckoo hashing, prompting a table reconstruction in some cases. These maximum $M$ probes will be required during queries for empty keys (those which do not exist in the hash table), unless they are detected and rejected early.
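The age-based eviction rule and one form of early rejection can be sketched as follows. The example assumes a linear probe sequence, $h_i(k) = (k + i - 1) \bmod n$, distinct keys, and no deletions; these choices and the names `rh_insert` and `rh_query` are assumptions made for illustration, not details taken from [45].

```python
def rh_insert(table, key):
    """Robin Hood insertion; table entries are (key, age) tuples or None.
    Returns False if the table is already full."""
    size = len(table)
    if all(entry is not None for entry in table):
        return False
    cur_key, cur_age = key, 1                      # age counts probes, starting at 1
    while True:                                    # ends at the first empty slot
        slot = (cur_key % size + cur_age - 1) % size
        entry = table[slot]
        if entry is None:
            table[slot] = (cur_key, cur_age)
            return True
        if entry[1] < cur_age:                     # resident key is younger: evict it
            table[slot] = (cur_key, cur_age)
            cur_key, cur_age = entry
        cur_age += 1                               # carried key moves to its next probe


def rh_query(table, key):
    """Return the age at which key is found, or -1 if it is absent."""
    size = len(table)
    for age in range(1, size + 1):
        entry = table[(key % size + age - 1) % size]
        # A resident younger than the current probe count means key would
        # have displaced it during insertion, so key cannot be in the table.
        if entry is None or entry[1] < age:
            return -1
        if entry[0] == key:
            return age
    return -1


table = [None] * 8
for k in (1, 0, 8):
    rh_insert(table, k)
```

Here key 8 collides with key 0 at slot 0, reaches slot 1 with age 2, and evicts the younger key 1 (age 1), which is then reinserted at slot 2 with age 2. A query for the absent key 16 stops at its third probe because the resident there has age 2, illustrating how empty queries can be rejected before the maximum age is reached.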
Garcia et al. [46] introduce a data-parallel robin hood hashing scheme that maintains coherency among thread memory accesses in the hash table. Neighboring threads in a warp are assigned neighboring keys to insert or query from a spatial domain (e.g., pixels in an image or voxels in a volume). By specifying a coherent hash function, neighboring keys will be hashed near each other in the hash table and the threads can then access memory in a coalesced fashion (i.e., as part of the same memory transaction). Thus, the sequence of probes for groups of threads will likely also be conducted in a coherent manner, as nearby keys of a young age are evicted and replaced by nearby keys of an older age. In the absence of coherence in the access patterns, coherent hashing brings little to no benefit compared to random access robin hood hashing or the single-level cuckoo hashing of Alcantara et al. [29] (Section 3.2.2). Thus, this approach is of particular use for applications with spatial coherence in the data, such as inserting a sparse subset of pixels from an image (e.g., all the non-white pixels) into the hash table, and then querying every pixel to reconstruct the image.
In summary, open-addressing is most suitable for hashing tasks that contain the following properties:
- Desire to operate on pre-allocated, indexable hash table arrays, without pointer chasing and dynamic memory allocation.
- Can afford to reconstruct the hash table on the GPU upon the failure to insert one or more key-value pairs.
- Make effective use of GPU shared memory and warp-wide instructions.
Spatial Hashing
Most real-world use cases of searching require a data structure that can support lookups of geometric primitives (e.g., point coordinates, polygonal shapes, and voxels) that exist within a multi-dimensional spatial domain, such as $\mathbb{R}^2$, $\mathbb{R}^3$, or $\mathbb{R}^n$. One approach is to explicitly compute a bounding box over the domain and then recursively subdivide it into smaller and smaller regions, or cells, which contain a group of primitives or a subset of the spatial domain. This subdivision hierarchy can be represented by a grid (e.g., uniform and two-level) or tree (e.g., k-d tree, octree, or bounding volume hierarchy) data structure that conducts a query operation by traversing a path through the hierarchy until the queried primitive is found. While these structures are designed for fast, highly-parallel usage, they typically do not exhibit fast reconstruction rates due to complex spatial hierarchies, and may contain deep tree structures that are conducive to thread branch divergence during parallel query traversals. These attributes are particularly important to real-time, interactive applications, such as surface reconstruction and rendering, that make frequent updates and queries to the acceleration structure.
An alternative approach that addresses these limitations is to perform spatial hashing over the primitives, whereby the multi-dimensional domain is projected, or compressed, to a single dimension in the form of a hash table data structure. Instead of computing a bounding box over the spatial domain and explicitly storing the entire space, spatial hashing assumes an implicit, infinite regular grid over the domain and maps each positional primitive (e.g., a point coordinate) to a uniformly-sized and axis-aligned cell within the grid. Each cell is uniquely addressed by unit coordinates and contains a user-specified number of primitives within its bounds [47] . These coordinates are used by the hash function to hash the cell into the hash table. Two or more cells may hash to the same address, resulting in collisions that must be resolved. To query a primitive, the primitive is mapped to its cell and the cell is hashed to an address in the hash table. From this address, the cell is searched, using more than one probe if a collision occurs. Typically, to exploit sparsity, only non-empty cells that contain computable primitive data (e.g., pixel intensity, RGB, or density) are inserted into the hash table. A query of an empty cell will return a negative result, as it doesn't exist in the table.
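The mapping from a point to its cell and from the cell to a table address can be sketched as follows. The XOR-of-primes hash and its three constants are the ones commonly attributed to Teschner et al. [48], but they are not given in this survey and should be treated as an assumption here. Storing colliding cells in a small per-address list is a simplification chosen for brevity, and all function names are hypothetical.

```python
import math

# Large primes commonly quoted for the hash of Teschner et al. (assumed here).
P1, P2, P3 = 73856093, 19349663, 83492791


def cell_of(point, cell_size):
    """Map a 3D point to the integer coordinates of its axis-aligned cell."""
    return tuple(math.floor(c / cell_size) for c in point)


def hash_cell(cell, table_size):
    """Hash integer cell coordinates to a table address."""
    x, y, z = cell
    return ((x * P1) ^ (y * P2) ^ (z * P3)) % table_size


def sh_insert(table, point, data, cell_size):
    """Store (point, data) under the point's cell; only non-empty cells are kept."""
    cell = cell_of(point, cell_size)
    bucket = table[hash_cell(cell, len(table))]
    for stored_cell, items in bucket:
        if stored_cell == cell:
            items.append((point, data))
            return
    bucket.append((cell, [(point, data)]))    # first primitive in this cell


def sh_query(table, point, cell_size):
    """Return the primitives stored in the point's cell, or [] if the cell is empty."""
    cell = cell_of(point, cell_size)
    for stored_cell, items in table[hash_cell(cell, len(table))]:
        if stored_cell == cell:               # distinguish colliding cells
            return items
    return []


table = [[] for _ in range(16)]
sh_insert(table, (0.2, 0.3, 0.4), "red", 1.0)
sh_insert(table, (0.8, 0.1, 0.9), "blue", 1.0)
```

Both points fall into cell (0, 0, 0), so a query anywhere inside that cell returns both primitives, while a query in a cell that was never inserted returns an empty result without any bounding box or explicit grid having been allocated.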
This canonical grid-based voxel hashing approach was introduced by Teschner et al. [48] as a CPU-based search structure for detecting colliding 3D tetrahedral cells in $\mathbb{R}^3$ domain space. Fig. 2 illustrates voxel-based spatial hashing for the case of a single thread inserting a single point coordinate and its accompanying data into a hash table. Several follow-up studies have since introduced GPU-based spatial hashing techniques based on this approach, and they are surveyed as follows.
Nießner et al. [49] extend the approach of Teschner et al. [48] with more sophisticated collision handling and a 3D voxel hashing scheme that is designed particularly for fast, real-time hash table updates on the GPU. An infinite uniform grid subdivides the world space into voxel blocks, each of which consists of $8^3$ voxels. The world coordinates of each voxel block are hashed as an address into a bucketed hash table. During an insertion, a block probes linearly through its assigned bucket for the first empty slot that it can occupy. If a free slot is found, then the block is inserted. Otherwise, if the bucket is already full, then overflow occurs and a linked list entry in the last slot points to the next free slot in another bucket of the hash table. The block then probes along this overflow chain to find the next empty slot. Due to interconnection among buckets, each hash entry contains an offset pointer to its neighboring bucket entry, which may not be adjacent in the table. A query operation conducts similar probing to find a particular block within the hash table. Additionally, lightweight GPU atomic primitives are used to coordinate data-parallel insertions and deletions of blocks, each assigned to an individual thread. While an entire bucket is locked for writing during an insertion into the bucket, no degradation in performance is observed for a high-throughput, real-time 3D scene reconstruction experiment. Moreover, by using a larger hash table size, the number of collisions is kept minimal, which prevents bucket overflows into other disparate buckets that can cause uncoalesced memory accesses among warp threads.
Kähler et al. [50] introduce a GPU-based hierarchical voxel block hashing technique that uses multiple hash tables in a hierarchy to store finer and finer resolutions of grid discretization for voxel blocks (cells). Initially, each block is hashed to an entry in a first-level hash table of coarse resolution. Then, if the voxels within this block are represented at a finer resolution (as indicated by a flag in each hash entry), the block is hashed again with a different hash function into a second-level hash table. This hierarchical hashing continues until an entry is reached that contains a pointer to the voxel block array, which stores the actual, individual block data. Atomic voxel block splitting and merging operations are supported to enable the addition or removal of hash table entries for finer or coarser resolutions, respectively.
Chentanez et al. [51] introduce a GPU-based spatial hashing variant of Teschner et al. [48] for detecting and deleting overlapping triangles on the surface of a 3D mesh volume, as vertices are advected (i.e., mesh refinement). In this work, the 3D bounding cells of triangles are inserted into a specially-arranged hash table using the coordinate-based hash function from [48]. The hash table consists of $n$ buckets, each with $m$ available slots ($n \cdot m$ entries), and the first $n$ entries of the table are reserved to store counts of the number of slots $j \le m$ that are occupied in each bucket. Thus, the total allocated size of the table is $n(1 + m)$. During an insertion of a cell $k$ into bucket $h(k) = b$, the thread assigned to cell $k$ first checks the occupancy count value for bucket $b$. If $b$ has open slots, then $k$ is inserted into the first available slot and the count for $b$ is atomically incremented. Otherwise, the thread examines the count for the next bucket $b + 1$ and inserts $k$ into the first open slot of $b + 1$, if possible, and so on until $k$ is successfully inserted. This is a modified collision resolution scheme whereby a bucket collision only occurs when the bucket is full and subsequent buckets are then linearly-probed for one that has an empty slot. During a cell query, the same linear probing over buckets is performed, beginning with the bucket to which the cell is hashed.
Note that, in this approach, thousands of other parallel threads are executing the same operation on different triangle cells, likely resulting in high contention for atomic writes for the bucket count values and worst-case linear probing sequences that induce branch divergence within warps. The extent of such divergence depends on the size m of each bucket and whether locality of reference is maintained among bucket entries when hashing spatially-nearby cells.
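The table layout and bucket-level linear probing described above can be sketched sequentially as follows. The sketch assumes integer keys with $h(k) = k \bmod n$ in place of the coordinate-based hash, wrap-around from the last bucket to the first, and no deletions (so a query may stop at the first bucket that is not full); these assumptions and the function names are not taken from [51].

```python
def make_table(n, m):
    """Flat table of size n * (1 + m): n occupancy counts, then n buckets of m slots."""
    return [0] * n + [None] * (n * m)


def bucket_insert(table, n, m, key):
    """Insert key into the first non-full bucket, starting at h(key); return that bucket."""
    b = key % n                               # stands in for h(k)
    for i in range(n):
        bucket = (b + i) % n
        count = table[bucket]                 # occupancy count of this bucket
        if count < m:
            table[n + bucket * m + count] = key
            table[bucket] = count + 1         # an atomic increment on the GPU
            return bucket
    return -1                                 # every bucket is full


def bucket_query(table, n, m, key):
    """Return the bucket that holds key, or -1 if key is absent."""
    b = key % n
    for i in range(n):
        bucket = (b + i) % n
        count = table[bucket]
        start = n + bucket * m
        if key in table[start:start + count]:
            return bucket
        if count < m:                         # key would have been placed here
            return -1
    return -1


n, m = 4, 2
table = make_table(n, m)
homes = [bucket_insert(table, n, m, k) for k in (1, 5, 9)]   # all hash to bucket 1
```

Keys 1 and 5 fill bucket 1, so key 9 spills into bucket 2. On the GPU, the read of the count and its increment would need to be a single atomic operation, which is the source of the contention noted above.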
Tumblin et al. [52] expand upon traditional perfect spatial hashing with a compact spatial hashing (CSH) variant that compacts a perfect hash table when it becomes too sparse, saving unused memory on the GPU. As a larger number of keys need to be hashed, a sufficiently large hash table must be allocated to construct a perfect hash among the keys. Often, this large table still contains many empty locations, resulting in a low occupancy and high compressibility, which is the ratio of available table locations to occupied locations. A compression function randomizes the original hash locations of each key and fits them within a smaller, compact hash table of size proportional to the number of keys divided by a desired load factor. Since perfect hashing is collision-free, this compaction inevitably induces collisions, which are handled in this work by a canonical quadratic probing open-addressing method in parallel. The goal of the compression function, thus, is to reduce the occurrence of collisions via random scattering of keys.
Fig. 2. Illustration of spatial hashing for inserting and storing the data of a point coordinate into a hash table. The world coordinate space is partitioned into voxel bounding boxes, each of which contains zero or more points. First, a point identifies its voxel. Then, the voxel hashes to a bucket of multiple entries, and probes for an empty entry into which it will be inserted.
Experimental results for an adaptive mesh refinement (AMR) task show that CSH becomes the faster method once the perfect hash table reaches 20 to 40 times the size of the compact hash table. At that point, the much larger memory footprint of perfect spatial hashing outweighs the extra costs (e.g., thread divergence and uncoalesced memory accesses) of resolving collisions and querying in CSH.
Duan et al. [53] present an exclusive grouped spatial hashing (EGSH) scheme that is optimized to compactly represent multi-dimensional domains that contain repetitive data (e.g., duplicate RGB or density values). The goal of this approach is to identify all groups of points that share the same data values and then, for each group, compress its points into a single group-wide value, avoiding the unnecessary storage of duplicates, which are significantly prevalent in some domains. This grouped hashing is performed over multiple iterations using multi-level hash tables until each group has been exclusively hashed into a unique table location.
Experiments on the GPU reveal that after several iterations of EGSH, the input domain becomes very sparse and has a rapid reduction in the amount of repetitive data (uncompressed groups). Both of these traits are highly suitable for the perfect spatial hashing of Lefebvre and Hoppe [38] (Section 3.1), which similarly provides constant-time random accesses. Thus, an optimized variant of EGSH performs exclusive grouped hashing for a small number, $k$, of iterations (generating $k$ levels of hash tables) and then applies the perfect spatial hashing on the remaining uncompressed input domain.
In summary, spatial hashing is most suitable for hashing tasks that contain the following properties:
- Duplicate or collision detection of spatial elements (e.g., querying overlapping voxels of intersecting triangles).
- Data, such as pixel RGB or intensity, that must be stored for each spatial element.
- Repetitive or aggregate data values for spatial elements within the same bounding voxel.
Separate Chaining
Separate chaining is a classic collision resolution technique that uses a linked list or node-based data structure to store multiple collided keys at a single hash table entry. Each hash table entry contains a pointer, or memory address, to a head node of a linked list, or chain. Each node in the linked list consists of a key, associated value (optional), and a pointer to the next node in the list, if any. If a single key hashes to an entry, then the linked list consists of a single node with a null pointer to the non-existent next node. Otherwise, if multiple keys collide and hash to the same location, then the linked list forms a chain of these keys, each represented by a separate node in the list. During a query operation, a key hashes to an entry in the table and then iterates through each of the nodes of the chain referenced at the entry, searching for a matching key. Fig. 3 demonstrates separate chaining for the case of a single thread inserting a single key into a hash table.
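A sequential sketch of this structure is given below. The `Node` class, the modulo hash, and the choice to insert new nodes at the head of the chain are assumptions made for the example; a GPU implementation would additionally need atomic pointer updates and a concurrent memory allocator.

```python
class Node:
    """One chain node: a key, its value, and a pointer to the next node."""
    def __init__(self, key, value, next_node=None):
        self.key, self.value, self.next = key, value, next_node


def chain_insert(table, key, value):
    """Update key if it is already chained at its entry; otherwise add a new head node."""
    slot = key % len(table)
    node = table[slot]
    while node is not None:
        if node.key == key:
            node.value = value
            return
        node = node.next
    table[slot] = Node(key, value, table[slot])


def chain_query(table, key):
    """Walk the chain at the key's entry; return the value or None."""
    node = table[key % len(table)]
    while node is not None:
        if node.key == key:
            return node.value
        node = node.next
    return None


def chain_length(table, slot):
    """Number of nodes chained at a table entry."""
    n, node = 0, table[slot]
    while node is not None:
        n, node = n + 1, node.next
    return n


table = [None] * 4
for k, v in ((3, "a"), (7, "b"), (11, "c")):   # all three keys hash to entry 3
    chain_insert(table, k, v)
```

After these insertions, entry 3 holds a chain of three nodes while the other entries remain null pointers. The length of a chain determines the number of dependent pointer dereferences a query must perform.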
In the context of parallel hashing, separate chaining must synchronize collisions during key insertions to ensure that the linked list data structures are properly allocated and constructed. Moreover, a dynamic memory allocation scheme must ensure that concurrent threads conducting insert operations properly synchronize their requests for new available blocks of memory (see [54], [55], and [56]). Similar design challenges exist for the deletion of keys, and the simultaneous execution of queries by threads must avoid reader-writer race conditions that result in faulty memory accesses to incorrect or deallocated nodes (keys).
Moazeni and Sarrafzadeh [57] and Misra and Chaudhuri [58] deploy some of the earliest lock-free, separate chaining-based hash tables on a GPU architecture. Using CUDA atomic operations (atomicCAS and atomicInc), both approaches support batches of concurrent query and insert operations, with only [58] also supporting delete operations. The approach of [57] achieves a significant execution time speedup for queries over counterpart lock-based and OpenMP-based CPU implementations. However, the lock-free table only attains significantly higher throughput (operations per second) than the OpenMP implementation for query-heavy batches (80 percent queries and 20 percent inserts). The approach of [58] demonstrates that a GPU lock-free hash table leverages a much higher degree of concurrency and throughput than a CPU implementation for both query-heavy and update-heavy workload batches. This performance increase is largely due to spreading the thread contention and atomic comparisons over multiple different hash locations, as threads work in SIMT fashion to conduct mixed operations at random locations.
Fig. 3. Illustration of separate chaining for inserting an integer key into the hash table. In this technique, each hash table location consists of a chain of pointers to all the keys that hash to that location. If no keys hash to a location, then that location contains a null pointer.
Ashkiani et al. [59] propose a dynamic slab hash table on the GPU that is built upon an array of linked lists, or slab lists, each of which represents a chain of one or more slabs, or memory units, that store collided keys. Each slab of memory is roughly the size of a warp memory transaction width (128 bytes). Thus, each warp is aligned to perform operations over the keys stored in a single slab, ensuring memory coalescing. As part of a warp-cooperative work sharing (WCWS) strategy, each warp maintains a work queue that stores all the arbitrary operations assigned to the different threads in the warp. In a round-robin fashion, each batch of the same operation type in the queue is fully and cooperatively executed by the threads. For a given operation type, all threads perform a warp-wide ballot instruction to denote the active threads that were assigned this operation. For each active thread, the entire warp cooperates to execute the active thread's operation.
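The cooperative execution pattern can be emulated sequentially as follows. This is a loose sketch under several assumptions: a single 32-entry slab stands in for a whole slab list, keys are stored without values or next-slab pointers, inserts are processed before queries, and the names `ballot`, `warp_search_slab`, and `process_work_queue` are hypothetical. The Python `ballot` merely mimics the bitmask that a warp-wide ballot instruction returns.

```python
WARP_SIZE = 32


def ballot(predicates):
    """Emulate a warp-wide ballot: bit i is set when lane i's predicate is true."""
    mask = 0
    for lane, flag in enumerate(predicates):
        if flag:
            mask |= 1 << lane
    return mask


def lowest_lane(mask):
    """Index of the lowest set bit in a non-zero ballot mask."""
    return (mask & -mask).bit_length() - 1


def warp_search_slab(slab, target):
    """Every lane compares its own slab entry with target; return the first hit or -1."""
    hits = ballot([slab[lane] == target for lane in range(WARP_SIZE)])
    return lowest_lane(hits) if hits else -1


def process_work_queue(ops, slab):
    """ops[lane] is ('insert', key), ('query', key), or None for an idle lane."""
    results = [None] * WARP_SIZE
    for op_type in ("insert", "query"):            # one pass per operation type
        active = ballot([op is not None and op[0] == op_type for op in ops])
        while active:
            lane = lowest_lane(active)             # the whole warp serves this lane
            key = ops[lane][1]
            if op_type == "insert":
                slot = warp_search_slab(slab, None)    # find an empty entry
                if slot >= 0:
                    slab[slot] = key
                results[lane] = slot
            else:
                results[lane] = warp_search_slab(slab, key)
            active &= active - 1                   # clear the served lane's bit
    return results


slab = [None] * WARP_SIZE
ops = [None] * WARP_SIZE
ops[3], ops[20] = ("insert", 42), ("insert", 5)
ops[7], ops[9] = ("query", 42), ("query", 99)
results = process_work_queue(ops, slab)
```

Because all lanes read adjacent entries of the same slab, each operation maps to one coalesced memory transaction, and the lanes follow the same control flow regardless of which operation each lane originally held.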
The performance of the dynamic slab hash table is compared to the static cuckoo hash table of Alcantara et al. [28] (Section 3.2.2), which must be rebuilt upon updates, and the semi-dynamic lock-free hash table of Misra and Chaudhuri [58]. For static bulk builds, cuckoo hashing consistently achieves a higher throughput of insertions per second, while slab hashing achieves higher query throughput only when the average number of slabs per slab list is less than 1 (i.e., approximately a single "node" list). Over all configurations, cuckoo hashing attains the better query throughput. For dynamic updates, slab hashing significantly outperforms cuckoo hashing, in terms of execution time, as the number of inserted keys increases. This is due to the rebuilding of the static cuckoo hash table each time a new batch is inserted. Additionally, slab hashing significantly outperforms lock-free hashing across different distributions of mixture operations and increasing numbers of slab lists (i.e., the size of the hash table).
In summary, separate chaining is most suitable for hashing tasks that contain the following properties:
- Need to dynamically resize the hash table upon new insertions and deletions.
- Make effective use of warp-wide work sharing in each allocated block of memory, or hash table bucket (slab hash table).
HASHING APPLICATIONS
The following section highlights a variety of real-world applications of GPU-based hashing techniques. These applications can be broadly divided into six categories, many falling within the domains of computer graphics and database processing. Many of the studies cited within each application area also introduce a novel hashing technique and are discussed in Section 3; the remaining studies strictly apply one of the hashing techniques.
Collision Detection. Teschner et al. [48] and Eitz and Lixu [60] use spatial hashing to detect real-time intersections between deformable objects in a scene and tetrahedral cells in 3D mesh volumes. Lefebvre and Hoppe [38] use perfect spatial hashing to detect collisions among surfaces of 3D objects. Pouchol et al. [61] use spatial hashing to model the interaction between solid objects (e.g., spheres) and fluid. Choi et al. [62] use perfect spatial hashing to detect interference between characters and obstacles in a free space mapped virtual environment. Chentanez et al. [51] use spatial hashing to detect and delete overlapping, or intersecting, triangles on the surface of 3D mesh volumes.
Adaptive Mesh Refinement (AMR). Tumblin et al. [52] use compact perfect hashing to search for neighboring cells in cell-based AMR for a shallow-water hydrodynamics simulation (e.g., AMR at the boundary of a water wave). Chentanez et al. [51] use spatial hashing to perform AMR on 3D mesh volumes, as vertices are advected in real-time.
Surface Rendering. Lefebvre and Hoppe [38] use perfect spatial hashing to render the color surfaces of 3D volumetric textures. Alcantara et al. [28], [29] (open-addressing cuckoo hashing), Garcia et al. [46] (open-addressing robin hood hashing), Nießner et al. [49] (spatial hashing), and Duan et al. [53] (spatial hashing) all perform real-time surface rendering and reconstruction of 3D volumes within voxelized grids. Bastos and Celes [63] use perfect hashing to perform isosurface rendering and morphing of adaptively sampled distance fields (ADFs). Kähler et al. [50] use spatial hashing to render voxelized 3D scene models of signed distance fields (SDFs).
Interactive Drawing and Painting. Lefebvre and Hoppe [38] use perfect spatial hashing to interactively paint over 3D volumetric textures. Garcia et al. [46] use open-addressing robin hood hashing to interactively draw on 2D surfaces, such as an atlas. Eyiyurekli and Breen [64] use spatial hashing to interactively edit and draw over 3D level-set surfaces.
Database Processing. Hetherington et al. [65] and Choudhury et al. [66] use open-addressing cuckoo hashing to cache most-recently used, or working set, queries in a key-value store. Karnagel et al. [67] use open-addressing linear probing to perform group-by and aggregation queries from a key-value store. Zhang et al. [68] and Breslow et al. [43] use open-addressing bucketized cuckoo hashing to accelerate queries and updates in key-value stores.
Similarity Search. Zhou et al. [69] use open-addressing robin hood hashing to extract the top-k most similar matches for query records in real-world document and relational datasets. Alcantara et al. [28] use open-addressing cuckoo hashing to perform geometric hashing, which is a form of 2D image matching. Pan et al. [70], Pan and Manocha [71], and Lukač and Žalik [72] each use locality-sensitive hashing to find the k approximate nearest neighbors (kANN) of query points within multi-dimensional record sets. Pouchol et al. [61] use spatial hashing to perform particle neighbor search within fluid and solid interaction simulations. Todd et al. [73] use multi-level bucketized hashing to identify genes with similar k-motifs, or DNA subsequences of length k.
ANALYSIS AND FUTURE WORK
This section analyzes the findings of the surveyed hashing techniques and identifies opportunities for future work. Table 1 enumerates a set of 17 hashing use case attributes and suggests the most-suitable or performant hashing technique(s) for each attribute. Due to the large number of possible subsets of use case attributes, a technique is only suggested for a single attribute. A practitioner can consult the table for a set of desired attributes, identify overlapping suggested techniques, and then investigate the suitability of these techniques for a specific task. Table 2 evaluates the most-suitable hashing techniques from Table 1 based on their ability to address optimal GPU performance criteria and utilize performant GPU hardware features. This evaluation assesses performance as it pertains to arbitrary access patterns for insertions and queries. Thus, special cases such as empty queries or ordered accesses are not considered unless a technique is specifically designed to perform well for such cases; for example, CoherentHash [46] achieves best-in-class throughput and memory coalescing among open-addressing techniques only when coherence exists among input elements and their hash table locations. The GPU performance criteria and hardware features are described as follows:
Sufficient Parallelism: The hashing technique experimentally demonstrates a sufficient throughput of insertion and query operations (operations per second) to hide global memory access latency.
Memory Coalescing: All the threads in a warp access addresses within the same fetched cache line of contiguous memory. These memory requests are necessary to execute the given SIMT instruction.
Control Flow: All the threads in a warp follow the same execution path for a SIMT instruction.
CPU↔GPU Data Transfers: The hash table is constructed and/or stored in CPU memory and then accessed from or copied onto the GPU via the interconnection bus (e.g., PCIe); thus, the hashing is subject to the bandwidth and latency of data transfers over this bus.
Shared Memory: Per-thread-block GPU memory space that is smaller in size than global DRAM memory, but offers faster memory accesses.
Atomic Operations: Lightweight hardware atomic functions, such as compare-and-swap (CAS), that guard and manage hash table entries during parallel insertions, probing evictions (e.g., in cuckoo hashing), and deletions.
Warp-wide Voting: Lightweight functions used by all the threads in a warp to communicate data and perform collaborative execution, such as when all warp threads query the hash table for the same key.
For arbitrary, random access patterns, CuckooHash2 cuckoo hashing [29] and WarpDrive [40] both offer best-in-class throughput performance among the surveyed hashing techniques (Section 3.2.2). This is due to the small constant number of probes necessary in both the best- and worst-case scenarios. In the worst-case insertion scenario of not finding an empty slot, the cuckoo hash table demonstrates fast reconstruction rates. In the presence of spatially-ordered access patterns, the CoherentHash robin hood hashing [46] achieves greater throughput than cuckoo hashing and is robust to higher load factors (Section 3.2.4).
In the ideal, "fast-path," scenario, an open-addressing technique only requires a single atomic CAS operation for an insertion and a single random global memory access for a query. However, in a typical scenario, a variable number of probes are needed to insert and query a key, often spanning non-contiguous regions of memory. This induces non-coalesced memory accesses and control flow divergence among threads of a warp. Thus, except for WarpDrive [40], most of the open-addressing techniques assessed in Table 2 cannot guarantee memory coalescing or uniform control flow.
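The two effects can be quantified with a small back-of-the-envelope model. The sketch below assumes 4-byte table entries and the 128-byte transaction width mentioned earlier for a warp; the function names and the simple lockstep model (a warp stays busy until its slowest thread finishes probing) are assumptions of this example rather than measurements from the surveyed work.

```python
LINE_BYTES = 128   # memory transaction width cited earlier for a warp
ENTRY_BYTES = 4    # assumed size of one hash table entry


def memory_transactions(slots, entry_bytes=ENTRY_BYTES, line_bytes=LINE_BYTES):
    """Count the distinct cache lines touched when each thread reads one slot."""
    return len({(slot * entry_bytes) // line_bytes for slot in slots})


def lockstep_efficiency(probe_counts):
    """Fraction of useful probe steps when a warp runs until its slowest thread ends."""
    return sum(probe_counts) / (len(probe_counts) * max(probe_counts))


coalesced = memory_transactions(range(32))             # 32 adjacent slots
scattered = memory_transactions(range(0, 1024, 32))    # 32 slots, one per cache line
divergent = lockstep_efficiency([1] * 31 + [4])        # one thread needs four probes
```

Under these assumptions, 32 adjacent slots are served by a single transaction, whereas 32 scattered slots require 32 transactions. If one thread in a warp needs four probes while the others need one, only 35 of the 128 lockstep probe steps do useful work.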
The combination of radix sorting and binary searching is a very effective alternative to searching via hashing when access patterns are ordered or the data is already in near-sorted order prior to sorting. However, for interactive use, this approach naively requires a re-sort of a larger array each time new data is added. Additional research is needed to investigate more-efficient data-parallel schemes for accommodating dynamic data. If data will be updated at run-time, then SlabHash [59] offers best-in-class dynamic hashing, achieving a significant increase in throughput over cuckoo hashing, which must be reconstructed after each batch of updates (Section 3.4). Moreover, as seen in Table 2, this technique addresses each of the criteria for optimal GPU performance. Further research is needed to compare the performance of slab hashing with that of CoherentHash robin hood hashing [46] in the presence of coherent access patterns.
Table caption: For each attribute, the most suitable or best-performing technique from one or more of the four hashing categories is denoted. Additional details regarding a technique can be found within the section of its encompassing hashing category.
When data must be stored and accessed off-device in CPU memory, the use of ticketing, or key bit signatures, is beneficial to save expensive accesses for obvious non-matches during probing/querying. Future hashing approaches should assess the performance benefits of ticketing even when off-device accesses do not occur. Maintaining the ticketing structure in shared memory appears to be particularly beneficial, as demonstrated by the StadiumHash open-addressing technique [44].
Regardless of the data use case, shared memory should be leveraged as much as possible to perform warp operations and faster memory accesses (not necessarily coalesced). This is facilitated by sizing buckets to the size of a thread block, such as in CuckooHash1 cuckoo hashing [28]. If data must be accessed outside of shared memory, warps should be modeled as collaborative processing units the size of a memory transaction. Each thread is assigned to an entry within the loaded cache line and all threads then compare their entries (possibly empty) to the query or insert key via a warp-wide voting function. CuckooHash1 [28], StadiumHash [44], SlabHash [59], and WarpDrive [40] make particularly good use of shared memory and warp-wide voting (Table 2).
Perfect hashing (Section 3.1), as implemented by PerfectHash [38], avoids collision resolution but is not well-suited for updates, since the hash table must be reconstructed on the CPU and is therefore bound by PCIe bandwidth. A trade-off arises: either use multiple separate hash tables (and multiple probes), or use a single addressable hash table and construct the offset table, which is the primary bottleneck during construction. Further research towards constructing the offset table in a data-parallel fashion on the GPU is needed to make perfect hashing a more dynamic, interactive solution.
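To illustrate why offset-table construction is difficult to parallelize, the following sketch builds a small offset table greedily. It assumes a two-level scheme in which the final slot of a key k is (h0(k) + offsets[h1(k)]) mod m; the modular hash functions, the largest-bucket-first order, and the function names are illustrative assumptions and not necessarily the construction used in [38].

```python
def build_offset_table(keys, m, r):
    """Greedy build; returns (offsets, table), or None if no fit is found.

    m is the hash table size and r is the offset table size.
    """
    buckets = [[] for _ in range(r)]
    for k in keys:
        buckets[k % r].append(k)        # h1(k) = k mod r (assumed)

    offsets = [0] * r
    table = [None] * m
    # Place the largest buckets first, since they are the hardest to fit.
    for b in sorted(range(r), key=lambda b: -len(buckets[b])):
        if not buckets[b]:
            continue
        for offset in range(m):
            # h0(k) = k mod m (assumed), shifted by the candidate offset.
            slots = [(k % m + offset) % m for k in buckets[b]]
            distinct = len(set(slots)) == len(slots)
            if distinct and all(table[s] is None for s in slots):
                for k, s in zip(buckets[b], slots):
                    table[s] = k
                offsets[b] = offset
                break
        else:
            return None                  # this bucket fits nowhere
    return offsets, table

def perfect_lookup(offsets, table, key):
    """Single probe: no collision resolution is needed at query time."""
    m, r = len(table), len(offsets)
    return table[(key % m + offsets[key % r]) % m] == key

offsets, table = build_offset_table([1, 5, 9, 14], m=5, r=3)
present = [perfect_lookup(offsets, table, k) for k in (1, 5, 9, 14)]
absent = [perfect_lookup(offsets, table, k) for k in (2, 3)]
```

Each bucket's offset depends on the slots already claimed by earlier buckets, and this sequential dependency is what a data-parallel construction would need to break.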
GPU-specific solutions should also be explored for hopscotch hashing and path hashing, two hashing techniques that have demonstrated promising CPU-based performance and cache-line utilization. The hopscotch hashing of Herlihy et al. [74] is a form of open-addressing that performs a sequence of probes and displacements to insert a key-value pair within a fixed neighborhood around its hash location. The path hashing of Zuo and Hua [75] is designed for use in devices with next-generation non-volatile memory (NVM), which possesses a high write latency. This technique performs fewer hash insertion writes than standard open-addressing techniques by immediately storing colliding keys in an upper-level buffer and non-colliding keys in the lower-level hash table.
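The probe-and-displace insertion of hopscotch hashing can be sketched as follows. This is a simplified, sequential illustration: the neighborhood size `H`, the modular hash, and the class name are assumptions, and the per-bucket hop-information bitmaps and concurrency control of the original design [74] are omitted.

```python
H = 4  # assumed neighborhood size

class HopscotchSet:
    """Simplified hopscotch hash set (hypothetical, single-threaded)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity

    def _home(self, key):
        return key % self.capacity

    def _dist(self, start, slot):
        """Forward (wrapping) distance from start to slot."""
        return (slot - start) % self.capacity

    def contains(self, key):
        # A key can only reside within H slots of its home location.
        home = self._home(key)
        return any(self.slots[(home + d) % self.capacity] == key
                   for d in range(H))

    def insert(self, key):
        if self.contains(key):
            return True
        home = self._home(key)
        # Step 1: linear probe for the first empty slot.
        for d in range(self.capacity):
            free = (home + d) % self.capacity
            if self.slots[free] is None:
                break
        else:
            return False  # table is full
        # Step 2: hop the empty slot back until it is within H of home.
        while self._dist(home, free) >= H:
            for back in range(H - 1, 0, -1):
                cand = (free - back) % self.capacity
                cand_key = self.slots[cand]
                # Move cand_key only if `free` is inside its neighborhood.
                if self._dist(self._home(cand_key), free) < H:
                    self.slots[free] = cand_key
                    self.slots[cand] = None
                    free = cand
                    break
            else:
                return False  # no legal displacement; a resize is needed
        self.slots[free] = key
        return True

s = HopscotchSet(16)
for k in (0, 16, 32, 3):   # 0, 16, 32 share home slot 0; 3 has home slot 3
    s.insert(k)
ok = s.insert(48)          # home 0 is crowded: key 3 is displaced to slot 4
layout = s.slots[:5]
```

Because every key stays within a fixed-size neighborhood, a lookup touches a bounded, contiguous range of slots, which is the property that makes the technique attractive for cache-line (and potentially memory-transaction) utilization.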
Finally, CompactHash [52] offers the useful feature of downsizing a perfect hash table that contains a significant number of unused entries, which arises often in spatial hashing. This comes with the trade-off of new hash collisions that must be resolved. Further research should assess the viability of this approach for other types of hash tables and varying load factors.
CONCLUSION
This paper provides a survey of parallel hashing techniques for GPU architectures. These techniques are categorized according to the method of collision resolution: perfect hashing, open-addressing, spatial hashing, and separate chaining. Each of the surveyed studies offers various design choices and patterns that help inform a more-general set of best practices for performant hashing on the GPU. These best practices and the most-suitable hashing techniques for different use-case factors are analyzed and used to reveal opportunities for future research.
Table caption: The techniques are grouped by category and represent the subset of techniques that are identified as highly-suitable for different use-case attributes in Table 1.
