In the "Big Data" era, fast lookup of keys in a key/value store is a ubiquitous operation. We have designed a near memory accelerator combining simple hardware building blocks to accelerate lookup in a hash table based key/value store. We report on the co-design of hardware and software to accomplish fast lookup using open addressing. The accelerator implements a batch get command to look up a set of keys in a single request. Using an FPGA emulator, we evaluate the performance of a query workload under a comprehensive range of conditions such as hash table load factor (fill) and query key repeat distribution (likelihood of a key to reappear in a query workload). We emulate two memory configurations: Hybrid Memory Cube (or High Bandwidth Memory), and Storage Class Memory. Our design shows a speedup of 2.9X to 12.8X over conventional CPU lookup, depending on workload characteristics.
INTRODUCTION
Key/value lookup has become a dominant component of data analytics workloads. A profusion of key/value store implementations has emerged such as Memcached [9], Riak [3], and Redis [2], to name just a few that are in widespread use. Applications include URL or more general web object caching, log data analysis, de-duplication, and uses related to e-commerce. Additional use cases include scientific data such as k-length genomic sequences [4, 10] or keys derived from hashes of large objects such as SHA-1 or another cryptographic hash.
Key/value lookup is an appealing function to be resolved in a near data processor, as in the absence of a true content addressable memory, key/value stores use indexing data structures that potentially search many memory locations to retrieve the value associated with a key. Additionally, a random sequence of lookup requests is likely to make poor use of processor caches for a realistic index size, requiring multiple memory accesses that bring unneeded bytes into the CPU cache hierarchy. The particular indexing data structure applied to a key/value store depends on the use case: in general, if ordered or range accesses are needed, tree-based indexes are used such as red-black trees or B-trees, whereas for keys that don't require ordering, hash tables are more appropriate. For near memory acceleration, the hash table potentially has greater benefit as it can avoid pointer chain traversal.
Hash table lookup accelerator
In this work, we focus on hash table organization and lookup in a near memory hardware lookup accelerator. To benefit from sequential, streaming memory access, we use open addressing [8] in which collisions are resolved by placing colliding keys in near proximity. We have designed a collection of general purpose, pipelineable hardware primitives as building blocks for hash table lookup. In this application, the lookup accelerator combines hardware hash units, gather scatter units [11], comparators, and FIFO communication blocks to achieve 9.13 M to 64.3 M lookups/s for 64-bit keys in a 1.25 GHz memory-co-located processing pipeline. As measured by our FPGA emulator, the hardware accelerated batch get operation has a speedup of 2.9X to 12.8X over software running on a 2.57 GHz CPU core.
Related work
Efficient associative indexing and optimizing management of key/value stores in both DRAM and Flash have been studied extensively. A hardware hash table design is proposed in [15] and evaluated on an FPGA-based network adapter functioning as a Memcached appliance. Up to 13 million requests per second are demonstrated to the key/value store through the network interface. Our implementation handles both cache and complete hash tables. [21] studies opportunities for the join data analytics operation in a near memory processor and concludes that locality, granularity of access, and the stacked memory device internal architecture all play a role in optimizing join processing. A modified Redis [13] takes advantage of Storage Class Memory (SCM) by using DRAM to hold the index and SCM to hold a minimal subset of the dataset. Our implementation doesn't use an index and stores the entire table in either DRAM or SCM. An in-memory pointer chasing accelerator is proposed in [14] that leverages a logic layer in 3D stacked memory to improve parallelism in the presence of serial accesses through linked data structures with a split-transaction throughput acceleration approach, and performing virtual-to-physical translation in memory with a region-based page table. We use a pipelined design with no linked lists. [18] proposes compact indexing data structures, partial-key cuckoo hashing and entropy-coded tries, to reduce the size of a key/value entry. Our implementation uses a linear probe sequence to optimize for sequential data stream bursts. A Flash-based key/value store is accelerated in [22] by a pipelined hardware accelerator to improve query performance and energy efficiency by double hashing the key and caching the hash table index in DRAM, achieving a query rate of 20 million queries/s.
The Bluecache project [25] for a distributed key/value cache in Flash memory also uses hardware acceleration, places the hash table index in DRAM, and employs multiple optimizations in the hashing algorithms to reduce the size of a key/value entry. [20] describes a hardware algorithm to count k-mers in bioinformatics datasets stored on the Hybrid Memory Cube (HMC) using Bloom Filters.
Workload analysis of a Facebook Memcached deployment is reported in [5] using traces with over 284 billion requests from five different use cases. Their analysis reveals a 30:1 get:put ratio for significant use cases. We have used this analysis to set workload parameters to evaluate our lookup accelerator performance. In [23] the Quartz NVM emulator is implemented on several Intel Xeon processor architectures by injecting software delays to mimic latencies of proposed NVMs. Their approach shows an error of 0.2% to 9% on the workloads evaluated.
Contributions
Our work benefits from these prior designs and findings. Our contributions are
• designing simple, re-usable, interconnecting hardware building blocks suitable for near memory processing
• adapting open addressing, a highly optimized hash function, and multiple insertion schemes to use near memory hardware lookup
• assembling the building blocks into a hardware pipeline for optimal throughput
In another variation for the use case that the k/v store is itself a cache of a larger database, a set associative scheme is used, and space is reserved at each table bucket for a fixed number of keys. Keys that hash to the same bucket are added into that bucket until the bucket is full, after which an arbitration scheme determines a key to evict to make room for a new colliding key.
Open address hashing
In contrast to the above methods, our design uses open addressing. In open address hashing, a collision is resolved by searching adjacent table buckets for an available position. The adjacent entries searched are called the "probe sequence," and the maximum number of entries scanned is called the probe sequence length. Open addressing differs from the set associative approach: for set associative, space is reserved at every bucket in the table for a fixed number of entries, whereas in open addressing, the table bucket contains one entry, and a different bucket is co-opted in the case of a collision. The probe sequence may be linear, strided, or assembled by some other algorithm. On insert of a key K1, if the bucket is already filled by a key K0, the probe sequence is scanned and K1 is inserted into an empty bucket. If there isn't an empty bucket up to the maximum probe sequence length (which in some algorithms may be the entire
Figure 1. The table has N buckets. One of the keys ("Mike") can be inserted directly into the hash address bucket. The other two keys collide at the hash address, and so must be inserted in the next available location near the hash address. Thus "John" has probe sequence length 3, and "Sue" has a length of 2.
Open addressing is attractive for near memory hash tables as it allows for the probe sequence bytes to be streamed sequentially from memory. It also enables a constant rate, deterministic pipeline of lookups. However, this technique requires a very high quality hash algorithm to spread the keys as evenly as possible. Our hash algorithm is adapted from a high quality hash function "spooky hash" developed by Bob Jenkins [16] that has been modified for optimized hardware implementation.
Open addressing is most beneficial for small keys and values as the table space must be reserved at the outset for the maximum expected number of entries. In this work we focus on accelerating the linear probe access pattern of lookup consisting of 64-bit keys and 32-bit values. The 32-bit value field is sufficient in size to handle pointers to data objects on our 32-bit evaluation platform. On a 64-bit architecture, the value field would typically be 64 bits.
Our lookup pipeline is compatible with multiple open addressing insertion algorithms, notably linear and Robin Hood algorithms. In linear insertion, collisions are resolved by searching subsequent table locations until an empty bucket is found, and then inserting in the first available location. Linear insertion is susceptible to clustering in the table, and leads to large variance in key lookup time and a long tail distribution of probe lengths.
Robin Hood hashing [7] uses a simple variation of linear insertion to avoid clustering. If the probe length of incoming key K1 is greater than the probe length of a key K2 in the probe sequence, K1 displaces K2, and a new location is searched to insert K2. By swapping an entry with one that has a shorter probe sequence length, Robin Hood hashing reduces the maximum probe sequence length and also substantially reduces the long tail in probe sequence length distribution compared to linear insertion. In our experiments, hash tables are built using the Robin Hood hashing algorithm.
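The two insertion schemes can be compared with a minimal sketch. Everything here is illustrative rather than the paper's implementation: `mix64` is a stand-in mixer (a splitmix64-style finalizer, not the modified SpookyHash), the table is a Python list, keys are assumed unique, and displacement is counted from zero at the hash address bucket.

```python
MASK64 = (1 << 64) - 1

def mix64(x):
    """Stand-in 64-bit mixer (splitmix64-style); NOT the paper's hash function."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

def linear_insert(table, key, value):
    """Linear insertion: take the first empty bucket at or after the hash address."""
    n = len(table)
    home = mix64(key) % n
    for dist in range(n):
        idx = (home + dist) % n
        if table[idx] is None:
            table[idx] = (key, value, dist)  # dist = displacement from home
            return
    raise RuntimeError("table full")

def robin_hood_insert(table, key, value):
    """Robin Hood insertion: the entry with the longer displacement keeps the bucket."""
    n = len(table)
    idx = mix64(key) % n
    cur = (key, value, 0)
    for _ in range(n):
        slot = table[idx]
        if slot is None:
            table[idx] = cur
            return
        if slot[2] < cur[2]:
            # The resident entry is closer to its home: swap, then re-place it.
            table[idx], cur = cur, slot
        idx = (idx + 1) % n
        cur = (cur[0], cur[1], cur[2] + 1)
    raise RuntimeError("table full")

def max_displacement(table):
    return max(entry[2] for entry in table if entry is not None)

def lookup(table, key, window):
    """Scan a fixed window of buckets starting at the hash address."""
    n = len(table)
    home = mix64(key) % n
    found = None
    for i in range(window):
        slot = table[(home + i) % n]
        if slot is not None and slot[0] == key:
            found = slot[1]
    return found

# Build two tables at roughly 88% load from the same keys.
N_BUCKETS = 1024
KEYS = list(range(1, 901))
linear_table = [None] * N_BUCKETS
robin_table = [None] * N_BUCKETS
for k in KEYS:
    linear_insert(linear_table, k, k * 2)
    robin_hood_insert(robin_table, k, k * 2)

print("max displacement, linear:    ", max_displacement(linear_table))
print("max displacement, Robin Hood:", max_displacement(robin_table))
```

A fixed-window lookup needs a window of `max_displacement(table) + 1` buckets to be sure of reaching every stored key, which is why a shorter maximum displacement matters for a scheme that always scans the full window.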
Batched hash table lookup
Most key/value store implementations provide a standard interface to insert a key and associated value and to search for a key, retrieving the value if the key is found. With a near memory lookup unit, synchronization between the main CPU core and lookup accelerator incurs a measurable cost, at minimum a cache line write to send the command and a subsequent cache line read to get the value or "not found" indicator. This overhead accrues for each request to the accelerator. It is clearly advantageous to increase the granularity of each transaction between the CPU core and lookup accelerator. To reduce per-request synchronization overhead, our implementation provides a batch get command. In the batch get variant, the CPU core sends the accelerator a single command with the address of a batch of keys to look up along with an address to place the batch of results. The lookup accelerator processes the entire batch and then writes a completion event to the CPU core.
LOOKUP ACCELERATOR OVERVIEW
Figure 2 shows the high level architecture of interconnected near memory building blocks from which the hash table lookup accelerator is built. The upper half of the memory subsystem contains the memory units. We assume an HMC- or High Bandwidth Memory (HBM)-like memory consisting of a collection of memory channels accessed through the memory interconnect. The CPU can access memory through the memory interconnect independently of the blocks on the lower half of the subsystem. The bottom part of the diagram shows the hardware building blocks available for near memory processing. The CPU communicates with the accelerator through the host control interface, which consists of a small set of memory-mapped registers visible to the CPU. Through this interface and the stream interconnect, control messages between the CPU and accelerator units are exchanged. Messages are used to send commands, return status, and synchronize the handoff of shared data.
The controllers run simple programs that can offload the CPU in managing smaller tasks and sending commands to accelerator units. Other building blocks include Load-Store Units (LSU) used for gather-scatter, hash units, and compare units. A static RAM (SRAM) scratchpad is used to stage blocks of inputs and outputs between memory channels and processing units. The interconnect is shown as
The lookup accelerator consists of a collection of units configured in a pipeline. As shown in Figure 3, multiple lookup accelerators can be instantiated in logic near memory and share access to the main memory channels. In our implementation, a single lookup pipeline is used and the main CPU performs all control functions without the use of a controller unit.
Figure 4 shows in greater detail the pipeline implementing a batch get command. When Load-Store Unit LSU0-R receives a start command from the CPU core, it transfers the keys with DMA to the split unit. One stream goes to the hash unit, which generates a stream of hash indices to LSU1-R, a gather unit. LSU1-R reads one probe sequence length worth of buckets starting at each of the indexed table entries (labeled Buckets in the diagram) and streams them to a compare-select unit. The latter block matches the original stream of keys from the splitter with the probe sequence from LSU1-R to look for a matching entry. The matched entry is forwarded to LSU1-W, which writes the 4-byte value to the scratchpad. When the batch is completed, LSU1-W notifies the CPU core, and the CPU core reads the results from the scratchpad. The pipeline can process a maximum of one bucket every two clock cycles. Two clock cycles are required because the data path to the compare-select unit is 8 bytes and a bucket is 16 bytes. A 16-byte data path would reduce the delay to a single clock cycle. When waiting for a memory access, the pipeline will stall as needed. The delivered lookup rate depends on the memory characteristics and probe sequence length.
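As a rough functional model of the batch get pipeline (split, hash, gather, compare-select, write), the stages can be written as chained Python generators. This is a software sketch under stated assumptions, not the hardware design: the 16-byte bucket is assumed to hold an 8-byte key, a 4-byte value, and 4 bytes of padding; key 0 is assumed to mark an empty bucket; `NOT_FOUND` is a hypothetical result code; and `mix64` is a stand-in for the hardware hash unit.

```python
import struct

BUCKET = struct.Struct("<QI4x")  # assumed layout: 8-byte key, 4-byte value, 4 pad bytes
EMPTY_KEY = 0                    # assumed sentinel for an empty bucket
NOT_FOUND = 0xFFFFFFFF           # hypothetical "not found" result code
MASK64 = (1 << 64) - 1

def mix64(x):
    """Stand-in for the hardware hash unit (splitmix64-style mixer)."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

def build_table(pairs, n_buckets):
    """Fill a flat byte array by linear insertion; return it with the max probe length."""
    mem = bytearray(n_buckets * BUCKET.size)
    max_probe = 1
    for key, value in pairs:
        idx = mix64(key) % n_buckets
        for probe in range(1, n_buckets + 1):
            stored_key, _ = BUCKET.unpack_from(mem, idx * BUCKET.size)
            if stored_key == EMPTY_KEY:
                BUCKET.pack_into(mem, idx * BUCKET.size, key, value)
                max_probe = max(max_probe, probe)
                break
            idx = (idx + 1) % n_buckets
        else:
            raise RuntimeError("table full")
    return mem, max_probe

def hash_stage(keys, n_buckets):
    """Hash unit: key stream in, bucket index stream out."""
    for key in keys:
        yield mix64(key) % n_buckets

def gather_stage(mem, indices, window, n_buckets):
    """Gather unit: read `window` buckets starting at each index."""
    for idx in indices:
        yield [BUCKET.unpack_from(mem, ((idx + i) % n_buckets) * BUCKET.size)
               for i in range(window)]

def compare_select_stage(keys, windows):
    """Compare-select unit: scan the whole window, emit the matching value."""
    for key, buckets in zip(keys, windows):
        result = NOT_FOUND
        for stored_key, value in buckets:
            if stored_key == key:
                result = value
        yield result

def batch_get(mem, keys, window):
    """One command for a whole batch of keys; returns the batch of results."""
    n_buckets = len(mem) // BUCKET.size
    keys = list(keys)  # split unit: one copy feeds the hash, one feeds the compare
    indices = hash_stage(keys, n_buckets)
    windows = gather_stage(mem, indices, window, n_buckets)
    return list(compare_select_stage(keys, windows))

pairs = [(k, (k * 3) % 1000) for k in range(1, 201)]
table_mem, probe_window = build_table(pairs, 256)
results = batch_get(table_mem, [1, 200, 9999], probe_window)
```

The model always scans the full window for every key, matching the fixed-length scan described for the hardware. It says nothing about timing or stalls.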

EXPERIMENTAL EVALUATION
Emulator
To study prototype hardware logic for the lookup accelerator components and to evaluate throughput, we have implemented the lookup accelerator in an FPGA. The hardware runs within the Logic in Memory Emulator (LiME) [1, 19], a hardware/software infrastructure that uses the Xilinx ZC706 development board with Zynq 7045 System-on-Chip (SoC). The Zynq chip consists of two main sections, one with fixed logic containing two 32-bit ARM cores, and the other programmable logic (FPGA). The lookup accelerator is mapped to the FPGA section, which is used to emulate a logic layer located near memory as depicted in Figure 3. Memory channels are mapped to a common 1 GB DDR3 memory shared between the FPGA logic and the ARM processor cores. This memory holds both application code and the key/value table. The ARM processor cores have separate L1 (32 KB, 4-way set-associative instruction and data) and shared L2 (512 KB, 8-way set-associative) caches. For the lookup accelerator experiment, the software runs "bare metal" on an ARM core to populate the table, generate the query workload, and communicate with the hardware lookup accelerator to process batch get requests.
Since the FPGA logic on the emulation board is limited to a maximum clock frequency of about 200 MHz, the CPU is slowed to run at a comparable frequency. Higher clock frequencies for the CPU and accelerator are emulated by scaling the actual frequencies by a factor of 20. Table 1 illustrates the correspondence between the actual and emulated specifications. Scaling the actual DRAM timing usually gives a latency that is too low for the target system, so memory transactions are delayed by the amount needed to reach the emulated latency. In this example, a delay equivalent to 91 ns must be added to each memory transaction to reach the desired memory latency. The ARM cores can run at a clock frequency ranging from 1 MHz to 800 MHz. For these experiments, the ARM runs at 128.6 MHz, which, multiplied by the scaling factor of 20, represents the emulated CPU frequency of 2.57 GHz. The lookup accelerator in FPGA logic runs at an actual frequency of 62.5 MHz, which, scaled by the same factor of 20, emulates a frequency of 1.25 GHz. The emulation infrastructure contains programmable delay units in the FPGA fabric to emulate different memory latencies. The delay units are programmable over a range of 0-262 us in 0.25 ns increments. The emulated memory subsystem has multiple memory channels and is capable of accepting up to 16 concurrent memory requests of various sizes. In this evaluation, the CPU issues 32-byte requests corresponding to a cache line, and the accelerator issues up to 128-byte requests.
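The clock scaling is simple arithmetic, sketched below. The frequencies and the factor of 20 come from the text; the function names are illustrative, and `buckets_scanned` in the last helper is a hypothetical input, since the probe sequence length for a given load factor is not stated here. The pipeline bound uses the stated limit of one bucket every two clock cycles and ignores memory stalls.

```python
SCALE = 20  # emulation scaling factor stated in the text

def emulated_hz(actual_hz, scale=SCALE):
    """Emulated clock frequency for a given actual clock frequency."""
    return actual_hz * scale

cpu_hz = emulated_hz(128.6e6)   # ARM core: about 2.57 GHz emulated
accel_hz = emulated_hz(62.5e6)  # FPGA pipeline: 1.25 GHz emulated

# One bucket per two clock cycles is the stated pipeline maximum.
peak_buckets_per_s = accel_hz / 2

def pipeline_bound_lookups_per_s(buckets_scanned):
    """Upper bound on lookups/s if each lookup scans `buckets_scanned` buckets
    and memory never stalls the pipeline (an idealization)."""
    return peak_buckets_per_s / buckets_scanned

print(f"emulated CPU clock:   {cpu_hz / 1e9:.3f} GHz")
print(f"emulated accel clock: {accel_hz / 1e9:.3f} GHz")
print(f"peak bucket rate:     {peak_buckets_per_s / 1e6:.0f} M buckets/s")
```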
Experiment design
Our experiment fills the key/value table with a scientific data set consisting of k-length genomic sequences, i.e., k-mers. A 32 million entry table is the largest size that fits in the emulator's program memory. In open addressing, the entire table is allocated on initialization. Our experiments fill the table to varying degrees, which is the table's load factor. The load factor determines the probe sequence length, which in turn determines the lookup accelerator's bandwidth requirement.
A query either finds the key or determines that the requested item is not present in the table. The hit ratio is the probability that the queried item is in the table. A low hit ratio results in the longest probe sequence length being searched, whereas a high hit ratio means the item is likely to be found with average case search length. In our design, the lookup accelerator always scans up to the longest probe sequence length in the table, but software can short circuit a search as soon as the item is found. By measuring performance at different hit ratios, the relative advantages of hardware acceleration vs. software are illuminated.
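The effect of hit ratio on the amount of scanning can be illustrated with a small expected-value model. It counts buckets examined only, not time, and the probe lengths passed in are hypothetical inputs chosen for illustration. It assumes, as described above, that a software miss scans the full probe sequence while a hit stops at the matching bucket.

```python
def accel_buckets_scanned(max_probe):
    """The accelerator always scans the longest probe sequence in the table."""
    return max_probe

def soft_buckets_scanned(hit_ratio, avg_hit_probe, max_probe):
    """Expected buckets per query in software: hits stop early, misses scan everything."""
    return hit_ratio * avg_hit_probe + (1.0 - hit_ratio) * max_probe

# Hypothetical table: hits found after 3 buckets on average, longest sequence 20.
for hit_ratio in (0.9, 0.1):
    soft = soft_buckets_scanned(hit_ratio, avg_hit_probe=3, max_probe=20)
    accel = accel_buckets_scanned(max_probe=20)
    print(f"hit ratio {hit_ratio:.0%}: software {soft:.1f} buckets, accelerator {accel}")
```

Under this model a low hit ratio pushes the software scan length toward the accelerator's, so software loses most of its short-circuit advantage.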
Another differentiating factor of table lookup is a key's popularity, i.e. how often is the same item looked up. For some applications such as social media [5] , a particular topic or URL may dominate searches for a period of time, whereas other use cases may have a more random key repeat frequency. The high repeat frequency scenario is modeled by a Zipfian distribution, and the random case uses the uniform distribution. By employing a high quality hash algorithm we effectively randomize the key's hash index. We sweep the load factor across a large range, resulting in a full range of probe sequence lengths. The key repeat frequency models two representative query patterns. These techniques enable us to model a large range of workloads using the single k-mer data set.
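A minimal sketch of the two query patterns, using only the Python standard library. The Zipf exponent `s=1.0`, the key population, and the batch size here are assumptions for illustration; the text does not state the parameters used in the experiments.

```python
import random
from collections import Counter

def uniform_queries(keys, n_queries, rng):
    """Every key is equally likely to be queried."""
    return rng.choices(keys, k=n_queries)

def zipf_queries(keys, n_queries, rng, s=1.0):
    """Key at popularity rank r is queried with probability proportional to 1 / r**s."""
    weights = [1.0 / (rank ** s) for rank in range(1, len(keys) + 1)]
    return rng.choices(keys, weights=weights, k=n_queries)

rng = random.Random(0)
keys = list(range(1000))
uniform = uniform_queries(keys, 10_000, rng)
zipfian = zipf_queries(keys, 10_000, rng)

top_uniform = Counter(uniform).most_common(1)[0][1]
top_zipfian = Counter(zipfian).most_common(1)[0][1]
print("most repeated key, uniform:", top_uniform, "queries")
print("most repeated key, Zipfian:", top_zipfian, "queries")
```

In the Zipfian stream a handful of keys account for a large share of the queries, which is the case where a CPU cache can help the software lookup.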
A final parameter most relevant to near memory processing is the memory latency. Memory latency may vary from 10 ns for SRAM to microseconds for slower memory technology. The experiments use two different latency configurations. The 85 ns read with 106 ns write case models HMC/HBM memory latency as measured on a Gen 1 4 GB HMC with 128 byte packets. The 200 ns read with 400 ns write case models the low end of proposed SCM latency projections.
The query block size is 1024 keys. Table 2 summarizes the parameters that are swept in this experimental evaluation.
In the evaluation, we compare three lookup algorithms:
• "Accel," the near memory hardware lookup accelerator as described in Section 3,
• "Soft," a software version of the hardware lookup algorithm using the identical open addressing and hash algorithm. Unlike the hardware, the software algorithm terminates probe sequence search as soon as a key has been found.
• "STL," a hash table that uses the Standard Template Library (STL) unordered map.
Since STL uses more memory than Soft, the load factor could only be measured up to 70% for STL. Timings are measured for the batch get operation as issued by the CPU core. The software runs on the emulated 32-bit 2.57 GHz ARM processor.
Experiment findings
Results from running the k-mer query workload on the emulated near memory lookup accelerator and associated CPU core indicate significant performance benefit to the proposed lookup accelerator. Figure 5 graphs the effect of increasing hash table occupancy (i.e. load factor) on the lookup rate for the three versions. For a sparsely filled table, hardware acceleration gives 64.3 M lookups/s. The trend line decreases linearly with increasing occupancy, with 9.13 M lookups/s at 90% load factor. Both software versions show slightly decreasing performance ranging from 5 M at light occupancy to 2.6 M at 90% for Soft and slightly less for STL. The graph shows a workload with uniform key query frequency. Zipfian is not shown since the accelerator trend line is nearly identical to uniform. Hardware speed for Zipfian is the same as for uniform since the hardware pipeline scans the entire max probe sequence length. Software lookup rate for Zipfian goes from a high of 5.8 M lookups/s at 10% load factor to 4.76 M at 90% load factor. The increased performance of the accelerator comes from parallelism in the pipeline and from having up to 16 outstanding near memory requests. In contrast, the software versions serialize all operations and have only a few outstanding far memory requests if the CPU does prefetch.
The complete suite of experiments was also done using the SCM latency profile. Figure 6 compares HMC vs. SCM performance on the hardware accelerated lookup workload using a Zipfian key frequency and 90% hit rate. While query performance with HMC is more than double the performance with SCM at a 10% load, it decreases more steeply and converges with SCM at a load factor of 90%. At higher load factors, the long probe sequence length results in optimal sequential DMA bursts that maximize memory bandwidth, shifting the bottleneck to the hardware accelerator pipeline.
Figure 6: Lookup rates of HMC and SCM using Zipfian key query repeat frequency at 90% hit rate
The speedup trends comparing uniform and Zipfian key distribution in HMC-like memory are shown in Figure 7. They both follow the hardware accelerator's lookup rate as the load factor increases. The uniform key repeat pattern starts at the high of 12.8X over software at low occupancy and decreases linearly to 3.5X at the 90% load factor. Zipfian shows a similar trend, but has slightly less speedup, showing the benefit of CPU cache for repeat queries. It has a speedup of 10.2X at 10% load factor, trending down to 2.9X at the highest load factor.
The chart in Figure 8 shows the same comparison with SCM. Speedup with SCM is lower than with HMC-like memory at a low load factor, but is greater as the table occupancy increases. At higher hash table load factors, the software version is disadvantaged when a smaller proportion of the probes are satisfied from cache, resulting in many, serialized, cache-line-sized memory requests. This effect is pronounced with longer SCM latencies. The accelerator is less affected by latency, because of its larger, overlapped memory accesses. Furthermore, as the hash table approaches its capacity, the probe sequence length increases rapidly.
Figures 9 and 10 quantify the speedup differences between Zipfian query workloads that usually don't find the key (10% hit rate) and those that almost always find the key (90% hit rate). The hardware accelerator query throughput tracks load factor, as the hardware scans the entire max probe sequence length for each query. However, software query rate varies since the query can complete as soon as the key is found. The charts show that at lower load factors, hit rate doesn't affect speedup, which is around 10X in the HMC case. At a 90% hit rate, speedup decreases monotonically as the load factor increases. However, at a 10% hit rate, speedup decreases much less at moderate load factors and even increases a significant amount at higher load factors. With SCM, an even more pronounced increase in speedup is observed at higher load factors. We attribute the upward trend for a low hit rate to the fact that at higher load factors, the probe sequence length is longer and software must scan the whole sequence to determine that the key doesn't exist in the table.
The high latency of SCM pushes that trend even more. Our analysis has swept the critical factors affecting hash table performance applicable across a very wide range of use cases and found that in every case, the near memory hardware hash table lookup accelerator is significantly faster, from a maximum of 12.8X for a 10% load factor with 90% hit rate on a uniform key query frequency in HMC-like memory, to a minimum of 2.9X with 90% load factor and 90% hit rate with a Zipfian key frequency in HMC-like memory. Overall the speedup is higher in SCM for the most challenging workloads, where the latency of bringing the SCM cache lines to the CPU affects CPU performance.
Discussion and future experiments
We first note that we implement a bulk get operation: the overhead of synchronization with the accelerator is amortized by having each request process a 1 K block of queries. This approach is suitable for data analytics and for streaming network requests. We analyze 8 byte keys and 4 byte values, well suited to indexing schemes that hash the variable length key into a fixed size and use the value field as a pointer to the actual key and value.
Communication between the CPU and the accelerator incurs a measurable overhead. Two main components contribute to this overhead: 1) cache flush and invalidate operations on shared buffers used to communicate keys and results, and 2) command messages that indicate the start and end of accelerator activity. Cache management overhead is 2.72 us per 1 K block of keys or an average of 2.65 ns per key. Messaging overhead between the CPU and accelerator also requires about one microsecond per 1 K block of keys. On an x86 Linux platform, the overhead of sending a message through a PCIe user-space driver to an accelerator is about 1.5 us. All together, the communication overhead for a 1 K block ranges between 3% and 25% depending on the load factor.
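The overhead figures above can be cross-checked with a few lines of arithmetic. The sketch assumes that the total per-batch overhead is the sum of the cache management cost (2.72 us) and the messaging cost (about 1 us), and that the reported lookup rates of 64.3 M and 9.13 M lookups/s already include this overhead; both are assumptions, and the names are illustrative.

```python
BATCH_KEYS = 1024            # 1 K block of keys
CACHE_MGMT_US = 2.72         # cache flush/invalidate cost per batch (from the text)
MESSAGING_US = 1.0           # "about one microsecond" per batch (from the text)

def per_key_ns(overhead_us, batch_keys=BATCH_KEYS):
    """Average overhead per key in nanoseconds."""
    return overhead_us * 1000.0 / batch_keys

def overhead_fraction(lookups_per_s, overhead_us=CACHE_MGMT_US + MESSAGING_US,
                      batch_keys=BATCH_KEYS):
    """Share of the batch time spent on CPU/accelerator communication,
    assuming the measured rate includes that overhead."""
    batch_time_us = batch_keys / lookups_per_s * 1e6
    return overhead_us / batch_time_us

cache_ns_per_key = per_key_ns(CACHE_MGMT_US)
frac_sparse = overhead_fraction(64.3e6)  # fastest reported accelerator rate
frac_full = overhead_fraction(9.13e6)    # rate reported at 90% load factor

print(f"cache management: {cache_ns_per_key:.2f} ns per key")
print(f"overhead share at 64.3 M lookups/s: {frac_sparse:.1%}")
print(f"overhead share at 9.13 M lookups/s: {frac_full:.1%}")
```

Under these assumptions the overhead share falls within the stated 3% to 25% range, and it is largest when the table is sparse because each batch then completes quickly.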
These results have been measured in the FPGA emulator's environment. The lookup benchmark software runs on a physical ARM A9 core and the hardware pipeline runs on the FPGA fabric. The ARM clock frequency and the FPGA design clock frequency are set to emulate a 2.57 GHz clock on the ARM core and 1.25 GHz clock for the entire hardware pipeline. Memory latency is emulated by setting delay unit parameters. The emulation environment is 100% accurate for these parameters. It is also highly efficient, with a slowdown over real time of only 20, in contrast to software simulation of CPU, cache, and memory, which incurs slowdowns of thousands of times.
A limitation of our approach is that CPU microarchitecture and cache hierarchy are fixed. The effects of modifications to the instruction set, microarchitecture, or cache hierarchy cannot be explored. For example, cache flush and invalidate are implemented with instructions specific to the ARM architecture. It would be illuminating to explore the tradeoffs of introducing region-based cache-memory synchronization. Running the query workbench in a full software CPU simulator such as gem5 [6] and an HMC or HBM simulator e.g. [17, 24] would enable these aspects to be studied.
Our specific platform, a Zynq development board, is attractive from the cost viewpoint, but it limits us to a table under 1 GB. Porting the emulator to a platform with more external memory would allow evaluation with larger workloads.
The hardware load store units use the hashed keys as offsets from the base address of the hash table to form the final address issued to the memory. The hardware does not perform virtual-to-physical translation, and in these experiments the entire hash table is in a physically contiguous region of memory. Other researchers have designed address translation within the memory with schemes incorporating an IO memory management unit or have used optimized page table translation as in [14]. However, it is difficult to make a generic memory-based page table translation mechanism independent of a specific CPU and OS. For an open addressing hash table, only a single, contiguous physical region of memory needs to be allocated at the outset.
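The address formation can be sketched as follows. The 16-byte bucket size is taken from the pipeline description; reducing the hash with a modulo, the function names, and the example base address are assumptions, and the sketch ignores what happens when a probe sequence runs past the end of the table.

```python
BUCKET_BYTES = 16  # bucket size given in the pipeline description

def bucket_address(table_base, hash_value, n_buckets, bucket_bytes=BUCKET_BYTES):
    """Physical address of the hash address bucket: base plus scaled bucket index."""
    return table_base + (hash_value % n_buckets) * bucket_bytes

def probe_burst(table_base, hash_value, n_buckets, probe_len,
                bucket_bytes=BUCKET_BYTES):
    """Start address and byte length of one sequential probe sequence read."""
    start = bucket_address(table_base, hash_value, n_buckets, bucket_bytes)
    return start, probe_len * bucket_bytes

# Hypothetical table of 1024 buckets at physical address 0x1000_0000.
start, length = probe_burst(0x1000_0000, hash_value=5, n_buckets=1024, probe_len=8)
print(hex(start), length)
```

Because the table occupies one physically contiguous region, this base-plus-offset arithmetic is all that is needed and no page table walk is involved.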
The study has focused on a single CPU core and a single pipeline. Previously we have taken a single core memory trace and multiplexed it to simulate multiple independent concurrent threads. The multiplexed traces were run on an HMC [12]. While the multiplexed trace exhibits a specific memory request order that might be different in a true concurrent execution in which the cores share a cache, it nevertheless can illuminate behavior of the memory interface under load using application-specific access patterns. We hope to perform a similar study for this workload.
Finally, this study has focused on throughput performance. We plan in the future to analyze the energy profile of the hardware/software approach.
CONCLUSIONS
The key/value store is perhaps the most prevalent function in use today to support data analytics, and therefore may have sufficient commercial interest to overcome the expense of near memory hardware. To quantify the potential benefit of near memory hardware accelerated key/value store lookup, we have designed and prototyped a lookup accelerator. Our design optimizes for streaming memory requests by building an open addressing hash table to implement the key/value store.
The accelerator uses simple hardware primitives including load-store units for gather operations, hash units, comparators, splitters, and FIFOs. The hash unit used in these studies is the hardware implementation of one of the best hardware-friendly hash functions published. The building blocks have been composed in a synchronous high performance pipeline capable of delivering a result every few clock cycles.
The accelerator has been evaluated with a query workload sweeping a wide range of parameters and shown excellent speedup on both HMC/HBM memory and storage class memory. Our future work includes evaluation with additional CPU and cache alternatives and concurrent workloads.
ACKNOWLEDGMENTS
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract No. DE-AC52-07NA27344. This work was supported by Lawrence Livermore National Laboratory LDRD project 16-ERD-005. LLNL-CONF-731026
