Host Memory Buffer (HMB) is a prominent feature of the Non-Volatile Memory Express (NVMe) protocol, the state-of-the-art storage interface for emerging storage devices such as Solid-State Drives (SSDs). HMB enables the underlying storage device to use a portion of host memory for caching its address mapping information and/or user data, so that it can overcome the limited capacity of its internal memory. This technology opens an opportunity to optimize I/O performance cost-effectively by sharing the ample host memory with the severely resource-constrained device. However, it is challenging to study HMB-based optimization techniques in practical systems because no SSD development platform supports the HMB feature and commercial SSD products hide their internals, which prevents custom extensions. Motivated by this limitation, this paper presents an HMB-supported SSD development platform called HMB-SSD, which is built by faithfully extending an open-source SSD emulator. HMB-SSD makes it easy to integrate and evaluate I/O techniques for the HMB feature in an SSD emulator that closely mimics the behavior of real devices. We demonstrate the proper operation of our platform and the efficacy of the HMB feature with a case study on write buffering. In our empirical study, the SSD achieves a large performance benefit with the HMB-based write buffer, yielding up to 86.2% better performance than without it.
I. INTRODUCTION
Non-Volatile Memory Express (NVMe) is a storage interface tailored for fast non-volatile storage media such as Solid-State Drives (SSDs) [1]. Because the NVMe protocol was designed to take full advantage of the low latency and internal parallelism of SSDs, it provides excellent performance in SSD-based storage systems. Furthermore, the NVMe protocol supports a set of extended features that can benefit storage systems in various aspects [2]. Host Memory Buffer (HMB) is one such feature, and it allows the underlying storage to make use of the host memory whenever possible. Given HMB support, the SSD controller is able to access the host memory (i.e., DRAM) via the high-speed NVMe interface backed by Peripheral Component Interconnect Express (PCIe), and it can use a portion of host memory as a storage cache, placing the address mapping table or regular data in it.
The associate editor coordinating the review of this manuscript and approving it for publication was Tuo-Hung Hou.
Recently, as SSD capacity scales to terabytes, a large amount of DRAM is required within the storage device to hold gigabytes of address mapping table. However, growing the in-storage DRAM capacity intrinsically increases the manufacturing cost, making the resulting products less competitive. Because of this limitation, several research groups have explored approaches to reducing the DRAM footprint within storage, including on-demand caching of the address mapping table [3], [4], [14]-[16]. However, these approaches essentially trade I/O performance for memory capacity, failing to get the best of both worlds.
The HMB feature opens an opportunity to overcome the memory limitation that emerging high-capacity SSDs are facing. By granting a portion of host memory to the resource-constrained storage device, we can achieve cost-effective performance improvement in storage systems. As an example, the address translation of SSDs becomes faster because the mapping table cache has a lower miss ratio with a larger memory capacity. The DRAM-less SSD, which has no DRAM buffer inside the device, is the extreme example that relies on the HMB feature to alleviate its shortcomings [5], [6]. By effectively harnessing the host memory as a storage buffer, DRAM-less SSDs can provide performance comparable to that of DRAM-equipped products while being superior to them in terms of cost. Moreover, when the HMB feature is supported, the host memory can be directly accessed from both the host and the device, and thus additional benefits can be obtained if it is used effectively. For example, because I/O requests from the host can be passed directly to the SSD controller via the shared memory, they can be processed rapidly, without the overhead of going through the conventional I/O stack.
Despite these potential benefits of the HMB feature, it is highly challenging to develop and test HMB-based I/O optimization techniques in practical systems because there is no SSD development framework supporting the HMB feature. Realizing the HMB feature requires a cross-layer modification of the entire storage stack, ranging from the block I/O layer in the kernel to the underlying storage controller layer, which is non-trivial to implement. In particular, the internals of SSDs, such as the Flash Translation Layer (FTL) algorithm and its architecture, are unknown because they are typically trade secrets.
VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
Motivated by the aforementioned limitation, this paper presents an HMB-supported SSD development platform, called HMB-SSD, which is built by faithfully extending an open-source SSD emulator, FEMU [11]. HMB-SSD makes it possible to readily deploy and evaluate HMB-based I/O schemes in practice by closely mimicking the behavior of real devices. Our HMB extension includes a collection of APIs such as allocation and release functions for the host memory.
To verify the proper operation of our emulator, we devise a simple HMB-based I/O scheme and implement it in our platform. The proposed scheme uses host memory as a storage write buffer through the HMB feature. The evaluation with our prototype shows that the storage I/O performance increases significantly when the host memory is used as a write buffer. This performance enhancement is primarily due to the reduced software stack overhead; the host system can directly communicate with the SSD controller through the shared memory, and thus it can bypass several intermediate software layers. This preliminary experiment shows that the HMB feature is a promising way to resolve the limited memory capacity problem of high-capacity storage devices, and that our development platform is thus timely and worthwhile.
The remainder of this paper is organized as follows. First, we describe the related work in Section II. Then, we explain our implementation of the NVMe SSD emulator and how to exploit the HMB in Section III. We evaluate our scheme with various workloads in Section IV and conclude our paper in Section V.
II. RELATED WORK
A. EMULATION OF SSDS
Because the internals of commercial SSD controllers are not publicly available and thus cannot be modified, emulation or simulation for SSDs has been actively researched for a long time. Yoo et al. proposed a virtual machine-based SSD simulator called VSSIM based on QEMU that can flexibly model various SSD parameters such as the number of channels, block/page sizes, program/erase/read latencies of NAND cells, and channel switch delay [10] . Li et al. also proposed another QEMU-based SSD emulator called FEMU that was designed to model NVMe SSDs and supported many techniques to eliminate overheads in the I/O processing stage for accurate delay computation [11] . In [12] , Jung et al. proposed an SSD simulator called SimpleSSD that simplifies the nondescript features of storage internals while modeling all the detailed hardware and software characteristics. It can be easily integrated into full system simulators such as gem5 and accommodate a complete storage stack.
Song et al. presented an NVMe SSD platform called Cosmos+ OpenSSD that runs on an FPGA development board [21]. When it is connected through an external PCIe port, it is recognized as an NVMe SSD. To the best of our knowledge, it is the only open SSD platform on which different FTL algorithms can be tested. Tavakkol et al. presented MQSim, an SSD simulator that models both NVMe SSDs and SATA SSDs [22]. MQSim models high-bandwidth protocol implementations, steady-state SSD conditions, and the full end-to-end latency of requests in NVMe SSDs.
B. HMB OF NVME INTERFACE
NVMe supports up to 64K submission and completion queues, and each queue can hold up to 64K entries to support high-speed I/O [7]-[9]. It also supports many optional features; the HMB was first introduced in the NVMe 1.2 specification. As shown in Fig. 1, during SSD initialization, the host sends an Identify command to the controller, which replies with a response message including attributes such as Host Memory Buffer Preferred Size (HMPRE) and Host Memory Buffer Minimum Size (HMMIN). The SSD controller uses these attributes to inform the host whether it supports the HMB. Specifically, a non-zero value of HMPRE implies that the SSD supports the HMB. Finally, the HMB is activated when the host allocates an HMB space of size HMPRE and sends a Set Features command to the controller [13].
Although the HMB has significant potential for an efficient I/O stack design, studies on improving I/O performance by using the HMB in SSDs are scarce. In [5], the authors used 128 MB of host DRAM to cache address mapping tables in DRAM-less SSDs. They demonstrated that utilizing the HMB boosts the input/output operations per second (IOPS) significantly compared to other DRAM-less solutions. In [8], Hong et al. used host DRAM as a data cache instead of an address mapping table cache by modifying the NVMe command process and adding a direct memory access (DMA) path between the system memory and the host DRAM. The proposed scheme improved the I/O performance by 23% for sequential writes over an architecture with internal DRAM in the SSD. In [20], although the HMB feature of NVMe was not directly used, Jeong et al. proposed a scheme called the host performance booster (HPB), in which a portion of the host DRAM was used as an address mapping table cache. They defined transactional protocols between the host device driver and the storage device to manage the address mapping table cache in the host. By implementing the HPB on smartphones with a universal flash storage (UFS) device, they demonstrated a performance increase of up to 67% for random read workloads. These previous works do not fully account for or exploit the potential of the HMB, such as using the HMB in cooperation with the host, which is the approach presented in this paper.
III. HMB-SUPPORTED SSD DEVELOPMENT FRAMEWORK
In this section, we describe the HMB-supported SSD development framework for easy deployment and utilization of the HMB feature in storage systems. Our development platform is implemented by extending a state-of-the-art SSD emulator called FEMU [11]. FEMU was selected because its emulated SSD controller communicates with the host-side emulator using the NVMe interface. Moreover, the source code of FEMU is publicly available, and thus it was the ideal option at the time we commenced the research on HMB. We focus on our implementation of HMB support because the other flash sub-systems are mostly similar to those of FEMU. Fig. 2 shows the overall architecture of the proposed HMB-SSD. Our framework consists of three key components: the HMB activator, the HMB allocator, and the Fast Write Buffer (FWB) manager, which are represented with shaded boxes in the figure.
A. HMB ACTIVATOR
The HMB activator resides within the SSD controller and communicates with the host system to enable the HMB feature at the initialization stage. When the Identify command arrives from the host during the initialization of the storage device, the HMB activator replies with a non-zero value of the HMPRE attribute to indicate that the SSD supports the HMB feature. Then, the host system sends a Set Features command with the feature identifier 0xD, which indicates that the host system has allocated a portion of its memory for the HMB and that this memory is now available to the underlying storage. This command includes the direct memory access (DMA) addresses and the size of the allocated HMB space [7], which the SSD controller needs in order to access the host memory as intended.
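The handshake described above can be modeled in a few lines. The following is a minimal sketch, not the HMB-SSD implementation: the class names, the `alloc` callback, and the list of DMA addresses are hypothetical, and the sketch assumes that HMPRE and HMMIN are expressed in 4 KiB units.

```python
from dataclasses import dataclass

HMB_FEATURE_ID = 0x0D  # feature identifier for HMB, as given in the text
UNIT = 4096            # assumption: HMPRE/HMMIN are counted in 4 KiB units


@dataclass
class IdentifyReply:
    hmpre: int  # preferred HMB size, in UNIT-sized units
    hmmin: int  # minimum HMB size, in UNIT-sized units


@dataclass
class SetFeatures:
    feature_id: int
    size_units: int
    dma_addrs: list  # hypothetical: DMA addresses of the allocated host memory


def supports_hmb(reply):
    # A non-zero HMPRE signals that the device supports the HMB.
    return reply.hmpre != 0


def build_set_features(reply, alloc):
    """Return the Set Features command, or None if the HMB cannot be enabled.

    `alloc(n_units)` is a hypothetical host allocator that returns a list of
    DMA addresses, or None when the host cannot spare the memory.
    """
    if not supports_hmb(reply):
        return None
    addrs = alloc(reply.hmpre)  # the host allocates HMPRE worth of memory
    if addrs is None:
        return None
    return SetFeatures(HMB_FEATURE_ID, reply.hmpre, addrs)
```

For instance, an Identify reply with `hmpre=16384` corresponds to a 64 MB request under the 4 KiB unit assumption, while a reply with `hmpre=0` leaves the HMB disabled.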
B. HMB SPACE MANAGEMENT
The HMB allocator is in charge of managing the HMB memory space. In our framework, the HMB space comprises a group of physically contiguous chunks of memory called segments, as shown in Fig. 3. For example, if the space allocated for the HMB is 64 MB, the host device driver divides it into 16 segments, where each segment is a contiguous 4 MB region. Upon request, the HMB allocator allocates or releases memory at memory block granularity (i.e., 4 KB) within a segment.
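The arithmetic of this layout is easy to check. The sketch below uses the sizes from the example (4 MB segments, 4 KB memory blocks); the function names are illustrative, and treating the HMB space as one flat offset range is a simplifying assumption, since separate segments need not be adjacent in host memory.

```python
MB = 1024 * 1024
SEGMENT_SIZE = 4 * MB   # segment size from the example in the text
BLOCK_SIZE = 4 * 1024   # memory block granularity from the text

BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE


def segment_count(hmb_bytes):
    """Number of segments the driver creates for an HMB of the given size."""
    return hmb_bytes // SEGMENT_SIZE


def locate(offset):
    """Map a byte offset in the (assumed flat) HMB space to
    (segment number, block index within that segment)."""
    segment, within = divmod(offset, SEGMENT_SIZE)
    return segment, within // BLOCK_SIZE
```

With these constants, a 64 MB HMB yields 16 segments of 1,024 blocks each.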
The addresses and sizes of all segments are acquired at storage initialization time through the HMB activator, which communicates with the host system. Once all of the necessary information is obtained, the HMB allocator initializes the HMB space by setting up the associated data structures, which are outlined in Fig. 4. As an example, the HMB control block, located in the first segment, maintains the metadata associated with the entire HMB space, such as the locations of the memory block table and the fast write buffer table. Table 1 summarizes the controller-level APIs that our framework provides for accessing the HMB space. When the SSD controller needs a memory block of a specific size in the HMB space, it invokes the hmb_malloc() or hmb_calloc() function. The HMB allocator, called by these routines, allocates the memory space by using the best-fit algorithm, which selects the smallest free fragment that is greater than the requested size. If an adequate fragment is found, the HMB allocator creates a new memory block entry in the memory block table, which is associated with the allocated space. The memory block entry contains the segment number, the offset in the segment, and the size. Finally, the HMB allocator returns the memory address of the allocated space to the caller. Note that each mapping of the HMB space into the SSD can be used for only one type of operation: read or write. This constraint arises because an address mapped by the DMA mapping functions in QEMU, the full-system emulator that HMB-SSD relies on, can be accessed only for a single type of operation. For this reason, the space allocation function returns two memory addresses to the SSD controller: hmb_read, which points to the read-only mapping, and hmb_write, which points to the write-only mapping of the space in host memory.
Under this structure, when the storage controller reads data from the HMB space, it uses the hmb_read address (e.g., value = *hmb_read), and when it writes data to the HMB space, it uses the hmb_write address (e.g., *hmb_write = value).
The hmb_free() function is invoked when the SSD wants to release the allocated memory space. This function deallocates the memory space at the given address and deletes the associated memory block entry from the memory block table.
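To make the allocation and release steps concrete, here is a minimal best-fit sketch. It is an illustrative model rather than the framework's code: the class and method names are hypothetical, a fragment exactly as large as the request is accepted (an assumption of this sketch), requests are rounded up to the 4 KB block granularity, and neighbouring free fragments are not merged on release.

```python
BLOCK_SIZE = 4 * 1024  # memory block granularity from the text


class HmbAllocator:
    """Toy model of the HMB allocator (names are hypothetical)."""

    def __init__(self, n_segments, segment_size):
        # Free fragments, each described as (segment, offset, size).
        self.free = [(s, 0, segment_size) for s in range(n_segments)]
        # Memory block table: (segment, offset) -> size of the allocation.
        self.table = {}

    def malloc(self, size):
        # Round the request up to a whole number of memory blocks.
        size = -(-size // BLOCK_SIZE) * BLOCK_SIZE
        candidates = [f for f in self.free if f[2] >= size]
        if not candidates:
            return None
        # Best fit: the smallest fragment that can hold the request.
        best = min(candidates, key=lambda f: f[2])
        self.free.remove(best)
        seg, off, frag = best
        if frag > size:
            # Keep the unused tail of the fragment on the free list.
            self.free.append((seg, off + size, frag - size))
        self.table[(seg, off)] = size
        return seg, off

    def free_block(self, addr):
        # Delete the memory block entry and return its space to the free list.
        size = self.table.pop(addr)
        self.free.append((addr[0], addr[1], size))
```

After one block is allocated and released, a later request of the same size reuses that 4 KB hole instead of carving a new piece out of a large fragment, which is the behaviour best fit is meant to provide.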
C. FAST WRITE BUFFER MANAGER
One essential advantage of the HMB feature is that it offers memory space accessible to both the host and the device. This property extends the communication between the host and the device beyond the conventional storage protocols, significantly enhancing the efficiency and flexibility of the storage interface. With the HMB feature, cross-layer optimizations thus become possible without violating the abstraction of the storage system.
We augmented our framework to utilize this advantage by implementing the FWB manager. The FWB manager enables the underlying storage device to leverage the host-side memory as its write buffer. This flexible memory access control between different abstract layers goes beyond merely limiting the performance penalties introduced by a constrained memory capacity of the storage, and provides a short communication channel between two ends.
Under the conventional storage interfaces, I/O requests must travel across multiple layers of the storage system. As an example, in the Linux kernel, I/O requests are created by using the bio structure in the block I/O layer after passing through a file system such as EXT4. They are sent to the submission queues of the NVMe interface via software queues and hardware dispatch queues in the block I/O layer and then finally passed to the SSD controller. In the controller, they are usually first written to a write buffer in DRAM and then flushed to the NAND flash memory later.
In contrast to this mechanism, given the HMB-based storage buffer, the host system is able to pass I/O requests into the storage buffer directly, bypassing all of the intermediate layers. The FWB manager is an implementation of this shortcut I/O interface in our development framework. When the HMB is activated by the host, the FWB manager allocates a certain capacity for the storage write buffer within the HMB space according to the request of the storage controller. The FWB manager consists of two sub-components: the host-side FWB manager integrated into the block I/O layer and the device-side FWB manager running within the SSD controller.
When a read request arises, the FWB manager investigates the HMB write buffer first to check whether the requested data resides in it. If the request hits in the buffer, it is serviced by the data in buffer; otherwise, the data are read from underlying storage through the conventional storage interface. In contrast, the write requests are serviced by the HMB write buffer without reaching out to storage in all cases. If the previous version of the requested data exists in the HMB write buffer, the FWB manager overwrites it; otherwise, the FWB manager allocates a free space and writes the given data in the newly allocated space in the HMB write buffer.
Because the capacity of the HMB write buffer is limited, in-use buffers must eventually be reclaimed when it runs out of space. For reclamation, our framework maintains additional data structures in the HMB space, including the FWB table and Least Recently Used (LRU) lists, as shown in Fig. 4. The FWB table comprises buffer heads, one for each buffer, which record the segment number of the buffered data, the offset in the segment, the logical page number, etc. For efficient replacement of the in-use buffers, we use two LRU lists: the LRU_all list maintains all buffers in recency order, and the LRU_clean list maintains only clean buffers in recency order. The LRU_clean list is used by the host-side FWB manager because it can evict only clean buffers and because it is more efficient to look up victims in the LRU_clean list than in the LRU_all list, which contains both clean and dirty buffers. Because dirty buffers must eventually be flushed to the underlying storage, which the host-side FWB manager cannot do, the controller-side FWB manager selects dirty buffers as victims by using the LRU_all list.
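The division of labor between the two LRU lists can be sketched as follows. This is an illustrative model, not the framework's data structures (which live in the HMB space): the class and method names are hypothetical, buffers are keyed by logical page number, and the sketch simply drops a buffer once the controller side has taken it for flushing.

```python
from collections import OrderedDict


class FwbLru:
    """Toy model of the two LRU lists (names are hypothetical)."""

    def __init__(self):
        self.all = OrderedDict()    # lpn -> dirty flag, least recent first
        self.clean = OrderedDict()  # lpn -> None, clean buffers only

    def touch(self, lpn, dirty):
        """Record an access; a write (dirty=True) marks the buffer dirty."""
        dirty = dirty or self.all.get(lpn, False)
        self.all[lpn] = dirty
        self.all.move_to_end(lpn)
        if dirty:
            self.clean.pop(lpn, None)
        else:
            self.clean[lpn] = None
            self.clean.move_to_end(lpn)

    def evict_clean(self):
        """Host side: evict the least recently used clean buffer, if any."""
        if not self.clean:
            return None
        lpn, _ = self.clean.popitem(last=False)
        del self.all[lpn]
        return lpn

    def flush_lru(self):
        """Controller side: take the least recently used buffer overall and
        report whether it was dirty (and therefore needs a flush)."""
        if not self.all:
            return None
        lpn, dirty = self.all.popitem(last=False)
        self.clean.pop(lpn, None)
        return lpn, dirty
```

Keeping clean buffers in their own list lets the host side find a victim in constant time, without scanning past dirty buffers that it is not allowed to evict.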
The host-side FWB manager is invoked by submit_bio(), which processes I/O requests contained in a bio structure of the block layer. Because the controller manages the write buffer in units of pages whereas the bio structure in the host is expressed in sectors, the sector addresses of the bio are first translated to logical page addresses. In Fig. 5, the bio covers sectors s_2 to s_9, which map to two logical pages, lp_0 and lp_1, because the sector size is 512 B and the page size is 4 KB in this example. The host-side FWB manager then divides the bio structure into bio_hit for already buffered data and bio_miss for unbuffered data. For read requests, bio_miss is processed in the original I/O handling manner by invoking generic_make_request(). Processing bio_hit, however, is more involved. The host-side FWB manager finds the buffer heads associated with the sectors in bio_hit by using a hash table. It then reads the requested data from the write buffer and updates the two LRU lists. Finally, the handling of the read request is finished by invoking bio_endio(). In this case, the process that issued the read request is not blocked.
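The address translation and the hit/miss split can be illustrated with a short sketch. The function names are hypothetical, a plain set of logical page numbers stands in for the hash table of buffer heads, and the split is done at page granularity for simplicity.

```python
SECTOR_SIZE = 512   # sector size from the example in the text
PAGE_SIZE = 4096    # page size from the example in the text
SECTORS_PER_PAGE = PAGE_SIZE // SECTOR_SIZE  # 8 sectors per logical page


def sectors_to_pages(first_sector, last_sector):
    """Logical page numbers touched by the inclusive sector range."""
    first_page = first_sector // SECTORS_PER_PAGE
    last_page = last_sector // SECTORS_PER_PAGE
    return list(range(first_page, last_page + 1))


def split_bio(first_sector, last_sector, buffered):
    """Split a request into (hit pages, miss pages).

    `buffered` is a set of logical page numbers currently in the write
    buffer; it stands in for the hash table lookup described in the text.
    """
    pages = sectors_to_pages(first_sector, last_sector)
    hit = [p for p in pages if p in buffered]
    miss = [p for p in pages if p not in buffered]
    return hit, miss
```

For the request in the example, sectors 2 through 9 map to logical pages 0 and 1, because sectors 0 to 7 fall in page 0 and sectors 8 to 15 fall in page 1.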
Write requests are processed in a manner similar to read requests. The sector addresses of the bio are translated into logical page addresses, and then the bio is divided into bio_hit and bio_miss. For bio_hit, the host-side FWB manager identifies the buffer heads by using the hash table, writes the data into the HMB space, updates the two LRU lists, and invokes bio_endio(). For bio_miss, because space is needed to write the data, it first checks whether the write buffer in the HMB space has sufficient space to buffer the requested data. This can be determined easily by monitoring the total number of buffers, FWB_total, and the number of dirty buffers, FWB_dirty. If there are sufficient empty buffers for bio_miss, it simply allocates new buffer heads in the FWB table, writes the requested data into the write buffer, and updates the two LRU lists. Otherwise, before writing the requested data, it repeatedly evicts the least recently used buffers in the LRU_clean list until sufficient space for bio_miss is obtained.
If there are insufficient clean or free buffers to accommodate the data to be written for bio_miss, dirty buffers must be flushed to the NAND flash memory. In such a situation, because the host-side FWB manager cannot flush dirty buffers directly, the write requests are passed to the SSD controller through the original I/O handling process. When the SSD controller receives the write requests, the controller-side FWB manager repeatedly flushes dirty buffers to the NAND flash memory, starting from the least recently used buffers in the LRU_all list, to free space. Once sufficient free buffers have been obtained through this reclamation, it allocates new buffer heads in the FWB table, writes the requested data into the write buffer, and updates the two LRU lists.
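The three outcomes of a write miss described above can be summarized as a small decision function. This is a sketch under stated assumptions: the function name and return values are hypothetical, and in addition to the two counters named in the text (the total number of buffers and the number of dirty buffers) it takes the number of in-use buffers as an extra input so that free and clean buffers can be told apart.

```python
def plan_write(n_needed, fwb_total, n_used, n_dirty):
    """Decide how the host side handles a write miss of `n_needed` buffers.

    fwb_total: total number of buffers in the write buffer
    n_used:    buffers currently in use (hypothetical extra counter)
    n_dirty:   buffers currently dirty

    Returns (action, number of clean buffers to evict).
    """
    free = fwb_total - n_used
    clean = n_used - n_dirty
    if free >= n_needed:
        # Enough empty buffers: allocate buffer heads and write directly.
        return ("buffer", 0)
    if free + clean >= n_needed:
        # Evict just enough least recently used clean buffers first.
        return ("evict_clean", n_needed - free)
    # Only dirty buffers remain: the controller side must flush them.
    return ("pass_to_controller", 0)
```

Note that the host side can serve the request by itself exactly when the total number of buffers minus the number of dirty buffers is at least the number of buffers needed, which is consistent with the text's remark that the two counters suffice for the check.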
IV. PERFORMANCE EVALUATION
The proposed HMB-supported SSD development framework (HMB-SSD) is implemented by modifying a virtual SSD controller based on QEMU 2.9.0 and Linux kernel 4.13.10 [11] . Table 2 shows the hardware specifications and simulator configuration used in our experiments. In all experiments, we used an O_DIRECT flag to bypass various caching layers in the host and send the requests to the block I/O layer directly [18] .
To demonstrate the proper operation and performance benefits of HMB-SSD, we evaluated the I/O performance of HMB-SSD when the underlying SSD controller uses host memory as a write buffer (FWB) and when it does not (ORG). To investigate the maximum benefit of the FWB scheme, we first allocated a large portion of host memory as a storage write buffer so that all requests are serviced in it. As the workload, we used a simple micro-benchmark that generates 10,000 sequential 512 B read requests and write requests, respectively. In Fig. 6, the average latencies of the read and write requests are reduced by 95% and 99%, respectively, when they are directly serviced by the FWB. This large performance enhancement is achieved by eliminating the substantial delay that I/O requests would incur if they went through the multiple I/O queues of the conventional I/O stack via the NVMe interface.
Figs. 7 and 8 present the results of the performance evaluation carried out with real-world workloads. We used the MSR Cambridge workloads, which were collected in the block I/O layer of running servers [17], [19]. We studied 13 workloads whose offset ranges can be handled by our framework (Table 3). Table 3 also shows the ratio of read and write operations in terms of the number of requests and the total I/O traffic. With these workloads, we measured the hit ratio and the average latency as the size of the write buffer in the HMB space increases. Fig. 7 shows the hit ratios of read and write requests when varying the HMB buffer size. These results were collected from the host-side FWB manager. When the HMB buffer size is 8, 32, 128, 512, and 2,048 MB, the average hit ratios of the read requests are 2.1%, 3.5%, 7.4%, 13.4%, and 14.7%, respectively. On the other hand, the average hit ratios of write requests are 43.3%, 48.2%, 84.2%, 97.7%, and 100.0%, respectively.
A hit for a write request indicates that the requested data exists in the write buffer or that there is sufficient free space in the write buffer to hold it. In particular, the hit ratio improves dramatically as the HMB buffer size grows from 32 MB to 128 MB. This is due to the small working set size of the workloads used, which leads to many hits even with a modest HMB space. When the HMB space size is 512 MB, almost all write requests from all the workloads can be buffered. Thus, we can conclude that a 512 MB HMB space is sufficient to buffer almost all write requests in our experiments.
Fig. 8 indicates that the average I/O latency and runtime of the workloads are also improved significantly. Because multiple I/O stack layers are bypassed in the FWB scheme, the I/O latency decreases accordingly. Moreover, because the HMB write buffer intrinsically reduces the storage I/O traffic by servicing many requests on the host side, the SSD has fewer requests to handle and its performance improves. Compared to when the HMB write buffer is not used, the average I/O latencies of all workloads are improved by 59.0%, 63.2%, 78.5%, 85.4%, and 86.2% for HMB buffer sizes of 8, 32, 128, 512, and 2,048 MB, respectively. Similarly, the average normalized runtimes of the workloads are improved by 64.6%, 68.0%, 83.0%, 89.9%, and 90.9%, respectively. In the results for prxy_0, the I/O latency is greatly improved with only 8 MB of HMB write buffer. We believe that this is because the workload is write-intensive and has high locality. In this workload, 78% of the data is accessed more than once, and 43% of the total requests are for only 1% of the entire data. On the other hand, the I/O performance of hm_1 is not improved in terms of I/O latency and runtime because it is read-intensive.
V. CONCLUSION
In this paper, we presented an extended SSD development framework called HMB-SSD, which supports the HMB feature over the NVMe interface. HMB-SSD provides a set of functionalities and APIs that readily enable the storage controller to make use of host memory for various purposes. We also demonstrated the correct operation of our framework and the effectiveness of the HMB feature in storage systems through experiments using micro and macro benchmarks. When host memory is used as a storage write buffer, the average I/O latency across workloads is improved by 59.0% to 86.2%, depending on the buffer size. As future work, we will investigate further optimization techniques that utilize the HMB space more cost-effectively, including a more sophisticated buffer replacement policy and/or dynamic adjustment of the HMB size with respect to the workload.
