Missions under consideration, both near-Earth and deep-space, will require data recorder capacities that double approximately every three years. The same demand for ever-larger mass storage exists in other applications, such as unmanned aerial vehicles (UAVs) and echo recording for phased array radar (PAR). All these scenarios call for storage devices with larger capacity, higher I/O bandwidth, lower latency and smaller size. In this paper, we combine Field Programmable Gate Array (FPGA)-based efficient cores for the emerging Non-Volatile Memory express (NVMe) protocol with Flash storage to reduce the bandwidth and latency penalties of the operating system (OS) storage I/O software stack. We provide an alternating operation scheme to guarantee consistency of I/O bandwidth. The device has two independent optical fiber channels to ensure the reliability of interconnections, and four NVMe Flash storage devices recording data at the same time, which increases its integration and scalability. The prototype has a capacity of 8TB and a volume of only 990 cubic centimeters, and weighs only 2.2 pounds. Experimental results demonstrate that the continuous I/O bandwidth of each channel is above 1GBps with variance no more than 7% over its total capacity, and that the NVMe host logic core achieves up to 88% lower latency than the OS-based system.
INDEX TERMS FPGA efficient core, NVMe, OS I/O stack, optical fiber.
I. INTRODUCTION
Demand for storage with high I/O bandwidth and low latency is greater than ever, driven by constant data growth in space exploration [1], PAR (phased array radar) experiments [2] and UAV (unmanned aerial vehicle) reconnaissance [3]. Emerging storage technologies such as the NVMe (non-volatile memory express) protocol over PCIe provide high throughput and low latency for SSDs (solid state drives). NVMe SSDs are widely used in embedded devices because of their high speed and small physical size. For example, [4] combines NVMe SSDs with an embedded high-speed X-ray imaging spectroscopy system for solar observations, and [5] presents a compact NVMe storage device based on Zynq.
ARM-based NVMe embedded storage devices struggle to exploit the high speed of NVMe SSDs. These devices cannot reach a high I/O bandwidth because too much time is spent in the I/O software stack inside the operating system (OS). The latency derives from data passing through the user application, virtual file system, block layer and NVMe driver layer in the Linux kernel. Recent works [6], [7] have focused on optimizing the software stack and NVMe driver to lower latency, but these approaches are more effective on high-performance data center servers than on embedded devices with low-power processors. [8] provided ''FastPath'' to optimize the I/O bandwidth and latency of an NVMe embedded prototype, but the integration level of that device is limited because the Zynq has only one PCIe hard core.
The associate editor coordinating the review of this manuscript and approving it for publication was Junxiu Liu.
In this paper, we implement all the NVMe host functions on the FPGA. The logical efficient core in the FPGA transforms user commands into NVMe command entries and then writes these entries into the submission queue. The efficient core also manages completion entries and masks the MSI-X interrupt. In addition, an NVMe command processing controller is designed on the FPGA fabric to replace the NVMe driver in the OS. All the NVMe SSD functions are hosted by the logical core, which brings the speed potential of the NVMe SSD into play. Combining the FPGA-based efficient cores with NVMe SSDs, we design an optical fiber storage device. With high-throughput optical fibers and high-bandwidth NVMe SSDs, this device sustains storage and read speeds above 2GBps over its total capacity of 8TB. To improve the flexibility of the SSD, the data store and read paths are divided into two channels, and each channel can work independently. Using compact components such as LCC (leadless chip carrier) packaged optical modules and M.2 NVMe SSDs, the storage device achieves a compact volume. In addition, a forced air-cooling system keeps the temperature of the device within an appropriate range.
The rest of this paper is organized as follows. Section II introduces NVMe storage devices based on Linux OS. Section III presents the FPGA-based efficient core and explains the hybrid scheme that fuses alternating operation with consecutive command submission. Section IV describes the design of the optical fiber SSD, including its hardware implementation and logic framework. Section V presents the performance evaluation. Section VI concludes the paper and outlines future work.
II. PROBLEM FORMULATION
The NVMe storage device based on Linux OS and Xilinx ZC706 development board only reaches a throughput of 142.0MBps for sequential read and 128.4MBps for sequential write. The throughput of this device is limited by the speed of numerous steps in OS kernel for accessing NVMe devices. To improve the performance of NVMe devices, it is necessary to make sure that the control architecture of NVMe SSDs is not the bottleneck.
The NVMe host architecture in Linux OS has two layers: the software stack encapsulates user commands as NVMe submission entries, and the NVMe driver controls the NVMe SSDs according to the command processing protocol [9]. In the Linux software stack shown in Fig. 1, requests from user applications are passed to the block layer through the file system or the IOCTL function. In the block layer, these requests are first sorted and merged by the I/O scheduler and then added to the I/O request queue, where they wait to be read by the NVMe driver. The NVMe driver allocates resources to build the NVMe submission entries and sends them to the NVMe devices. In addition, the NVMe driver decides whether to release the I/O resources or reissue the submission entries according to the command processing status in the NVMe completion entries received from the NVMe devices.
The storage I/O software stack in Linux OS introduces large latency because of the numerous steps needed to access devices. Table 1 provides the specifications of the evaluation platform, and Table 2 shows the experimental performance results obtained with the Flexible I/O benchmark [10]. The results indicate that the NVMe system running on the ZC706 development board cannot take advantage of the speed of the NVMe SSD, and that about 90% of the total latency during reads and writes is caused by the software stack in Linux OS. This is because the processing speed of the software stack is limited by the number of threads and the clock frequency of the ARM Cortex-A9 processors.
III. EFFICIENT CORE SCHEME BASED ON FPGA
As demonstrated in Section II, the NVMe system based on Zynq and embedded Linux cannot reach a high I/O bandwidth and low latency. To improve the performance of portable NVMe storage devices, we designed the FPGA-based NVMe efficient core and alternating operation scheme.
The FPGA-based NVMe efficient core is designed to realize the functions of the software stack and NVMe driver in the OS. This efficient core achieves low latency by reducing the time spent in the NVMe command submission and completion phases. The alternating operation with two NVMe SSDs guarantees the consistency of I/O bandwidth, which ensures the stability of the instantaneous I/O bandwidth.
A. RELATED WORK
As stated in Section II, the I/O bandwidth of an OS-based NVMe SSD is constrained by the latency inherited from the I/O stack. To reduce this latency, some researchers try to optimize the I/O stack. NVMeDirect, a new storage I/O framework, was proposed to improve storage performance [11]. By allowing applications to access storage directly without any hardware modification, the proposed framework improves small-file I/O performance by 12.5% and real-world mobile workload performance by up to 20%. Severe performance fluctuations, and hence poor worst-case performance, are serious problems for NAND-based storage, mainly because of the erase-before-write restriction. To address this problem, [12] proposed a hybrid mapping flash translation layer policy that uses a dynamic active log pool, partial merge, and moving of valid pages to reduce the worst-case latency of write requests. Experimental results show that the average latency is shortened by up to 4.1%. An FPGA-based NVMe Flash efficient core was exploited to implement a persistent caching mechanism for Apache Cassandra, replacing the DRAM cache to reduce deployment costs [13]. This strategy gives Apache Cassandra access to a large cache layer at lower cost. Addressing the demands of energy efficiency and computing flexibility in heterogeneous computing, Zhang and Jung [14] proposed a data-processing accelerator that self-governs heterogeneous kernel executions and data storage accesses by integrating many flash modules in lightweight multiprocessors. Kwon [15] proposed a storage system based on byte-addressable non-volatile memory as an alternative to NAND flash memory, and compared and analyzed the I/O parallelism optimization of static and dynamic address mapping algorithms.
It is commonly recognized that the I/O latency of NVMe-based storage limits its deployment in real-time applications. Most researchers therefore work on reducing the latency by optimizing the I/O stack or by seeking alternative storage media. Few references tackle FPGA-accelerated NVMe, although an FPGA-based NVMe efficient core is an attractive and challenging way of improving the I/O latency of NVMe-based storage. In this paper, we try to bridge this gap.
B. NVME EFFICIENT CORE BASED ON FPGA
The logical structure of FPGA-based NVMe efficient core shown in Fig. 2 consists of PCIe Root-port module, Command Pre-processing module, Submission module, Data Interaction module and Completion module.
The PCIe Root-port module realizes data transmission between FPGA and NVMe SSDs using Xilinx integrated blocks for PCIe. These blocks have various functions for PCIe physical layer and data link layer, interacting with user logic by AXI-Stream data bus [16] .
The Command Pre-processing module receives user requests and converts them to a standard command form. The opcode, start address and length of each user request are generated in this module. For example, a continuous read/write enable signal is split into multiple read/write sub-commands, one per segment, each carrying its own opcode, start address and length. This module replaces the processing path through IOCTL and the Block Layer in the OS.
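The splitting step described above can be sketched in software as follows. This is only an illustration of the idea, not the authors' FPGA logic: the `SubCommand` type, the `split_request` helper and the default segment size of 256 logical blocks (128KB with 512-byte blocks) are assumptions introduced here. The opcode values 0x01 (Write) and 0x02 (Read) are those of the NVMe NVM command set.

```python
from dataclasses import dataclass
from typing import List

OPC_WRITE = 0x01  # NVM command set opcode for Write
OPC_READ = 0x02   # NVM command set opcode for Read


@dataclass(frozen=True)
class SubCommand:
    """One segment of a larger request (hypothetical type for illustration)."""
    opcode: int
    start_lba: int
    num_blocks: int


def split_request(opcode: int, start_lba: int, total_blocks: int,
                  max_blocks: int = 256) -> List[SubCommand]:
    """Split one continuous request into fixed-size sub-commands.

    max_blocks = 256 is an assumed segment size (128KB with 512-byte blocks).
    The last sub-command carries whatever remainder is left.
    """
    if total_blocks <= 0 or max_blocks <= 0:
        raise ValueError("block counts must be positive")
    subs = []
    lba = start_lba
    remaining = total_blocks
    while remaining > 0:
        n = min(remaining, max_blocks)
        subs.append(SubCommand(opcode, lba, n))
        lba += n          # next segment starts right after this one
        remaining -= n
    return subs
```

Each returned sub-command has a unique start address, so consecutive segments cover the original range without gaps or overlap.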
The Submission module and Completion module each realize part of the NVMe driver functions. These components are responsible for allocating resources and managing NVMe submission and completion entries. The Submission module caches a command submission entry and updates the submission queue tail doorbell in the NVMe device; the NVMe device then reads the submission entry using a PCIe memory read request. A command completion entry is received and temporarily stored in the Completion module, where the check phase decides whether to release resources or report error information according to the completion status. These modules set up the submission and completion paths between the user application and the NVMe device. In addition, the completion queue is checked by polling instead of by MSI-X interrupt, which simplifies the design of the Completion module.
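To make the submission entry concrete, the following sketch packs a 64-byte NVMe read/write submission queue entry and advances a tail index. The field positions follow the NVMe base specification (opcode and command identifier in Dword 0, namespace ID in Dword 1, PRP entries in Dwords 6 to 9, starting LBA in Dwords 10 and 11, zero-based block count in Dword 12). The function names and the software model itself are assumptions for illustration; the paper implements this step in FPGA logic.

```python
import struct

SQE_SIZE = 64  # size of one submission queue entry in bytes


def build_rw_entry(opcode: int, cid: int, nsid: int, prp1: int, prp2: int,
                   slba: int, num_blocks: int) -> bytes:
    """Pack a read/write submission entry (illustrative helper)."""
    if not 1 <= num_blocks <= 0x10000:
        raise ValueError("num_blocks must be in 1..65536")
    dw0 = (opcode & 0xFF) | ((cid & 0xFFFF) << 16)   # opcode + command ID
    dw12 = num_blocks - 1                            # NLB field is zero-based
    return struct.pack(
        "<IIQQQQQIIII",
        dw0,        # Dword 0
        nsid,       # Dword 1: namespace identifier
        0,          # Dwords 2-3: reserved
        0,          # Dwords 4-5: metadata pointer (unused here)
        prp1,       # Dwords 6-7: PRP entry 1
        prp2,       # Dwords 8-9: PRP entry 2
        slba,       # Dwords 10-11: starting LBA
        dw12,       # Dword 12: number of logical blocks (zero-based)
        0, 0, 0,    # Dwords 13-15: unused here
    )


def next_tail(tail: int, queue_depth: int) -> int:
    """Tail index to write to the doorbell after queuing one entry."""
    return (tail + 1) % queue_depth
```

After an entry is placed in the queue, the new tail value is what gets written to the submission queue tail doorbell, which prompts the device to fetch the entry.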
Unlike a conventional DMA (direct memory access) design with extra DDR (double data rate synchronous dynamic random-access memory) outside the FPGA [17], the Data Interaction module uses FIFOs (first in, first out) to cache data to be stored or received. The hardware thus becomes simpler, in both the component scheme and the printed circuit board design. Experimental results in Section V indicate that the data caching scheme with FIFOs inside the FPGA is effective.
The command processing flow in the NVMe efficient core is simpler and faster than that in the ARM-based NVMe system (baseline). The steps from the file system down to the NVMe driver in the OS software stack are implemented in the NVMe efficient core. Command entries are managed by the entry process phases (the Generate Submission Entry phase and the Check Completion Entry phase) and the Write DB phase in the NVMe efficient core, rather than through the file system, the page cache and device mapper, the block device and the I/O scheduler in the OS software stack. The transmission latency of control signals and data between registers and the PCIe hardcore is lower than that between the NVMe driver and the underlying hardware. The register-level operations and simplified processing in the NVMe efficient core reduce the latency from the user application to the PCIe hardcore. The completion queue status check is triggered by the arrival of a new completion entry rather than by an MSI-X interrupt, which reduces the latency of the command processing architecture.
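As a rough software model of interrupt-free completion checking, the sketch below polls a completion queue using the phase tag defined by the NVMe base specification: in a 16-byte completion entry, Dword 3 holds the command identifier in bits 15:0, the phase tag in bit 16 and the status field in bits 31:17. The `CompletionPoller` class and its method names are assumptions for illustration and do not describe the authors' hardware.

```python
import struct
from typing import Optional

CQE_SIZE = 16  # size of one completion queue entry in bytes


def parse_cqe(raw) -> dict:
    """Decode the fields of one completion entry."""
    _dw0, _dw1, dw2, dw3 = struct.unpack("<IIII", bytes(raw))
    return {
        "sq_head": dw2 & 0xFFFF,     # submission queue head pointer
        "sq_id": dw2 >> 16,          # submission queue identifier
        "cid": dw3 & 0xFFFF,         # command identifier
        "phase": (dw3 >> 16) & 1,    # phase tag
        "status": dw3 >> 17,         # status field, 0 means success
    }


class CompletionPoller:
    """Polls a completion queue held in memory (hypothetical model)."""

    def __init__(self, depth: int):
        self.depth = depth
        self.head = 0
        self.expected_phase = 1  # queue memory starts zeroed, first pass uses 1

    def poll(self, queue_mem) -> Optional[dict]:
        """Return the next new entry, or None if nothing new has arrived."""
        off = self.head * CQE_SIZE
        cqe = parse_cqe(queue_mem[off:off + CQE_SIZE])
        if cqe["phase"] != self.expected_phase:
            return None                  # slot still holds an old entry
        self.head += 1
        if self.head == self.depth:      # wrap and flip the expected phase
            self.head = 0
            self.expected_phase ^= 1
        return cqe
```

A caller would release the resources of the command named by `cid` when `status` is zero and report an error otherwise, mirroring the check phase described for the Completion module.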
C. ALTERNATING OPERATION SCHEME
The NVMe controller on the NVMe SSD manages the Flash Translation Layer (FTL), which establishes the mapping between SSD logical block addresses and Flash physical addresses [18]. In general, the FTL is cached in the SDRAM of the NVMe SSD, refreshed by the NVMe controller and stored in the Flash. When the amount of new mapping information in the SDRAM reaches a threshold, the corresponding FTL is written to Flash memory [19]. During this FTL refresh, data interaction between the NVMe SSD and the NVMe host is interrupted for about 1 to 3 milliseconds. Thus, the instantaneous I/O bandwidth of the NVMe storage device drops and the latency of write commands increases. This extra latency is large enough that the data cache FIFO inside the FPGA overflows. In addition, the threshold is not a fixed value; we only know that it is larger than a certain value. For these reasons, the alternating operation scheme is designed to avoid the influence of this extra latency.
In the alternating operation scheme, depicted in Fig. 3, two NVMe SSDs process write commands alternately. The active SSD is set online while the other is set offline. When the amount of data continuously written into one SSD reaches a user-defined threshold (for example 256GB), an interrupt is generated, the online SSD is shut down and set offline, and the other SSD switches to online and continues to execute write commands. During the shutdown process, the FTL is forced to be written to Flash memory. For the whole storage device, the extra latency caused by the FTL refresh therefore does not interrupt data interaction between the NVMe SSDs and the NVMe hosts, so this design effectively guarantees the consistency of I/O bandwidth. The low-level address mapping is still managed by the NVMe controller in each individual NVMe SSD, whereas the top-level addresses of the two coupled SSDs, which are labeled by software running in the FPGA, are sequential and interleaved.
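One way to picture the sequential and interleaved top-level addressing is the mapping below, which sends each threshold-sized chunk of the channel's address space to the two SSDs in turn. The paper does not give the exact mapping, so the `locate` function, its chunk arithmetic and the binary interpretation of 256GB are assumptions made purely for illustration.

```python
GIB = 1 << 30  # assumed binary interpretation of "GB" for this sketch


def locate(global_offset: int, threshold: int = 256 * GIB,
           num_ssds: int = 2):
    """Map a channel-level byte offset to (ssd_index, local_offset).

    Chunks of `threshold` bytes alternate between the SSDs, so the
    online SSD changes every time `threshold` bytes have been written.
    """
    if global_offset < 0:
        raise ValueError("offset must be non-negative")
    chunk, within = divmod(global_offset, threshold)
    ssd = chunk % num_ssds                       # which SSD is online
    local = (chunk // num_ssds) * threshold + within
    return ssd, local
```

Under this assumed mapping, a continuous write stream crosses from one SSD to the other exactly at each threshold boundary, which is the moment the previously online SSD can be shut down and flush its FTL.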
IV. HARDWARE PROTOTYPE OF THE NVME OPTICAL FIBER SSD
With the FPGA-based NVMe efficient core and the alternating operation scheme, a portable optical fiber SSD is designed to achieve high I/O bandwidth and low latency. This storage device has two independent optical fiber channels, each channel uses two NVMe SSDs to store and read data independently. In addition, a forced air-cooling system is designed to keep the temperature of the device within an appropriate range. 
A. HARDWARE SCHEMATIC AND IMPLEMENTATION
The schematic of the optical fiber SSD is shown in Fig. 4. In this design, four Samsung 970 EVO NVMe SSDs are hosted by a single FPGA (XC7VX485T), which has 48 GTX transceivers and 4 integrated blocks for PCIe (PCIe hardcore) [20]. Optical modules supporting a 12.5Gbps data rate are used for data transmission and request reception. Two LCC optical modules receive the data to be stored, and one LCC module is reserved. One QSFP+ optical module receives requests from the user application and exports data from the NVMe SSDs to the host computer. In addition, status information of the device is stored in the SPI Flash. Fig. 5 shows the printed circuit board of the optical fiber SSD, which has 3U dimensions (100 × 160 mm). The left is the top side view and the right is the bottom side view.
The power supply scheme consists of several discrete modules. Fig. 6 illustrates the details of power supply scheme, including module type, voltage and power up sequence. The LMZ31710 modules supply power for four NVMe SSDs, three optical modules and other power modules such as LMZ30606, TPS74401 and TPS74701. The LMZ30606 buck converters supply power for the kernel and I/O banks of FPGA, and the TPS linear regulators provide low-ripple voltage output power for the GTX transceivers on FPGA. To achieve minimum current draw and ensure that the I/Os are three-stated at power on, we use the Enable and Power Good signals of power modules to implement the recommended power-on sequence of FPGA [21] .
B. LOGIC FRAMEWORK
The firmware in FPGA is responsible for interface interconnection and storage module management, and its logic framework is shown in Fig. 7 .
The Aurora 8b/10b protocol, which is designed for Xilinx FPGAs, is used for the external interface of the device and reaches a bit rate of 12.5Gbps with four lanes [22]. Considering protocol and coding overhead, the data payload rate of Aurora can exceed 1GBps with a bit error rate below $10^{-14}$. The data cache module, built from Block RAM FIFOs, has a total capacity of 2MByte, which occupies about 44 percent of the Block RAM resources in the FPGA. Together with the NVMe efficient core and the alternating operation scheme described in Section III, this data cache module guarantees the integrity of data during full-speed storage. The request processing module receives and caches user requests from the user application, and the channel selection phase in the Aurora module determines which data channel is valid to send to the host computer. The NVMe efficient core and the SPI controller host the NVMe SSDs and the QSPI Flash, respectively.
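The payload figure can be checked with a short calculation. The sketch below accounts only for the 8b/10b line coding (8 data bits carried in every 10 line bits) and ignores Aurora framing and clock-compensation overhead, so it gives an upper bound on the payload rate. The function names are assumptions introduced here.

```python
def payload_rate_gbytes(line_rate_gbps: float, data_bits: int = 8,
                        coded_bits: int = 10) -> float:
    """Upper bound on payload rate in GBps after line coding only."""
    return line_rate_gbps * data_bits / coded_bits / 8


def implied_total_bram_mbytes(used_mbytes: float, fraction: float) -> float:
    """Total Block RAM implied by a usage figure and its share."""
    return used_mbytes / fraction


aurora_upper_bound = payload_rate_gbytes(12.5)      # four-lane aggregate
bram_total = implied_total_bram_mbytes(2.0, 0.44)   # from 2MByte at ~44%
```

With the 12.5Gbps aggregate line rate the bound is 1.25GBps, which leaves headroom for protocol overhead above the 1GBps payload rate stated in the text. The 2MByte cache at about 44 percent implies roughly 4.5MByte of Block RAM in total.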
The logic resource utilization of the NVMe efficient core and the NVMe optical fiber SSD is shown in Table 3. Lower resource utilization means less power dissipation and lower temperature.
C. THERMAL MANAGEMENT AND CASE DESIGN
Thermal management directly influences the performance and lifetime of electronic devices. To ensure that the NVMe optical fiber SSD can operate over a wide range of ambient temperatures, the power consumption is evaluated first. Using the power analysis tool in Vivado, we can easily obtain the power consumption of the FPGA, as Fig. 6 shows. The other main power consumers are the NVMe SSDs (5W × 4) and the optical modules (1.5W × 3). Thanks to the alternating operation scheme, two NVMe SSDs are always shut down and the related GTX transceivers are idle. Taking into account the efficiency of the power supply scheme shown in Table 4, the total power consumption of the NVMe optical fiber SSD is 37.2W.
To ensure that all components work within an appropriate temperature range (−20 to 85 °C), a forced air-cooling fan is fixed on the portable case, as Fig. 8 shows. The required cooling airflow is calculated using the following formula:

$$\dot{V} = \frac{P}{c_{PL}\,\rho_L\,\Delta T}$$

where $\dot{V}$ is the required airflow, $P$ is the power to be dissipated, $\Delta T$ is the temperature difference between inlet and outlet, and the parameters $c_{PL}$ and $\rho_L$ are approximately 1010 J/(kg·K) and 1.29 kg/m$^3$, respectively. Normally the temperature difference between inlet and outlet easily reaches 10K. Substituting a power consumption of 40W into this formula, the required cooling airflow is 11.05 m$^3$/h. Thus, we chose a DC axial fan of the 622 series (air flow more than 21 m$^3$/h) for air-cooling [23].
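The airflow figure can be reproduced numerically. The sketch below assumes the standard heat-balance relation in which airflow equals dissipated power divided by the product of the specific heat capacity of air, the air density and the inlet-to-outlet temperature difference; the function name and argument names are introduced here for illustration.

```python
def required_airflow_m3_per_h(power_w: float, delta_t_k: float,
                              c_p: float = 1010.0,   # J/(kg*K), from the text
                              rho: float = 1.29      # kg/m^3, from the text
                              ) -> float:
    """Airflow needed to carry away power_w with a delta_t_k air temperature rise."""
    flow_m3_per_s = power_w / (c_p * rho * delta_t_k)
    return flow_m3_per_s * 3600.0   # convert m^3/s to m^3/h


airflow_40w = required_airflow_m3_per_h(40.0, 10.0)
airflow_37w = required_airflow_m3_per_h(37.2, 10.0)
```

For 40W and a 10K temperature difference this evaluates to about 11.05 m^3/h, matching the value in the text, and the 37.2W estimate from the power budget needs slightly less. Both are well below the 21 m^3/h rating of the selected fan.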
V. EXPERIMENTAL RESULTS
We conducted experiments to test the I/O bandwidth, IOPS and latency of the NVMe efficient core prototype. In the proposed efficient core, the data stream is forwarded directly to the NVMe SSD without any OS, which guarantees I/O consistency. The I/O bandwidth of the optical fiber SSD with the NVMe efficient core is measured, and the consistency of the I/O bandwidth is evaluated and analyzed. In addition, measured temperature curves indicate that our forced air-cooling system is effective. Finally, we summarize the advantages of the optical fiber SSD with the NVMe efficient core.
A. I/O BANDWIDTH, LATENCY AND IOPS OF NVME EFFICIENT CORE
The NVMe efficient core prototype operates at 125MHz, using the clock generated by the Xilinx PCIe hardcore. The PCIe Root Complex in the NVMe efficient core is generated by the Xilinx Integrated Block for PCIe in Gen2 X4 configuration, which strongly influences the performance of the NVMe efficient core. The Samsung 970 EVO SSD used for the performance experiments has an I/O bandwidth of 2.1GBps for sequential write and 3.5GBps for sequential read with PCIe Gen3 X4. Fig. 9 illustrates the I/O bandwidth of the NVMe efficient core, measured for reads and writes with different block sizes. The NVMe efficient core is stable at 1.2GBps for both sequential read and sequential write with a 128KB block size, nearly an order of magnitude higher than the baseline (142.0MBps for sequential read and 128.4MBps for sequential write).
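To see why the PCIe generation matters, the following sketch estimates raw link bandwidth from the per-lane transfer rate and line coding defined by the PCIe specifications (5 GT/s with 8b/10b for Gen2, 8 GT/s with 128b/130b for Gen3). It ignores packet and flow-control overhead, so real throughput is lower. The function name is an assumption introduced here; the measured figures are taken from the text.

```python
def pcie_link_gbytes(gt_per_s: float, lanes: int,
                     enc_num: int, enc_den: int) -> float:
    """Raw link bandwidth in GBps after line coding, one direction."""
    return gt_per_s * lanes * enc_num / enc_den / 8


gen2_x4 = pcie_link_gbytes(5.0, 4, 8, 10)      # link used by the prototype
gen3_x4 = pcie_link_gbytes(8.0, 4, 128, 130)   # link the SSD is rated for

measured = 1.2                                 # GBps, from the text
link_utilization = measured / gen2_x4
speedup_read = 1200.0 / 142.0                  # vs. baseline sequential read
speedup_write = 1200.0 / 128.4                 # vs. baseline sequential write
```

Under these assumptions the Gen2 X4 link tops out at 2GBps before protocol overhead, so the measured 1.2GBps uses about 60 percent of it, while a Gen3 X4 link would nearly double the ceiling. The measured rate is roughly 8.5 times the baseline read rate and 9.3 times the baseline write rate.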
The IOPS (I/O operations per second) of the storage system is related to its QD (queue depth) and its addressing speed. The IOPS of the NVMe efficient core shown in Fig. 10 is higher than that of the baseline, which means better random read and write performance.
Command processing latency is defined as the time from submission of a user request to its completion. Fig. 11 shows the latency, averaged over 60 runs, of read and write operations with different block sizes on the NVMe efficient core. Compared with the results in Table 2, the NVMe efficient core achieves up to 80% lower latency than the baseline system with a 128KB block size. This improvement comes from using the NVMe efficient core instead of embedded Linux OS with its I/O software stack.
For further performance improvement, the following methods can be applied to the hardware and logic. First, upgrading from PCIe generation 2 to PCIe generation 3 would increase the throughput of the data bus. Second, the queue depth of the NVMe efficient core can be enlarged to take advantage of the high IOPS of the NVMe SSD. Third, logic timing optimization is needed to improve the utilization of the AXI-Stream data bus between the NVMe efficient core and the PCIe hardcore.
B. I/O PERFORMANCE OF NVME OPTICAL FIBER SSD
The I/O bandwidth of the NVMe optical fiber SSD is additionally limited by the throughput of the optical modules. The sequential read and write I/O bandwidth of each channel of the optical fiber SSD can reach 1GBps with a 1MByte data cache FIFO inside the FPGA. By monitoring the Full signal of the data cache FIFO in front of the NVMe efficient core, we can judge whether the write I/O bandwidth is stable for a data source of a given speed. Fig. 12 shows the consistency of I/O bandwidth at a sequential write speed of 1GBps. Without the alternating operation scheme, the Full signal is asserted several times over the 2TB capacity of a single NVMe SSD, in line with the FTL refresh rule. With the alternating scheme, the Full signal stays low over the total 4TB capacity of one channel with two NVMe SSDs.
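A back-of-the-envelope check shows why an FTL refresh can assert the Full signal. The sketch below assumes the FIFO is empty when the SSD stops accepting data, that 1MByte means 2^20 bytes and that 1GBps means 10^9 bytes per second. The function names are introduced here; the 1 to 3 millisecond stall range comes from Section III-C.

```python
def stall_tolerance_ms(fifo_bytes: int, ingest_bytes_per_s: float) -> float:
    """Longest stall an initially empty FIFO can absorb, in milliseconds."""
    return fifo_bytes / ingest_bytes_per_s * 1e3


def bytes_arriving(stall_ms: float, ingest_bytes_per_s: float) -> float:
    """Data that arrives from the source while the SSD is stalled."""
    return stall_ms * 1e-3 * ingest_bytes_per_s


FIFO_BYTES = 1 << 20          # 1MByte per channel (assumed binary)
RATE = 1e9                    # 1GBps source rate (assumed decimal)

tolerance = stall_tolerance_ms(FIFO_BYTES, RATE)
worst_case = bytes_arriving(3.0, RATE)   # upper end of the FTL stall range
```

Under these assumptions the FIFO covers a stall of only about 1.05ms, while a 3ms stall brings in about 3MByte, roughly three times the FIFO capacity. This is consistent with the Full signal being asserted when a single SSD is used and with the need to keep FTL flushes off the active write path.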
C. TEMPERATURE OF NVME OPTICAL FIBER SSD
The temperature of the FPGA and NVMe SSDs influences both their speed and lifetime. Fig. 13 shows the thermal map of the printed circuit board obtained with the FLIR ONE PRO LT thermal imaging camera. The left is the top side view and the right is the bottom side view, each rotated 90 degrees to the right relative to Fig. 5. This thermal map indicates that the active, online NVMe SSDs and the FPGA are the main heat sources of the optical fiber SSD. Figs. 14 and 15 show the temperature curves of the FPGA and NVMe SSDs with natural convection and with the forced air-cooling system, respectively. Under natural convection, the temperature of the NVMe SSDs easily exceeds 75 °C, which triggers the temperature protection and leads to speed reduction [24]. The forced air-cooling system keeps the temperature of the NVMe SSDs below 65 °C. The temperature of the NVMe SSDs rises and falls repeatedly within an appropriate range, which is a benefit of the alternating operation scheme. The difference in average temperature between SSD1 and SSD4 is about 10 °C, because the SSDs are located in different positions on the PCB.
D. DISCUSSION OF THE NVME OPTICAL FIBER SSD
The NVMe optical fiber SSD has two independent data channels, and each channel can write and read at up to 1GBps over its capacity of 4TB, meeting the data storage requirements of high-speed acquisition systems. With the hybrid scheme fusing alternating operation and consecutive command submission, no extra SDRAM is needed on the storage card, which simplifies the logic framework and the hardware scheme. The device has a volume of only 990 cubic centimeters and weighs just 2.2 pounds, making it portable and suitable for confined spaces. In addition, the device can work over a wide ambient temperature range, thanks to the forced air-cooling system and the alternating operation scheme.
VI. CONCLUSION
In this paper, we introduced the FPGA-based NVMe efficient core to improve the I/O bandwidth and reduce the latency caused by the OS storage I/O software stack, and we proposed a hybrid scheme fusing alternating operation and consecutive command submission to guarantee the consistency of I/O bandwidth. We then designed a portable optical fiber SSD, covering both firmware and hardware. Evaluation results demonstrate that the continuous I/O bandwidth of each of the two independent channels is above 1GBps over its capacity of 4TB, and that the NVMe efficient core achieves up to 88% lower latency than the embedded OS-based system. In the future, we plan to implement a file system such as FAT or Ext on the FPGA to improve the flexibility of data management.
