FPGA Implementation of RDMA-Based Data Acquisition System Over 100-Gb Ethernet
Wassim Mansour, Member, IEEE, Nicolas Janvier, Member, IEEE, and Pablo Fajardo
Abstract-This paper presents a remote direct memory access (RDMA) over Ethernet protocol used for data acquisition systems, currently under development at the European Synchrotron Radiation Facility. The protocol is implemented on Xilinx Ultrascale+ field-programmable gate arrays (FPGAs) using the 100G hard media access controller (MAC) IP core. The proposed protocol is compared on equal terms with the well-known RDMA over converged Ethernet (RoCE-V2) protocol using a commercial network adapter from Mellanox. The obtained results show the superiority of the proposed protocol over RoCE-V2 in terms of data throughput. Performance tests on the 100G link show that it can reach a stable throughput of 90 Gb/s for packet sizes of at least 1 kB and 95 Gb/s for packet sizes of 32 kB and above.
Index Terms-100-Gb Ethernet, data acquisition, field-programmable gate array (FPGA), Infiniband, remote direct memory access (RDMA) over converged Ethernet (RoCE).
I. INTRODUCTION
The progress in manufacturing technologies and processes results in a significant increase in the data rates produced by modern and upcoming 2-D X-ray detectors. Such data streams are challenging to transfer, manipulate, and process in an acceptable time.
A generic and scalable data acquisition framework, called remote direct memory access (RDMA)-based acquisition system for high-performance applications (RASHPA), is currently under development at the European Synchrotron Radiation Facility (ESRF). It will be integrated in the next generation of high-performance X-ray detectors [1] .
One of the key and specific features of this new framework is the use of RDMA for fast data transfer. RDMA consists of the transfer of data from the memory of one host or device into that of another one without any CPU intervention. This allows high-throughput, low-latency networking. Companies are investing more and more in this feature, widely used in high-performance computing, by integrating it into their network cards and communication adapters. Some of the available technical solutions are Infiniband [2], RDMA over converged Ethernet (RoCE) [3], and Internet wide area RDMA protocol (iWARP) [4]. The RASHPA framework has been prototyped and its concept proven in [1], where the selected data link was peripheral component interconnect express over cable (PCIe over cable) [5]. Despite the benefits of this link, of which native RDMA support is the most important, it presents major limitations: a small transfer packet size, limited availability of PCIe over cable commercial off-the-shelf products such as switches and adapters, and the lack of a standardized optical cabling form [6].
The need to switch to a more standard networking scheme leads us to the RDMA over 100G Ethernet solution. RoCE and iWARP are two RDMA protocols that use the Ethernet link layer. RoCE is a protocol developed by Mellanox, and it is based on the Infiniband specifications. It exists in two versions: the first one is a link-layer (Layer-2) protocol with EtherType 0x8915, whereas the second one, called RRoCE (routable RoCE), is a Layer-3 protocol over user datagram protocol (UDP)/internet protocol (IP), with the Infiniband header inserted in the UDP data field. iWARP is another widely used protocol, providing RDMA over transmission control protocol (TCP)/IP, and is supported by Chelsio. A comparison between both protocols as seen from the side of Mellanox and Chelsio is presented in [7] and [8].
Both iWARP and RRoCE are heavy to implement on a field-programmable gate array (FPGA) in terms of hardware resources as well as latency requirements. iWARP requires a TCP/IP stack, so it is discarded from the work performed in this paper, and only RRoCE in its simplest and fastest mode, called unreliable datagram (UD), is investigated.
The main objective of the work presented in this paper is to implement a dedicated data transfer interface over the Ethernet UDP protocol together with a direct memory access (DMA) over PCIe engine. The implementation of an ESRF RDMA over 100-Gb Ethernet solution is detailed. Two implementations are considered: a front end (detector transmitter side) and a back end (computer receiver side). The front-end design is integrated within the RASHPA controller logic, whereas the back-end one is plugged into a PCIe slot of the back-end computer intended to receive detector data. This paper is organized as follows. Section II briefly introduces the concept of RASHPA. Section III provides a background and discusses the FPGA implementation challenges of the RRoCE protocol. Section IV details the proposed RDMA over Ethernet protocol. In Section V, experimental results as well as a comparison between the proposed RDMA protocol and RRoCE are presented. Conclusions and future perspectives are discussed in Section VI.
0018-9499 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. 
II. RASHPA CONCEPT
RASHPA allows detectors to push data [images, regions of interest (ROI), metadata, and events] produced by 2-D X-ray detectors directly into one or more back-end computers. RASHPA's main properties are its scalability, flexibility, and high performance. It is based on the RDMA mechanism to provide the maximum possible bandwidth per link. It is also intended to have an adjustable bandwidth that can be compatible with any back-end computer. Fig. 1 shows a block diagram of different RASHPA network schemes, where data (such as images, regions of interest, metadata, or events) are sent from a RASHPA-capable detector to a back-end computer. This transfer is performed by means of one or several data links and can be point to point or over a routable network.
Within the RASHPA framework, we can consider two types of back-end computers. The first one is called the system manager (SM), and it is responsible for the initialization and configuration of the whole system. The second type is called data receiver (DR), which is intended to receive the detector data in its local memory buffers.
The usual data destinations are random access memory (RAM) buffers. Other possible destinations that are currently under investigation at ESRF are graphics processing units (GPUs), coprocessors, and disk controllers.
From a hardware point of view, the RASHPA controller consists of specific logic interfacing the detector readout electronics as well as a set of hardware blocks handling data transmission. These blocks are known as channels. Two types of configurable channels can be identified in RASHPA: data and event channels.
Data channels are responsible for transferring detector data to a preconfigured address space within one or several data receivers. Multiple data channel instances can be implemented in a single RASHPA controller.
An event channel is responsible for informing the data receiver or system manager about any event occurring in the overall system. Typical events are errors, end of transmission conditions, and source memory overflow. Only one event channel is required to be implemented in a full RASHPA system.
RASHPA is independent of the data link used for transmission; however, a requirement that should be respected by the selected data link is the RDMA support feature.
III. RDMA OVER CONVERGED ETHERNET (ROCE)
Ethernet is a computer networking protocol introduced in 1983 and standardized as IEEE 802.3 [9] . It divides the data stream into shorter pieces called frames. Each frame contains source and destination media access controller (MAC) addresses, Ethernet type, data, and error-checking code for the frame data.
The Ethernet-type field specifies which protocol is encapsulated in the frame. IP is one of these protocols and corresponds to layer 3 of the open systems interconnection (OSI) model. UDP is one of the essential transport protocols carried over IP. A UDP frame consists of several fields in addition to the Ethernet header and the IP header: source port, destination port, length, checksum, and payload data.
RoCE [10] is an Ethernet protocol based on the Infiniband specification [11] , and available in two different versions: RoCE-V1 and RoCE-V2 or RRoCE. RoCE-V1 is an Ethernet layer nonroutable protocol, whereas the routable version RRoCE is the most interesting for RASHPA's implementation.
RRoCE is an RDMA-capable, layer-3 network protocol based on UDP/IPv4 or UDP/IPv6, and relying on congestion control and lossless Ethernet. It is currently supported by several off-the-shelf network adapters as well as the latest Linux kernel drivers.
The UDP payload data of an RRoCE protocol, illustrated in Fig. 2 [12] , contains an Infiniband header, the actual data payload in addition to an invariant cyclic redundancy check (iCRC) field that is mandatory for the RoCE packets in order to be accepted by the network adapter. The iCRC field is retained from the Infiniband specifications. Fig. 3 [2] shows the iCRC32 calculation algorithm. Note that an Ethernet frame does also contain another CRC field for the global packet.
The calculation of the iCRC for RoCE-V2/IPv4 is performed in the following steps.
1) Extract RoCE-v2: IP + UDP + InfiniBand.
2) Add a dummy LRH field, 64 bits of 1's. This field is present in the Infiniband specifications, so in order to have a correct CRC calculation, one has to include its dummy bits.
3) For RoCE-v2 over IPv4, time to live = 1's, header checksum = 1's, and type of service (DSCP and ECN) = 1's.
4) UDP checksum = 1's.
5) Resv8a field of the Infiniband protocol = 1's.
6) The CRC calculation is based on the crc32 used for Ethernet networking, 0x04C11DB7.
7) The CRC calculation is done over the UDP frame starting from the most significant bit of the most significant byte.
8) Inversion and byte swap have to be applied in order to get the invariant CRC to be integrated in the RRoCE frame.
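The field-masking steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the FPGA implementation: the function name `rocev2_icrc` is hypothetical, the IPv4 header is assumed to carry no options (20 bytes), and `zlib.crc32` (the bit-reflected form of the Ethernet CRC-32 with polynomial 0x04C11DB7, including the initial and final inversion) stands in for steps 6) to 8). The byte order in which the result is placed in the frame should be checked against the Infiniband specification and a real network adapter.

```python
import zlib

IPV4_LEN, UDP_LEN = 20, 8  # assumption: IPv4 header without options


def rocev2_icrc(packet: bytes) -> int:
    """CRC-32 over IP + UDP + Infiniband header + payload (iCRC excluded),
    with the variant fields forced to 1's as in steps 1) to 5)."""
    buf = bytearray(packet)
    buf[1] = 0xFF                                   # type of service (DSCP and ECN)
    buf[8] = 0xFF                                   # time to live
    buf[10:12] = b"\xff\xff"                        # IPv4 header checksum
    buf[IPV4_LEN + 6:IPV4_LEN + 8] = b"\xff\xff"    # UDP checksum
    buf[IPV4_LEN + UDP_LEN + 4] = 0xFF              # Resv8a field of the Infiniband header
    dummy_lrh = b"\xff" * 8                         # dummy LRH, 64 bits of 1's
    return zlib.crc32(dummy_lrh + bytes(buf))
```

Because the masked fields are overwritten before the CRC is computed, two packets that differ only in time to live or checksums yield the same value, which is what makes the CRC invariant along a routed path.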
A first FPGA implementation trial of RRoCE has been performed using the UD mode [3]. In this mode, data are sent in streams without any acknowledgment from the receiver side. The target FPGA board was the KCU116 by Xilinx [13], and the target network adapter was a Mellanox ConnectX-4 (MCX415A-CCAT) board. It is important to note that in Ultrascale+ families, the 100G CMAC IP core is a hard IP with a local bus (LBUS) input-output interface, which has to be converted into an AXI stream bus to be integrated in system-on-chip designs.
In fact, the basic challenge in the FPGA implementation of the UD-only RRoCE algorithm is the optimal implementation of the iCRC algorithm. Fig. 4 depicts the timing diagram of the input stream data used for the iCRC calculation. Data of 64 bytes are streamed at each 3.125-ns clock cycle, except for the last cycle, which may contain partial data and requires multiplexing via the AXI stream "tkeep" signal for byte selection.
The iCRC design requires 64 clock cycles in order to calculate the iCRC over a 64-byte input. After 64 clock cycles, the design is allowed to continue the calculation over the second 64-byte input. This means that 200 ns are lost for each calculation over 64 bytes of data. Supposing that the transmitter sends 12.5 GB (100 Gbits) of data, which will theoretically take 1 s to be transferred over a 100-Gb/s Ethernet link, the actual theoretical transfer delay caused by the iCRC calculation will be 42 ms, that is, 4.2%.
IV. PROPOSED RDMA OVER ETHERNET PROTOCOL
RRoCE is a well-developed commercial protocol supported by the ib-verbs library available in the latest Linux kernels. However, one can go even faster in data transfer by avoiding the iCRC calculation and the overhead of the Infiniband header. In addition, controllability and observability of an in-house developed protocol are a major advantage of an ESRF RDMA over Ethernet protocol over RoCE.
The RDMA over Ethernet protocol proposed in this paper mainly uses the UDP/IP protocol for routability and carries information about each transfer in the otherwise unused source and destination port fields of the UDP header.
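As an illustration of carrying transfer information in the UDP port fields, the following Python sketch packs and unpacks an 8-byte UDP header. The mapping used here (source port holding a buffer ID, destination port holding a packet sequence number) and the function names are hypothetical; the paper does not specify the actual encoding.

```python
import struct

UDP_HEADER = struct.Struct("!HHHH")  # source port, destination port, length, checksum


def build_udp_header(buffer_id: int, sequence: int, payload_len: int) -> bytes:
    """Hypothetical encoding: source port = buffer ID, destination port = sequence number."""
    length = UDP_HEADER.size + payload_len       # UDP length covers header + payload
    checksum = 0                                 # 0 means "no checksum" for UDP over IPv4
    return UDP_HEADER.pack(buffer_id & 0xFFFF, sequence & 0xFFFF, length, checksum)


def parse_udp_header(header: bytes):
    """Recover (buffer_id, sequence, payload_len) on the receiver side."""
    buffer_id, sequence, length, _checksum = UDP_HEADER.unpack(header[:UDP_HEADER.size])
    return buffer_id, sequence, length - UDP_HEADER.size
```

For example, `parse_udp_header(build_udp_header(3, 512, 1024))` gives back the buffer ID, sequence number, and payload length that the receiver would need to select a destination buffer.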
The proposed protocol relies on the interactions of three major actors. The first one is the RASHPA controller on the X-ray detector front-end side which is the data transmitter. The second one is the FPGA board acting as a data receiver, which will transform UDP packets coming from the transmitter into PCIe DMA-based packets. These packets are sent to dedicated buffers on the data receiver computer which is the third actor in the system. Fig. 5 illustrates the architecture of the overall system.
A software library called LIBRASHPA, installed on the data receiver side, helps allocate memory buffers of different sizes to be used as final data destinations. These buffers are identified by an identification number (ID), a size, and the IP address of the data receiver, as depicted in Fig. 6. The RASHPA controller, which is the transmitter, only needs to know these three parameters, whereas the receiver FPGA board should store the real physical address of the allocated buffers for address translation.
Fig. 7 shows the FPGA implementation of the Ethernet transmitter side using the Xilinx 100G CMAC IP. Data streams coming from the detector are stored in a double data rate (DDR4) memory. Whenever a full image is written to the DDR4, the RASHPA controller configures a DMA IP, allowing it to read the data via an AXI4 interconnect and send it as a stream of data (AXI stream bus) to the header insertion IP. The header insertion IP gets its configuration from the RASHPA controller. In fact, the configuration of the header insertion unit is nothing but the UDP header and the destination local buffer, represented by the identification parameters stored at the initialization phase in an internal block RAM (BRAM). The constituted header is concatenated with the data stream coming from the DMA. Since the CMAC IP has an LBUS input-output interface, an AXI stream to LBUS bridge has been implemented and used as an intermediate stage between the header insertion unit and the CMAC IP. The configuration of the whole process can be done using the same Ethernet link or via an external link such as 1-Gb Ethernet or PCIe over cable.
At the receiver side, shown in Fig. 8, the CMAC LBUS output data are bridged to an AXI stream interface before being analyzed in order to resolve the physical address of the final destination buffer. During the initialization phase, LIBRASHPA should store the physical address of each local buffer in a BRAM inside the receiver's FPGA. The output data of the header analyzer unit can be stored in a DDR4 or FIFO for synchronization, and then sent to the PCI express endpoint for DMA transfer to the final destination. The whole process is controlled by a finite state machine implemented in the driver IP.
To guarantee that no packets are lost, one can use a converged network; however, if packets are lost, the data receiver should be informed. For that, a simple packet loss detection algorithm has been implemented. It consists of a 1024-bit shift register in which each bit represents one packet, identified by its sequence number. When the packet with sequence number "512" is received, the receiver checks packet "1"; if it is missing, the receiver generates an event to inform the data receiver. The same process repeats for each received packet, which means that the receiver can identify a lost packet after 512 received packets. The process is illustrated in Fig. 9.
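The loss detection scheme can be modeled in software as follows. This is a minimal Python sketch, not the hardware design: the class name `LossDetector` is hypothetical, the 1024-bit register is emulated by a circular bitmap indexed by sequence number modulo 1024, and the check lag of 511 is inferred from the statement that receiving packet "512" triggers the check of packet "1."

```python
class LossDetector:
    """Software model of the 1024-bit packet loss detector (assumed behavior)."""

    def __init__(self, window: int = 1024, lag: int = 511, first_seq: int = 1):
        self.window = window            # number of bits in the register
        self.lag = lag                  # packet n triggers the check of packet n - lag
        self.first_seq = first_seq      # sequence number of the first packet
        self.seen = [False] * window    # one bit per in-flight sequence number
        self.lost = []                  # sequence numbers reported as lost

    def receive(self, seq: int) -> None:
        # Mark the incoming packet as received.
        self.seen[seq % self.window] = True
        # Check the packet that should have arrived `lag` packets earlier.
        old = seq - self.lag
        if old >= self.first_seq:
            slot = old % self.window
            if not self.seen[slot]:
                self.lost.append(old)   # in hardware, this would raise an event
            self.seen[slot] = False     # free the bit for a later sequence number


detector = LossDetector()
for seq in range(1, 601):
    if seq != 5:                        # simulate the loss of packet 5
        detector.receive(seq)
print(detector.lost)
```

In this model, a packet is only checked when the packet 511 positions later actually arrives, so the end of a transfer would need separate handling, for instance through the end-of-transmission event.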
V. EXPERIMENTAL RESULTS
The implementation of the proposed prototype as well as RoCE-V2, at the transmitter side, targets a Xilinx FPGA development board (KCU116). The board is based on an XCKU5P FPGA of the Kintex Ultrascale+ family. In the case of the proposed prototype, the receiver implementation targets an industrial board called XpressVUP developed by Reflexces [14]. It is based on an XCVU9P Virtex Ultrascale+ FPGA with an integrated Gen3x16 PCIe endpoint. The PCIe endpoint is comparable to the one integrated in the Mellanox network adapter card, MCX415A-CCAT, used as the RoCE-V2 back end. A UDP stack has been implemented on the transmitter FPGA, allowing the RASHPA controller to construct frames of data and the back end to read these packets and analyze them before transforming them into DMA configurations. Postroute results of the front-end (transmitter) FPGA implementation show that the design occupies around 50% of the total configurable logic blocks and 21% of the BRAM of the selected XCKU5P FPGA.
To confirm the correctness of the constructed packets and to test the transfer bandwidth, the Mellanox NIC was used together with the Wireshark software on a PC running a Debian Linux distribution.
The experiments confirmed that correct UDP packets were built; however, the UDP receive buffer was overloaded when measuring the UDP bandwidth because the high transfer rate did not leave time to empty it. Hardware RoCE-V2 as well as soft-RoCE were also tested between two Mellanox boards running at 100 Gb/s.
In order to provide a fair comparison of the transfer throughput of both protocols, one should exclude the CRC implementation because it would severely affect the transfer rate.
First of all, in order to have an idea about the transfer rate one could achieve with the 100G link itself, FPGA-to-FPGA UDP transfers were selected. Different configurations of the MAC IP, including packet sizes and the number of packets to send, were selected. Fig. 10 illustrates the obtained results and shows that the 100G transfer can reach a rate of 90 Gb/s for a minimum packet size of 1 kB and becomes stable at 95 Gb/s for packet sizes of 32 kB and above. Small packet sizes significantly decrease the throughput.
In Fig. 10, several cases were taken into consideration depending on how the data are sent via the LBUS interface of the 100G IP. Since the LBUS is composed of 4 × 16-byte buses, the best case is when all four buses are filled with 16 bytes of data, whereas the worst case is when only BUS-0 is filled with only 1 byte, i.e., there is a loss of 63 bytes in the last transfer cycle. The normal case is considered to be a loss of 32 bytes on average in each transfer. Finally, we consider the case where the packet ends and the next packet starts in the same cycle, which leads to a maximum loss of 15 bytes in the worst-case scenario.
The throughput comparison between RoCE and the proposed algorithm was based on preconstructed data packets of 598 bytes. This size was selected so that the calculated iCRC-32 could be verified against a software-calculated one provided by web-based tools [15]. The same configuration was adopted for both algorithms, where a computer was used to configure the DMA on the transmitter side for each transfer. Note that this is not the optimal throughput to measure because of the CPU interaction at each packet. The measurement is done by CPU polling of the data at a specific address; once the value is "0xdeadbeaf," the CPU stops a timer and calculates the bandwidth. Table I presents the measured bandwidth for both algorithms using the adopted strategy.
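The LBUS padding cases described above can be made concrete with a short Python sketch. The function names and the `shared_cycle` parameter are hypothetical, and the model only accounts for unused bytes in the last bus cycle of a packet; it ignores Ethernet preamble, interframe gap, and protocol headers, so it is not a model of the measured throughput in Fig. 10.

```python
import math

SEGMENTS = 4                              # the LBUS is composed of four buses
SEGMENT_BYTES = 16                        # each bus carries 16 bytes
BUS_BYTES = SEGMENTS * SEGMENT_BYTES      # 64 bytes per clock cycle


def wasted_bytes(packet_len: int, shared_cycle: bool = False) -> int:
    """Unused bus bytes in the last cycle of a packet.

    shared_cycle=True models the case where the next packet starts in the
    same cycle, so only the partially filled 16-byte bus is lost."""
    granule = SEGMENT_BYTES if shared_cycle else BUS_BYTES
    return math.ceil(packet_len / granule) * granule - packet_len


def bus_efficiency(packet_len: int, shared_cycle: bool = False) -> float:
    """Fraction of the occupied bus bytes that carry packet data."""
    return packet_len / (packet_len + wasted_bytes(packet_len, shared_cycle))
```

With this model, a 65-byte packet wastes 63 bytes when each packet occupies whole cycles and 15 bytes when the next packet can start in the same cycle, which matches the two worst cases quoted above; a packet whose length is a multiple of 64 bytes wastes nothing.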
The results show that the proposed algorithm is more than 1.5 times faster than the RoCE-V2 protocol considering that the iCRC is precalculated and only the link is tested together with the receiver side, i.e., the Mellanox network adapter versus the FPGA implementation of the suggested protocol. Both receivers are connected via PCIe x16 lanes.
It is important to note that while performing these end-to-end tests, either from one FPGA to another or from an FPGA to the Mellanox board, no lost packets were detected.
VI. CONCLUSION
This paper presented a dedicated data transfer protocol based on RDMA over Ethernet. The protocol is intended to be used in the next detector generations that are under development at the ESRF. The implementation was realized on a Xilinx KCU116 development board and compared with the widely used commercial protocol RoCE-V2 implemented on the same FPGA board and wired to a Mellanox ConnectX-4 network adapter board.
Comparison results show the superiority in terms of data throughput of the proposed protocol with respect to RRoCE even when excluding the iCRC calculation.
Future development will focus on the integration of both the proposed protocol and RoCE together in the RASHPA framework. Selection between these protocols will be based on the price/throughput requirements of each detector application.
Testing the protocol over a routable network of detectors/back-end computers is the next goal of the project.
