Introduction
The ever-increasing demand for cloud services and Big Data drives a constant increase in data centre size and complexity. Most current data centre network (DCN) [1] architectures follow a multi-tier, fat-tree hierarchy built on commodity devices and equipment. One of the challenging issues when scaling out a data centre is its network infrastructure. There is considerable interest and effort in improving data centre network architectures [2, 3]; however, these efforts lack the modularity and flexibility to deliver the performance required by disaggregated data centres [4, 5]. Innovative data-intensive applications require compute and memory/storage resources to be shared among a large number of servers. The disaggregation of server resources enables fine-grained resource provisioning that can be exploited by High Performance Computing (HPC) and other applications. Furthermore, recent research on cloud data centres shows that over 80% of the traffic originated by servers stays within the rack [6]. Thus, for DCNs such as cloud data centres, special focus is needed on improving intra-rack communication performance.
In this paper, for the first time, we propose an FPGA-based optical programmable switch and interface card (SIC) which could eventually replace the traditional network interface card (NIC). It plugs directly into the server and enables intense intra-rack blade-to-blade communication. We report on the design and implementation of the FPGA-based optical programmable SIC. Using the SIC, we can enable a flat and scalable all-optical data centre inter- and intra-cluster architecture. The features and functions of the SIC enable direct intra-rack blade-to-blade interconnection and eliminate the electronic devices in the Top of Rack (ToR) switch, thus minimizing intra-rack latency, while the SIC can also be used as an optical NIC, moving data between the blades and the ToR. The SIC can also aggregate optical circuit switching (OCS) or optical packet switching (OPS) traffic and perform OCS-to-OPS and OPS-to-OCS conversion. Moreover, the SIC has an OPS/OCS switch function that allows it to be used as an OPS/OCS hop. We experimentally demonstrate a back-to-back testbed for the FPGA-based optical programmable SIC; the measurement results show low intra-rack blade-to-blade latency (416 ns).
DCN inter- and intra-cluster architecture with FPGA-based optical programmable SIC
The proposed FPGA-based optical programmable SIC enables the DCN inter- and intra-cluster architecture shown in Fig. 1. [...] data flows with ultra-low latency, and OPS can offer flexible bandwidth capacity for each optical link when facing dynamic and unpredictable traffic demands with either short- or long-lived data flows. The intra-cluster architecture is shown in Fig. 1b. The blades in a rack are directly connected to each other in an all-to-all fashion, and each blade can communicate with the optical ToR switch, which could be a wavelength selective switch (WSS), an arrayed waveguide grating (AWG) or a routing AWG (R-AWG), through a direct optical link of the FPGA-based optical programmable SIC. For the intra-cluster DCN architecture, an architecture on demand (AoD) [7, 8] node interconnects all the input and output ports of the different ToRs through the OCS and OPS modules, as well as traffic from/to other clusters, benefiting from the flexibility and on-demand programmability of AoD. For the inter-cluster configuration, as shown in Fig. 1a, a group of clusters is interconnected by an inter-cluster AoD. To support the control mechanism of the software-defined networking (SDN) framework, each switching node and each FPGA-based optical programmable SIC implements an OpenFlow agent that bridges the control plane with the data-plane optical devices.
FPGA-based optical programmable SIC design and implementation
The FPGA-based optical programmable SIC is designed and implemented to replace the traditional NIC. It plugs into the server through a Peripheral Component Interconnect Express (PCIe) socket and supports both intra-rack blade-to-blade communication and communication between a blade and the optical ToR switch, with a view to achieving high-performance intra-rack communication that can evolve to inter-rack communication. The HiTech Global HTG-V6HXT-X16PCIE board was used for prototyping; it features a Xilinx HX380T FPGA, SFP+ interfaces and a Gen2 PCIe x8 interface.
As shown in Fig. 2a, besides the functionality of a traditional NIC, such as reading/writing data from/to the blade and sending/receiving protocol traffic, the SIC is also capable of sending/receiving hybrid OPS/OCS traffic and of acting as an OCS switch, an OPS switch, and an OCS/OPS, OPS-to-OCS and OCS-to-OPS aggregation interface. The FPGA-based design and implementation, illustrated in Fig. 2b, includes a server interface, an SDN agent interface, and inter-rack and intra-rack interfaces.
For the PCIe interface with the blade, the FPGA-based optical programmable SIC communicates with the blade through eight lanes of Gen2 PCIe. The design employs a Direct Memory Access (DMA) engine to efficiently copy data between the blade (i.e. its memory DIMMs) and the FPGA on-chip RAM. For the interface with the control plane, the SDN agent sends commands encapsulated in Ethernet frames via a 10Gbps interface. Using the same interface and method, the FPGA-based optical programmable SIC sends feedback on its status back to the SDN agent. When it receives an Ethernet frame from the SDN agent, the SIC updates its Look Up Table (LUT) with the commands, and the FPGA-based functional blocks follow the commands in the LUT to carry out the requested functions.
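The command path described above can be pictured with a minimal software sketch. The paper does not specify the command encoding, so the three-byte payload layout, the mode codes and the names `Mode`, `COMMAND` and `apply_command` below are all hypothetical and serve only to illustrate an LUT being updated from a received command and a status record being returned.

```python
import struct
from enum import IntEnum


class Mode(IntEnum):
    # Hypothetical mode codes; the paper does not define an encoding.
    TO_BLADE = 0     # terminate traffic and hand it to the blade
    OCS_SWITCH = 1   # forward as a circuit-switched hop
    OPS_SWITCH = 2   # forward as a packet-switched hop


# Assumed command payload: input port, mode, output port (one byte each).
COMMAND = struct.Struct("!BBB")


def apply_command(lut: dict, payload: bytes) -> dict:
    """Update the look-up table from one command payload and return a status record."""
    in_port, mode, out_port = COMMAND.unpack(payload[:COMMAND.size])
    lut[in_port] = (Mode(mode), out_port)
    # The status record stands in for the feedback sent back to the SDN agent.
    return {"port": in_port, "mode": Mode(mode).name, "out_port": out_port}


lut = {}
status = apply_command(lut, COMMAND.pack(4, Mode.OCS_SWITCH, 3))
```

In this sketch the functional blocks would consult `lut[in_port]` for each arriving frame; in the actual SIC this lookup is performed in FPGA logic rather than in software.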
Two 10Gbps links are implemented for hybrid OPS/OCS inter-rack communication. Based on the LUT, traffic can be sent/received as OPS or OCS for inter-rack communication. When the SIC is used as an OPS/OCS switch, received traffic is directed to the corresponding port without being processed or moved back to the blade.
Two further 10Gbps links implement the OCS intra-rack communication interface, which enables intra-rack blade-to-blade communication. As with the inter-rack interface, when the SIC is used as an OCS switch, traffic can be forwarded directly to other OCS interfaces without returning to the blade.
We designed and implemented a cut-through option for port4, which eliminates all the store-forward delays in multiple FIFOs to deliver an ultra-low latency service for communication between disaggregated memory and processing blades.
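To see why removing store-forward buffering matters, the following sketch estimates the delay that each store-forward FIFO adds, assuming a FIFO must receive a complete frame before releasing it. The function names and the stage count of three are illustrative assumptions; the text only states that there are multiple FIFOs.

```python
def serialization_delay_ns(frame_bytes: int, line_rate_gbps: float) -> float:
    """Time to clock a whole frame through a port, in nanoseconds."""
    return frame_bytes * 8 / line_rate_gbps  # bits / (Gbit/s) gives ns


def store_forward_delay_ns(frame_bytes: int, line_rate_gbps: float,
                           fifo_stages: int) -> float:
    """Each store-forward FIFO waits for the full frame before forwarding it."""
    return fifo_stages * serialization_delay_ns(frame_bytes, line_rate_gbps)


per_stage_64 = serialization_delay_ns(64, 10)       # 64B frame on a 10Gbps link
per_stage_1518 = serialization_delay_ns(1518, 10)   # 1518B frame on a 10Gbps link
three_stage_64 = store_forward_delay_ns(64, 10, 3)  # assumed three FIFO stages
```

Under these assumptions a 64B frame costs about 51 ns per store-forward stage and a 1518B frame about 1.2 µs per stage, a cost that a cut-through path largely avoids because forwarding starts before the whole frame has arrived.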
FPGA-based optical programmable SIC testbed setup and experimental result
We tested the performance of the FPGA-based optical programmable SIC using the testbed shown in Fig. 3. The SIC is connected to the PCIe Gen2 socket of a Dell PowerEdge 710 server through a PCIe extension cable. We used a Polatis 192x192 fibre switch as the AoD optical backplane. The SDN agent, hosted in a server, is connected to the controller agent interface of the SIC through an SFP+ interface.
A traffic generator (Anritsu MD1230B) was used to generate Ethernet traffic and feed it to OCS port4 of SIC1. Port4 was set to cut-through mode. A traffic analyser was used to collect the results from output port4 of SIC2.
When the FPGA-based optical programmable SIC receives Ethernet traffic, it processes the data and triggers the DMA engine to move the processed data through PCIe to the server RAM. Once the RAM has received a full block of data, the DMA engine initiates transmission from the server RAM and reads the data back to the FPGA. The SIC then processes the data and transmits it.
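Because each block is first written to RAM and only then read back, the write and read phases do not overlap. A simple way to reason about the resulting throughput ceiling is sketched below. The model and the name `half_duplex_throughput_gbps` are assumptions made for illustration, and the 8 Gbps rates are placeholder values, not measurements from this work.

```python
def half_duplex_throughput_gbps(write_gbps: float, read_gbps: float) -> float:
    """Sustained rate when every block is written to RAM and then read back,
    with the two phases never running at the same time."""
    # Time per bit is the sum of the write time and the read time.
    return 1.0 / (1.0 / write_gbps + 1.0 / read_gbps)


# Placeholder rates: equal write and read rates halve the sustained throughput.
example = half_duplex_throughput_gbps(8.0, 8.0)
```

With equal write and read rates the sustained throughput in this model is half of either rate, which is consistent with a write-then-read sequence limiting the achievable maximum.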
In this experiment, we measured the maximum throughput and the latency. The maximum throughput measurement result is shown in Fig. 4a. For the throughput measurement, as described above, data was written to RAM and, after the block of RAM was filled, read back. The maximum throughput is limited by this non-duplex transmission. We measured latency in cut-through mode, since a cut-through FIFO (compared with a store-forward FIFO) helps minimize the latency. Fig. 4b shows the latency of 3Gbps 64B traffic broken down into DMA/PCIe latency, FPGA logic latency, FPGA PHY latency and optical path (20 m of fibre) latency. The chart shows that the majority of the time was spent in the DMA/PCIe logic. This latency depends mostly on the DMA engine core, the server CPU response time and the PCIe socket/cable quality. Without considering the PCIe/DMA latency, we obtain a minimum latency of 416 ns for cut-through intra-rack blade-to-blade communication with 3Gbps 64B Ethernet frame traffic.
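As a rough cross-check of the optical-path share, the propagation delay of the 20 m fibre can be estimated from its length and group index. The group index of 1.468 is an assumed typical value for standard single-mode fibre, not a figure from the measurement, and treating the 416 ns as the sum of FPGA logic, FPGA PHY and fibre latency is likewise an assumption about how the breakdown is composed.

```python
C_VACUUM_M_PER_S = 299_792_458  # speed of light in vacuum


def fibre_delay_ns(length_m: float, group_index: float = 1.468) -> float:
    """Propagation delay through a fibre; group_index is an assumed typical value."""
    return length_m * group_index / C_VACUUM_M_PER_S * 1e9


fibre_ns = fibre_delay_ns(20)   # roughly 98 ns for 20 m of fibre
non_fibre_ns = 416 - fibre_ns   # remainder, if the 416 ns includes the fibre
```

Under these assumptions the fibre accounts for roughly 98 ns, leaving on the order of 320 ns for the FPGA logic and PHY.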
Conclusions
This paper reports an inter- and intra-cluster data centre network architecture based on the FPGA-based optical programmable SIC. We presented the design and implementation of the FPGA-based optical programmable SIC, together with back-to-back throughput and latency results. The SIC offers multi-functionality, flexibility and programmability. The measurement results show ultra-low latency (416 ns) for intra-rack blade-to-blade communication.
