A coordinated radio-resource scheduler with an FPGA-based hardware accelerator is a key component for 5G mobile systems in NFV environments. This paper analyzes the scheduling process and addresses ways to reduce the overhead of the memory copy operation between a central unit and the accelerator. The experimental results show that the overhead is reduced to approximately 14% when the accelerator is connected to a central unit via PCIe with a high-bandwidth memory copy technique. Moreover, they indicate that an accelerator tightly coupled with central units via Ethernet is also a possible approach for coordinated scheduling among multiple central units. This will be advantageous in implementing future NFV-based mobile communications systems.
[1] R. Mijumbi, J. Serrat, J. Gorricho, N. Bouten, F. De Turck, and R. Boutaba, "Network function virtualization: State-of-the-art and research challenges," IEEE Communications Surveys & Tutorials, vol. 18, no. 1, pp. 236-262, First Quarter 2016.
[2] C.-L. I, J. Huang, R. Duan, C. Cui, J. Jiang, and L. Li, "Recent progress on C-RAN centralization and cloudification," IEEE Access, vol. 2, pp. 1030-1039, 2014.
[3] T. Okuyama, S. Suyama, J. Mashino, and Y. Okumura, "Antenna deployment for 5G ultra high-density distributed antenna system at low SHF bands," 2016 IEEE Conference on Standards for Communications and Networking (CSCN), Berlin, pp. 1-6, November 2016.
[4] T. Seyama, M. Tsutsui, T. Oyama, T. Kobayashi, T. Dateki, H. Seki, M. Minowa, T. Okuyama, S. Satoshi, and Y. Okumura, "Study of coordinated radio resource scheduling algorithm for 5G ultra high-density distributed antenna systems," 13th IEEE VTS Asia Pacific Wireless Communications Symposium, Tokyo, S3-5, August 2016.
[5] Y. Arikawa, T. Sakamoto, and S. Kimura, "Hardware accelerator for coordinated radio resource scheduling in 5G ultra-high-density distributed antenna systems," 27th International Telecommunication Networks and Applications Conference (ITNAC), Melbourne, pp. 1-6, November 2017.
This article has been accepted and published on J-STAGE in advance of copyediting. Content is final as presented.
[6] Y. Arikawa, T. Sakamoto, and S. Kimura, "Throughput enhancement with hardware accelerated resource scheduler in 5G low latency systems," 29th Annual IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Bologna, pp. 1-6, September 2018.
[7] 3GPP TS 36.321 "Medium Access Control (MAC) protocol specification," V12.5.0, March 2015.
[8] DMA for PCI express (PCIe) subsystem, https://www.xilinx.com/products/intellectual-property/pcie-dma.html, accessed Feb. 6, 2019.
[9] Data plane development kit (DPDK), https://www.dpdk.org/, accessed Feb. 6, 2019.
Introduction
Network function virtualization (NFV) is a viable concept for designing the upcoming fifth generation of mobile communications systems (5G) [1]. A prototype of an NFV-based base station has been developed using a standard server and a hardware accelerator (HWA) [2]. The HWA mainly handles heavy processing such as the fast Fourier transform (FFT) and channel coding and decoding. The remaining L1, L2, and L3 functions run on the standard server. In 5G, radio-resource scheduling should also be offloaded to an HWA because it, too, becomes heavy processing. As shown in Fig. 1, multiple transmission antennas will be deployed in an ultra-high-density arrangement and intensively controlled by a radio-resource scheduler. The scheduler determines the combinations of transmission antennas and mobile terminals (MTs) [3]. To determine the optimal combination, it uses a coordinated radio-resource scheduling algorithm that can effectively suppress interference among multiple transmission antennas [4]. The algorithm computes a transmission weight matrix to estimate the data throughput for a huge number of possible combinations. By iterating this computation, the scheduler determines the optimal combination, which raises the system throughput [5]. Therefore, it needs to evaluate as many possible combinations as it can within the scheduling period. An HWA implemented on a field-programmable gate array (FPGA) has shown excellent performance in coordinated radio-resource scheduling [5], [6]. For practical use, the memory copy between a central processing unit (CPU) and the HWA is the key to setting aside sufficient time to search for the optimal combination. In particular, the scheduler needs time-series data, such as channel information between transmission antennas and MTs, to compute a transmission weight matrix. Therefore, a memory copy occurs in every scheduling period of one millisecond [7].
This paper discusses the memory copy operation in the coordinated radio-resource scheduling process and presents an FPGA-based radio-resource scheduler for 5G mobile systems in NFV environments.
FPGA-based coordinated radio-resource scheduler
To implement the coordinated radio-resource scheduler with an FPGA-based HWA in 5G mobile systems, this paper discusses two types of functional interfaces and three types of scheduler architectures. The details are described below and summarized in Fig. 2.
Functional interface
When the scheduler offloads computation from the CPU to the FPGA-based HWA, the HWA starts the computation after the channel information has been copied from the CPU main memory. When the number of accommodated MTs is large, the scheduler spends considerable time on the memory copy because the amount of data reaches a few hundred kilobytes. To minimize the memory copy overhead in coordinated radio-resource scheduling, we devised two types of functional interfaces (interfaces I and II), as shown in Fig. 2(a). With interface I, the software extracts possible MTs and sends them to the HWA, which uses the extracted MTs to generate possible combinations. With interface II, in contrast, the HWA itself extracts possible MTs and then generates possible combinations. In both cases, the scheduler repeats the transmission weight computation and the system throughput computation until the scheduling period expires. As shown in Fig. 2(b), repeating the computation increases the system throughput and determines the optimal combination.
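The repeat-until-the-period-expires behavior described above can be illustrated with a small software sketch. It is not the HWA implementation: the random sampling of combinations, the function name `search_until_deadline`, and the `estimate_throughput` callback (which stands in for the transmission weight and throughput computation) are all assumptions introduced for illustration.

```python
import random
import time
from typing import Callable, Optional, Sequence, Tuple

Combination = Tuple[int, ...]  # indices of the MTs scheduled together


def search_until_deadline(
    candidate_mts: Sequence[int],
    group_size: int,
    estimate_throughput: Callable[[Combination], float],
    period: float,
    clock: Callable[[], float] = time.perf_counter,
    rng: Optional[random.Random] = None,
) -> Tuple[Optional[Combination], float, int]:
    """Evaluate MT combinations until the scheduling period expires.

    Returns the best combination found, its estimated throughput, and the
    number of combinations that were evaluated.
    """
    rng = rng or random.Random()
    pool = list(candidate_mts)
    deadline = clock() + period
    best: Optional[Combination] = None
    best_rate = float("-inf")
    evaluated = 0
    while clock() < deadline:
        # Assumption: candidates are drawn at random; the source does not
        # specify how the HWA enumerates combinations.
        combo = tuple(sorted(rng.sample(pool, group_size)))
        rate = estimate_throughput(combo)  # placeholder for weight + throughput
        evaluated += 1
        if rate > best_rate:
            best, best_rate = combo, rate
    return best, best_rate, evaluated
```

Because the loop is bounded only by the deadline, any time spent on the memory copy before the loop starts directly reduces the number of combinations that can be evaluated, which is why the copy overhead matters.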
Physical interfaces
This paper considers three types of scheduler architectures with an FPGA-based HWA. As shown in Fig. 2(c), PCI Express (PCIe) with direct memory access (DMA) is generally used to mount the FPGA-based HWA in the central unit (intra-CU scheduler). To copy data efficiently from the CPU main memory to the FPGA internal memory, we devised a high-bandwidth memory copy technique, which temporarily stores 512-bit data in a high-bandwidth memory and then divides the data among multiple 32-bit memories.
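The width conversion in this technique can be pictured with a short software model. The 512-bit and 32-bit widths come from the text; the least-significant-first lane order, the one-bank-per-lane mapping, and the function names are assumptions made only for illustration, since the actual FPGA memory layout is not given here.

```python
from typing import List

WIDE_BITS = 512    # width of the temporary high-bandwidth word (from the text)
NARROW_BITS = 32   # width of each destination memory (from the text)
LANES = WIDE_BITS // NARROW_BITS  # sixteen 32-bit slices per 512-bit word


def split_wide_word(word: int) -> List[int]:
    """Split one 512-bit word into 32-bit slices, least-significant first."""
    if not 0 <= word < (1 << WIDE_BITS):
        raise ValueError("word does not fit in 512 bits")
    mask = (1 << NARROW_BITS) - 1
    return [(word >> (NARROW_BITS * lane)) & mask for lane in range(LANES)]


def distribute(words: List[int]) -> List[List[int]]:
    """Append slice i of every wide word to bank i (assumed mapping)."""
    banks: List[List[int]] = [[] for _ in range(LANES)]
    for word in words:
        for lane, value in enumerate(split_wide_word(word)):
            banks[lane].append(value)
    return banks
```

In this model, one wide transfer fills sixteen narrow memories at once, so the narrow memories can later be read in parallel by the computation logic.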
To enhance the flexibility of the network design, placing the scheduler over the network should be considered. Fig. 2(d) shows a general implementation of a scheduler connected with multiple CUs over Ethernet (inter-CU scheduler). In this case, the scheduler collects channel information from multiple CUs and copies the collected data to its CPU main memory. After that, the CPU performs a memory copy to the FPGA-based HWA. This results in significant overhead because the memory copy operation is performed twice. To reduce this overhead, as illustrated in Fig. 2(e), we devised an inter-CU scheduler with a tightly coupled HWA (TC-HWA). The FPGA-based HWA is directly accessible from the CUs via Ethernet. The scheduler collects channel information and copies the collected data directly to the FPGA's internal memory.
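The three architectures differ mainly in how many copies lie between the CU main memory and the FPGA internal memory. A minimal sketch of that difference follows. The `Hop` type, the function names, and the first-order time model (payload divided by link rate, ignoring latency and protocol overhead) are assumptions; the use of PCIe for the second hop of the plain inter-CU scheduler is also assumed, as the text does not name that link.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Hop:
    source: str
    destination: str
    link: str


# Copy paths per architecture, following the description in the text.
PATHS: Dict[str, Tuple[Hop, ...]] = {
    "intra-CU": (
        Hop("CU main memory", "FPGA internal memory", "PCIe"),
    ),
    "inter-CU": (
        Hop("CU main memory", "scheduler main memory", "Ethernet"),
        # Assumption: the second copy uses PCIe; the text only says "memory copy".
        Hop("scheduler main memory", "FPGA internal memory", "PCIe"),
    ),
    "inter-CU with TC-HWA": (
        Hop("CU main memory", "FPGA internal memory", "Ethernet"),
    ),
}


def copy_count(architecture: str) -> int:
    """Number of memory copies before the HWA can start computing."""
    return len(PATHS[architecture])


def path_time(architecture: str, payload_bytes: float,
              bytes_per_s: Dict[str, float]) -> float:
    """First-order copy time in seconds: payload / rate, summed over hops."""
    return sum(payload_bytes / bytes_per_s[hop.link]
               for hop in PATHS[architecture])
```

With any set of link rates, the TC-HWA path drops the second term of the plain inter-CU sum, which is the saving the tightly coupled design targets.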
Evaluation
We analyze the processing time in detail, including the memory copy operation in the coordinated radio-resource scheduling, and then discuss the physical and functional interfaces in terms of the processing speed and flexibility of the scheduling.
Experimental conditions
As shown in Fig. 3(a), the evaluation platform consists of a standard Linux server with an Intel Xeon processor (clock frequency: 3.2 GHz) and the FPGA-based HWA. The HWA described in [5] was implemented on an FPGA (Xilinx, VCU118) (clock frequency: 125 MHz). Sample scheduling software was run on Ubuntu Linux 16.04. The intra-CU scheduler was implemented by connecting the server and the FPGA-based HWA via PCI Express Generation 3.0 x16 lanes with a direct memory access controller (DMAC) [8]. The inter-CU scheduler with the TC-HWA was implemented by connecting the server and the FPGA-based HWA via Gigabit Ethernet with the data plane development kit (DPDK) [9]. In addition, we estimated the performance of the inter-CU scheduler by adding the processing time for copying from the CU to the scheduler main memory and for copying from the scheduler main memory to the FPGA-based HWA.
Overall performance
Fig. 3(b) shows the dependence of the number of computed combinations on the processing time for the three types of scheduler architectures at 128 MTs. The intra-CU scheduler performs well even when the memory copy overhead is included: it computes approximately a hundred combinations within one millisecond, which is sufficient to search for the optimal combination at 128 MTs [5]. If the system requires low latency, PCIe is thus a suitable physical interface. The inter-CU scheduler takes approximately three milliseconds to compute a hundred combinations. This estimate is reasonable because the memory copy occurs twice in this case: the CU main memory is copied to the scheduler main memory, and then the scheduler main memory is copied to the FPGA-based HWA. On the other hand, the inter-CU scheduler with the TC-HWA significantly reduces the processing time. At ten milliseconds, which corresponds to the radio-frame duration, it shows performance comparable to that of the intra-CU scheduler. Even at one millisecond, it computes a hundred combinations within two milliseconds.
These results suggest that the inter-CU scheduler with the TC-HWA is a possible option for sharing channel information among multiple CUs over the network. The scheduler can change coordinated transmission antennas more flexibly and dynamically. Moreover, the FPGA-based HWA can be placed not only inside the CU but also outside it. In this way, the system operator can design the network more flexibly.
Performance breakdown
Figs. 3(c) and (d) show breakdowns of the processing time for the intra-CU scheduler and the inter-CU scheduler with the TC-HWA at a hundred combinations. The scheduling computation is executed on the dedicated hardware with the same latency, and it does not depend on the number of MTs. In contrast, the memory copy overhead is not negligible: it is 14-19% in the intra-CU scheduler. This still leaves the scheduler sufficient time to search for the optimal combination, and the memory copy overhead remains limited even when the number of accommodated MTs is large. For the inter-CU scheduler with the TC-HWA, it is 25% at 32 MTs.
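The percentages quoted here can be read as the share of the total processing time spent on memory copy. The sketch below states that reading explicitly. The definition (copy time divided by copy time plus computation time) is an assumption, since the exact formula behind the reported figures is not given, and the function names are illustrative only.

```python
def copy_overhead(copy_time: float, compute_time: float) -> float:
    """Fraction of total processing time spent on memory copy (assumed definition)."""
    total = copy_time + compute_time
    if total <= 0:
        raise ValueError("total processing time must be positive")
    return copy_time / total


def search_budget(period: float, copy_time: float) -> float:
    """Time left in one scheduling period for the combination search."""
    return max(0.0, period - copy_time)
```

Under this reading, lowering the copy time both reduces the overhead fraction and lengthens the search budget within a fixed scheduling period.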
Although the flexibility of the combination search is compromised, interface II is favorable in terms of reducing the memory copy overhead even for a large number of accommodated MTs. When the scheduler works with interface I, the memory copy overhead increases slightly with the number of MTs. This is because the CPU extracts possible MTs and then sends them to the HWA. With interface II, the CPU does not extract possible MTs. Instead, it sends all MTs, which are stored in a contiguous memory region. Therefore, DMA and DPDK work efficiently, and the overhead is smaller than that for interface I. As a result, if the system requires low latency, interface II is the suitable choice, at the cost of some flexibility.
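The difference between the two interfaces on the CPU side can be sketched as a gather step versus a single contiguous hand-off. The record layout (`RECORD`, an MT identifier followed by two 32-bit floating-point values) and all function names are hypothetical; the real channel-information format is not described in the text.

```python
import struct
from typing import Iterable, Sequence, Tuple

# Hypothetical per-MT record: 16-bit MT id plus two float32 channel values.
RECORD = struct.Struct("<Hff")


def pack_table(records: Sequence[Tuple[int, float, float]]) -> bytearray:
    """Store all MT records back to back in one contiguous buffer."""
    buf = bytearray(RECORD.size * len(records))
    for i, rec in enumerate(records):
        RECORD.pack_into(buf, i * RECORD.size, *rec)
    return buf


def payload_interface_1(table: bytearray, selected: Iterable[int]) -> bytes:
    """Interface I: the CPU gathers only the extracted MTs into a new buffer."""
    out = bytearray()
    for i in selected:
        out += table[i * RECORD.size:(i + 1) * RECORD.size]
    return bytes(out)


def payload_interface_2(table: bytearray) -> memoryview:
    """Interface II: the whole contiguous table is handed over without a gather."""
    return memoryview(table)
```

Interface I moves fewer bytes but pays for the per-MT gather on the CPU, and that cost grows with the number of MTs. Interface II moves the whole table as one region, which suits bulk transfer mechanisms such as DMA.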
Consequently, as summarized in Fig. 3(e), the intra-CU scheduler is suitable for shorter scheduling periods of sub-frame duration. Notably, the inter-CU scheduler with the TC-HWA can dynamically change the coordinated MTs and can extend the flexibility of operating the coordinated scheduling among multiple CUs. This will enable system operators to design the network more flexibly.
Conclusion
For 5G mobile systems in NFV environments, this paper discussed an architecture for a coordinated radio-resource scheduler with an FPGA-based HWA and addressed how to reduce the overhead of the memory copy operation between the CU and the HWA. We devised an intra-CU scheduler with a high-bandwidth memory copy technique and an inter-CU scheduler with a tightly coupled HWA. The experimental results show that the overheads of the intra-CU scheduler and the inter-CU scheduler with the TC-HWA are reduced to approximately 14% and 25%, respectively. Moreover, they indicate that the inter-CU scheduler with the TC-HWA is a suitable architecture for coordinated scheduling among multiple CUs. This will be advantageous in implementing future NFV-based mobile systems.
Acknowledgments
Some of the results reported here were obtained in "The research and development project for realization of the fifth-generation mobile communications system" commissioned by The Ministry of Internal Affairs and Communications, Japan. 
