Abstract-The rapid growth of mobile broadband wireless services in recent years demands high speed data transmission for both access and backhaul networks. With the increase of data rate for 5G access to tens of Gigabits per second (Gbps), higher speed transmission for backhaul network is necessary. Current wireless backhaul systems have been able to achieve the data rate of multiple Gbps, but the ability to deal with significant practical impairments such as large carrier frequency offset and IQ mismatch is still a technical challenge. In this paper, a 20 Gbps digital modem for wireless backhaul applications is proposed. Simulation and field programmable gate array implementation show that the proposed design and signal processing algorithms meet the targeted system performance.
I. INTRODUCTION
The future 5G mobile networks are required to offer data rates of several gigabits per second (Gbps) for tens of thousands of users [1]. As this high speed traffic must be connected to the core network through the backhaul, more stringent requirements will have to be placed on backhaul data transmission in terms of capacity, latency, availability, energy efficiency and cost efficiency [2].
There are two main physical media used for backhauling: fiber and radio [3]. Fiber is the primary medium for delivering leased synchronous digital services and Ethernet services, and it can offer high data rate backhauling with capacities of multiple tens of Gbps. However, it is more costly and less flexible than wireless backhaul. Wireless systems can now reach fiber capacity [4], and it is therefore more effective to use wireless backhaul technology when network capacity, data rates and operating cost are considered.
Wireless backhaul is also critical in space-air-ground integrated networks, where high speed transmission between fast moving platforms such as low earth orbit (LEO) satellites is required. Therefore, in addition to high speed, significant practical impairments such as large carrier frequency offset (CFO) and IQ mismatch are challenging issues for wireless backhaul. In this paper, we design and implement a new digital modem which achieves high speed transmission with these practical impairments compensated.
The rest of this paper is organized as follows. In Section II, the basic system requirements and architectures are described. In Section III, the implementation of the Ethernet interface is introduced. In Section IV, the digital signal processing algorithms are described from various aspects, including the transmitter, the receiver and field programmable gate array (FPGA) implementation considerations. Simulation and implementation results are provided in Section V to demonstrate the achievable performance. Finally, conclusions are drawn in Section VI.
II. SYSTEM DESCRIPTION

A. System Requirements
The backhaul system under development is required to provide a raw data rate of up to 20 Gbps. The actual data rate at the physical (PHY) layer can be adjusted according to the speed of the medium access control (MAC) layer. At the maximum speed of 20 Gbps, the bit error rate (BER) is required to be lower than 10^-7 at 14 dB normalized signal-to-noise ratio (SNR). When the PHY data rate is reduced, the BER can be further improved. The modulation type is 16QAM, and the demodulation loss is below 5 dB.
The selection of carrier center frequency depends on the radio frequency (RF) band and up/down conversion architecture, which is not the focus of this modem design. However, in order to achieve high speed transmission between fast moving platforms such as LEO satellites, the modem is required to be able to capture and track significantly large CFO up to tens of MHz. 978-1-5090-5932-4/17/$31.00 ©2017 IEEE
B. System Architecture
The digital modem is the key component of the high speed backhaul communication system. Fig. 1 shows the backhaul system architecture with only half of the proposed 20 Gbps digital modem. The complete digital modem is composed of two baseband digital signal processing platforms, each capable of processing 10 Gbps data rate, and an intermediate frequency (IF) module for transmitter and receiver respectively. When fully operated, the digital modem can transmit and receive Ethernet traffic at 20 Gbps data rate simultaneously.
According to the system design, the 20 Gbps digital modem is equipped with two 10 Gigabit Ethernet (GbE) interfaces. The data bits from one 10 GbE interface are split into two streams, each with a 5 Gbps data rate. Each 5 Gbps data stream is transmitted over one 2.5 GHz baseband channel which is shifted to the IF band. There are four such 2.5 GHz channels in the system in total. The digital modem consists of two baseband digital signal processing (DSP) platforms, each having four D/A and four A/D devices. Each baseband DSP platform is capable of transmitting and receiving two 5 Gbps data streams.
At the transmitter, the IF module up-converts the I/Q modulated baseband signals generated by the baseband digital platforms to IF signals. There are four channels of baseband signals in total, each with 2.5 GHz bandwidth. Two channels are combined to form a baseband signal with 5 GHz bandwidth. The two resulting 5 GHz channels are further up-converted to IF as the lower and upper sidebands respectively, which are finally combined to form an IF signal with 10 GHz bandwidth.
At the receiver, the received IF signal is bandpass filtered to obtain the lower sideband and the upper sideband respectively. Each sideband is then down-converted to a baseband signal with 5 GHz bandwidth, from which two channels of baseband signals with 2.5 GHz bandwidth are finally received by the baseband digital platforms. Information data are subsequently demodulated by the digital modem.
C. Physical Layer Protocol
The PHY layer frame consists of a preamble, pilots and a sequence of symbols carrying data from the MAC, as illustrated in Fig. 2.
Preamble is used for synchronization and channel estimation. The preamble is composed of two blocks, each consisting of 64 samples. The two blocks are sequential and exactly the same.
The pilots in the data symbols are PN-coded symbols corresponding to the constellation points (1+j) or -(1+j). To prevent the generation of spectral lines, each pilot is multiplied by one code of a PN sequence. Each frame starts with the pilot using the first code of the PN sequence, and the PN sequence is not continued between adjacent frames. One pilot is added to each data block, which has a fixed number of symbols, so that multiple pilots are spread over the whole frame in order to track the channel variation and compensate for phase noise.
The forward error correction uses the 802.11n low density parity check (LDPC) code [5] with 1944 bits per block and a coding rate of 3/4. The encoded bits are then grouped into data symbols for 16QAM modulation. For one PHY frame, 252 pilots are inserted into 14 LDPC blocks. The 14 LDPC blocks and 252 pilots occupy 1176 clock periods, and the preamble occupies 16 clock periods. The system clock frequency is 312 MHz, so the total duration of one frame is (1176+16)/312 MHz ≈ 3.82 µs. The user data in one frame is 1944*3/4*14 = 20412 bits; therefore, the rate of one band is 20412/3.82 µs ≈ 5.34 Gbps. Each digital baseband platform implements two bands separately. As there are two digital baseband platforms in the system, the total data rate is 5.34*4 = 21.36 Gbps, and thus the targeted 20 Gbps data rate can be achieved.
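The frame arithmetic above can be reproduced with a short script. This is only a sketch that restates the figures quoted in the text; the function names are illustrative, and the symbols-per-clock cross-check is an inference from the 16QAM mapping (4 bits per symbol) rather than a number stated in the paper.

```python
# Frame-level throughput arithmetic using the figures quoted in the text.
LDPC_BLOCK_BITS = 1944      # coded bits per LDPC block
CODE_RATE = 3 / 4
BLOCKS_PER_FRAME = 14
PILOTS_PER_FRAME = 252
PAYLOAD_CLOCKS = 1176       # clock periods for LDPC blocks plus pilots
PREAMBLE_CLOCKS = 16
CLOCK_HZ = 312e6
BANDS = 4                   # two bands on each of two DSP platforms
BITS_PER_SYMBOL = 4         # 16QAM


def frame_duration_s():
    return (PAYLOAD_CLOCKS + PREAMBLE_CLOCKS) / CLOCK_HZ


def user_bits_per_frame():
    return int(LDPC_BLOCK_BITS * CODE_RATE) * BLOCKS_PER_FRAME


def band_rate_bps():
    return user_bits_per_frame() / frame_duration_s()


def system_rate_bps():
    return BANDS * band_rate_bps()


def symbols_per_clock():
    # Cross-check (inferred, not stated in the text): coded symbols plus
    # pilots divided by the payload clock periods.
    coded_symbols = BLOCKS_PER_FRAME * LDPC_BLOCK_BITS // BITS_PER_SYMBOL
    return (coded_symbols + PILOTS_PER_FRAME) / PAYLOAD_CLOCKS


if __name__ == "__main__":
    print(f"frame duration : {frame_duration_s() * 1e6:.2f} us")
    print(f"rate per band  : {band_rate_bps() / 1e9:.2f} Gbps")
    print(f"system rate    : {system_rate_bps() / 1e9:.2f} Gbps")
    print(f"symbols/clock  : {symbols_per_clock():.1f}")
```

Without intermediate rounding the four-band total comes out slightly above the 21.36 Gbps quoted above, which multiplies the rounded per-band figure of 5.34 Gbps by four.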
III. ETHERNET INTERFACE IMPLEMENTATION
A. Functional Overview
The MAC acts as an interface between the digital modem's PHY layer and the network's physical layer [6]. The MAC can balance the network load across two channels when both channels are enabled. One PHY channel may optionally be disabled via the MAC, and the PHY channels exchange data with the MAC in fixed length frames.
Considering the reliability and robustness, MAC should preserve Ethernet framing across the radio link. The additional control characters, which are encoded in-band with the data, are self-synchronized so that MAC functionality can be recovered after any variable bit shift or random data loss.
B. Interfacing Radio Physical Coding Sub-layer (PCS)
There are several possible architectures for interfacing the radio PCS upper interface to an Ethernet fiber [7], but the 66B bridge architecture is the most suitable because of its implementation simplicity and low processing delay. The Ethernet PCS block performs the standard Ethernet PCS layer functionality. Native 66-bit blocks are passed directly from the network interface to the MAC. The blocks must be de-scrambled in order to accurately identify the idle control characters that fill the gaps between Ethernet frames. Idle deletion is important to ensure that buffer memory is utilized efficiently. The Ethernet idle block code can also be used for wireless packet padding, so that the transmitter sends data continuously without a guard interval.
The idle block insertion and deletion functions within the TX FIFO and RX FIFO are necessary for a constant PHY data rate. The PHY's D/A and A/D modules generate the clocks (312 MHz) for the transmitter and receiver respectively, whereas the Ethernet clocks (156 MHz) of the GTX receiver and the GTX transmitter are sourced from the GTX transceiver. The PHY transmitter reads the RX FIFO regularly, regardless of whether the MAC blocks are valid or idle, while the PHY receiver only sends valid blocks into the TX FIFO. Therefore, the MAC should insert and identify the idle blocks in order to control the data rate and send useful data to the GTX transceiver. The traffic monitor checks the data stream for the PHY layer. Ethernet packet framing and checksum errors are calculated in two places: at the output of the TX FIFO (Eth tx) and at the output of the RX FIFO (Eth rx). This is useful for troubleshooting while the link is operational with user traffic. Fig. 3 shows the Ethernet interface architecture.
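The rate-matching role of idle insertion and deletion can be sketched in a few lines. This is a behavioural illustration only, not the FPGA logic: the `IDLE` marker and the two helper functions are hypothetical stand-ins for the 66-bit idle control block and the FIFO read and write ports.

```python
from collections import deque

IDLE = "IDLE"  # hypothetical stand-in for the 66-bit Ethernet idle block


def phy_tx_read(rx_fifo):
    """PHY transmitter side: one block is read every cycle.

    When no MAC block is waiting, an idle block is inserted so that
    the PHY data rate stays constant.
    """
    return rx_fifo.popleft() if rx_fifo else IDLE


def phy_rx_write(tx_fifo, block):
    """PHY receiver side: idle blocks are deleted, valid blocks forwarded."""
    if block != IDLE:
        tx_fifo.append(block)


if __name__ == "__main__":
    eth_rx = deque(["blk0", "blk1"])          # blocks arriving from Ethernet
    air = [phy_tx_read(eth_rx) for _ in range(4)]
    eth_tx = deque()
    for blk in air:
        phy_rx_write(eth_tx, blk)
    print(air)            # padded stream sent over the radio link
    print(list(eth_tx))   # stream handed back to Ethernet
```

In this sketch the padded stream always contains one block per read cycle, which mirrors the constant PHY data rate, while the far end sees only the original MAC blocks.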
IV. SIGNAL PROCESSING IMPLEMENTATION
A. Transmitter Signal Processing
Each PHY channel adopts single carrier transmission with I/Q modulation. Data symbols with a rate of 1.875 Gsps are transmitted continuously without a guard interval. At the transmitter, LDPC is used to encode the data bits from the Ethernet interface, and the coded bits are then mapped into data symbols using 16QAM. For each frame, the preamble is added at the start of the frame. The data symbols finally go through a root raised cosine (RRC) pulse shaping filter. Since the signal sampling rate is 2.5 Gsps, sampling rate conversion (SRC) is necessary before pulse shaping. The signal processing diagram for the transmitter is shown in Fig. 4.
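The relation between the 1.875 Gsps symbol rate and the 2.5 Gsps sampling rate is a ratio of 4/3, which the following sketch realizes in the simplest possible way: zero-stuff by 4, apply an RRC filter designed at 4 samples per symbol, and keep every third output sample. It combines SRC and pulse shaping in one step for brevity and is not the polyphase structure used in the FPGA. The roll-off factor, the filter span and all function names are assumptions, since the paper does not state them.

```python
import numpy as np


def rrc_taps(beta, sps, span):
    """Unit-energy root raised cosine taps.

    beta: roll-off (assumed value, not given in the paper)
    sps:  samples per symbol at the design rate
    span: one-sided filter length in symbols
    """
    t = np.arange(-span * sps, span * sps + 1) / sps  # time in symbol periods
    h = np.empty_like(t)
    for i, ti in enumerate(t):
        if ti == 0.0:
            h[i] = 1.0 - beta + 4.0 * beta / np.pi
        elif abs(abs(ti) - 1.0 / (4.0 * beta)) < 1e-12:
            # Removable singularity at t = +/- T / (4 * beta)
            h[i] = (beta / np.sqrt(2.0)) * (
                (1.0 + 2.0 / np.pi) * np.sin(np.pi / (4.0 * beta))
                + (1.0 - 2.0 / np.pi) * np.cos(np.pi / (4.0 * beta))
            )
        else:
            num = np.sin(np.pi * ti * (1.0 - beta)) + 4.0 * beta * ti * np.cos(
                np.pi * ti * (1.0 + beta)
            )
            den = np.pi * ti * (1.0 - (4.0 * beta * ti) ** 2)
            h[i] = num / den
    return h / np.sqrt(np.sum(h ** 2))


def shape_and_resample(symbols, beta=0.25, span=8):
    """Pulse-shape symbols and convert the rate by 4/3 (1.875 -> 2.5 Gsps)."""
    up, down = 4, 3
    h = rrc_taps(beta, up, span)
    stuffed = np.zeros(len(symbols) * up, dtype=complex)
    stuffed[::up] = symbols              # zero-stuffing by 4
    shaped = np.convolve(stuffed, h)     # RRC filtering at 4 samples/symbol
    return shaped[::down]                # keep every third sample


if __name__ == "__main__":
    qam16 = np.array([1 + 1j, -3 + 1j, 3 - 3j, -1 - 1j, 1 - 3j, 3 + 3j])
    print(len(shape_and_resample(qam16)))
```

A hardware implementation would normally compute only the retained output samples, which is what evaluating the filter at different time offsets achieves.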
B. Receiver Signal Processing
At the receiver, the first process after receiving data from the A/D is frame synchronization for each PHY channel. Once the preamble is captured in this process, it is used to estimate the channel response, I/Q imbalance, and CFO. The receiver filters, which convert the sample rate to the symbol rate and correct all the practical impairments, are constructed from these estimates. Recovery of the data symbols is followed by data demapping and LDPC decoding. The signal processing diagram for the receiver is shown in Fig. 5.
Synchronization includes coarse timing (packet acquisition) and fine timing. After the system power-up, coarse timing tries to capture the preamble which contains two training sequences. It can be implemented with autocorrelation operation, exploiting the similarity between the two training sequences. Initial CFO estimation is also performed using the autocorrelation outputs.
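A minimal sketch of autocorrelation-based coarse timing and initial CFO estimation is given below. It assumes that the two 64-sample training blocks are sampled at 2.5 Gsps and uses a normalized delay-and-correlate metric; the normalization, the function names and the random QPSK training block in the demo are illustrative choices, not details taken from the paper.

```python
import numpy as np


def coarse_sync(rx, block_len=64, fs=2.5e9):
    """Search for two identical consecutive training blocks.

    Returns (start_index, cfo_hz). fs is an assumed sample rate.
    """
    rx = np.asarray(rx, dtype=complex)
    n_lags = len(rx) - 2 * block_len + 1
    best_metric, best_d, best_corr = -1.0, 0, 0j
    for d in range(n_lags):
        first = rx[d:d + block_len]
        second = rx[d + block_len:d + 2 * block_len]
        corr = np.vdot(first, second)  # sum(conj(first) * second)
        energy = np.vdot(first, first).real * np.vdot(second, second).real
        metric = abs(corr) ** 2 / (energy + 1e-12)  # at most 1 (Cauchy-Schwarz)
        if metric > best_metric:
            best_metric, best_d, best_corr = metric, d, corr
    # The phase of the correlation is the CFO rotation over block_len samples.
    cfo_hz = np.angle(best_corr) * fs / (2 * np.pi * block_len)
    return best_d, cfo_hz


def demo_frame(cfo_hz, fs=2.5e9, block_len=64, lead=100, seed=0):
    """Noise-free test signal: silence, two identical blocks, random data."""
    rng = np.random.default_rng(seed)

    def qpsk(n):
        return (rng.choice([-1, 1], n) + 1j * rng.choice([-1, 1], n)) / np.sqrt(2)

    block = qpsk(block_len)
    frame = np.concatenate([np.zeros(lead), block, block, qpsk(200)])
    n = np.arange(len(frame))
    return frame * np.exp(2j * np.pi * cfo_hz * n / fs)


if __name__ == "__main__":
    start, cfo = coarse_sync(demo_frame(10e6))
    print(start, cfo / 1e6)
```

Under these assumptions the estimator is unambiguous only for offsets within about ±fs/(2·64) ≈ ±19.5 MHz; a larger CFO would wrap around in this simplified form.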
Fine timing, which starts once coarse timing has been achieved, is realized by computing the cross-correlation of a local template training signal and the received signal. This process can be implemented in either the time domain or the frequency domain. In this system, fine timing is implemented in the frequency domain, which requires fewer firmware resources. After the coarse timing point is obtained, the received training signal is located in the second half of the training sequence. A signal segment of samples is taken backwards from the coarse timing point. This training signal segment is converted to the frequency domain, and the outputs are multiplied by the conjugate of the frequency domain training sequence. The product is then converted back to the time domain, and the location of the time domain peak is recorded. This peak location indicates the distance from the coarse timing point to the end of the training sequence, and hence the fine timing point is obtained.
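The frequency-domain correlation described above amounts to a circular cross-correlation computed with FFTs. The sketch below illustrates this on a noise-free example. The training sequence in the paper is not specified, so a Zadoff-Chu sequence is used here purely as a hypothetical stand-in with a sharp correlation peak, and any CFO is assumed to have been removed beforehand.

```python
import numpy as np


def zadoff_chu(n=64, root=1):
    """Hypothetical training block (the actual sequence is not given)."""
    k = np.arange(n)
    return np.exp(-1j * np.pi * root * k * k / n)


def fine_timing_lag(segment, template):
    """Circular cross-correlation via FFT; returns the lag of the peak.

    segment and template must have the same length.
    """
    spectrum = np.fft.fft(segment) * np.conj(np.fft.fft(template))
    corr = np.fft.ifft(spectrum)
    return int(np.argmax(np.abs(corr)))


if __name__ == "__main__":
    template = zadoff_chu()
    preamble = np.tile(template, 2)      # two identical 64-sample blocks
    segment = preamble[50:114]           # 64 samples straddling both blocks
    print(fine_timing_lag(segment, template))
```

Because the preamble repeats the same block twice, any 64-sample window inside it is a circular shift of the template, so the peak lag gives the window position modulo the block length. How that lag maps to the frame boundary depends on the indexing convention chosen for the coarse timing point.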
Channel estimation is basically performed using the preamble obtained from each received frame and comparing it with the known training sequence. However, it should be combined with the I/Q imbalance compensation, assuming that the I/Q imbalance has been estimated separately. Due to the I/Q modulation architecture adopted in the modem system, I/Q imbalance is a significant issue and the compensation has to be done at the receiver.
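To make the effect of I/Q imbalance concrete, the following sketch uses one common textbook parameterization in which the impaired signal is a mixture of the ideal signal and its complex conjugate. The model, the coefficient names `k1` and `k2`, and the assumption that the gain and phase mismatch are already known are illustrative; the paper does not give its estimation or compensation equations.

```python
import numpy as np


def iq_coefficients(gain_db, phase_deg):
    """Mixing coefficients for a gain/phase mismatch (assumed model)."""
    g = 10 ** (gain_db / 20)
    phi = np.deg2rad(phase_deg)
    k1 = (1 + g * np.exp(-1j * phi)) / 2
    k2 = (1 - g * np.exp(1j * phi)) / 2
    return k1, k2


def apply_iq_imbalance(x, k1, k2):
    """Impaired signal: desired term plus a conjugate (image) term."""
    return k1 * x + k2 * np.conj(x)


def compensate_iq_imbalance(y, k1, k2):
    """Invert the model above when k1 and k2 are known or estimated."""
    return (np.conj(k1) * y - k2 * np.conj(y)) / (abs(k1) ** 2 - abs(k2) ** 2)


if __name__ == "__main__":
    x = np.array([1 + 1j, -3 + 1j, 3 - 3j, -1 - 1j])
    k1, k2 = iq_coefficients(1.0, 5.0)   # 1 dB and 5 degrees mismatch
    y = apply_iq_imbalance(x, k1, k2)
    print(np.round(compensate_iq_imbalance(y, k1, k2), 6))
```

In a receiver the coefficients would have to be estimated, for example from the preamble, before the inversion can be applied.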
C. FPGA Implementation Considerations
As defined in the PHY protocol, PHY frames are transmitted continuously, whereas MAC frames arrive randomly at a maximum rate of 10 Gbps for each DSP platform. To adapt to this condition, data padding is necessary: idle frames are added to accommodate the time-varying speed of the MAC. In order to reduce the implementation complexity, a padding frame is filled as a whole. In this way, the transmitter generates either a data frame or an idle frame, and the receiver can easily identify the padding frame. For the padding frame, a specific idle block is selected that can be distinguished from a data block, so that it can be removed at the receiver. Since random bit errors can occur in both idle blocks and data blocks, a suitable strategy is necessary to reduce the impact of such error conditions.
At the transmitter, each filter is an RRC pulse shaping filter that converts the sampling rate from 1.875 Gsps to 2.5 Gsps using different time offsets. The filter coefficients are stored in distributed RAM as a look-up table (LUT). The transmitter filter must generate eight samples in each clock period to attain the data rate required by the system. A substantial number of adders is required in the two transmitter filters of each baseband DSP platform. Because of the timing of the high speed system clock (312 MHz), the number of adders for signals with long bit-widths that are chained within one clock period should be limited. However, if more adders are used in one clock period, the usage rate of the LUTs is higher. It is therefore necessary to consider the system clock timing and the usage rate together. After comparing different numbers of adders used in each clock period, we found that three adders achieve a good trade-off for this system.
To achieve better performance, the number of iterations for each LDPC decoding block should be adjusted automatically to adapt to the speed of the transmitter and the channel condition. A large buffer is used before the LDPC decoder. When less data is stored in the buffer, the number of LDPC decoding iterations can be increased for better decoding performance. When more data is stored, the number of iterations can be decreased to ensure that the data are processed in time. For the two bands of each digital platform, all processes are independent except for the LDPC decoder. The two bands, which may work under different channel conditions, share the LDPC decoder cores to optimize the BER.
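The buffer-driven adjustment of the iteration count can be illustrated with a simple mapping from buffer occupancy to an iteration budget. The linear rule and the limits `min_iter` and `max_iter` below are hypothetical; the paper does not state the actual policy or its thresholds.

```python
def ldpc_iterations(buffer_fill, buffer_depth, min_iter=4, max_iter=20):
    """Choose an LDPC iteration budget from the input buffer occupancy.

    An empty buffer allows max_iter iterations; a full buffer forces
    min_iter so that the decoder keeps up with the incoming data.
    All limits are illustrative assumptions.
    """
    if buffer_depth <= 0:
        raise ValueError("buffer_depth must be positive")
    fill = min(max(buffer_fill / buffer_depth, 0.0), 1.0)
    return max_iter - round(fill * (max_iter - min_iter))


if __name__ == "__main__":
    for occupancy in (0, 25, 50, 75, 100):
        print(occupancy, ldpc_iterations(occupancy, 100))
```

When two bands share the decoder cores, a rule of this kind lets the band with the emptier buffer, or the worse channel, spend more iterations per block.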
V. SIMULATION AND IMPLEMENTATION RESULTS
A. Simulation Results
The BER performance of the designed 20 Gbps digital modem is simulated using Matlab under various channel conditions. A number of factors can affect the overall system BER, such as the CFO, IQ mismatch, and LDPC decoding performance. Fig. 6 shows the BER simulation results under a channel with CFO and IQ mismatch and under an ideal Gaussian channel respectively. For the channel with CFO and IQ mismatch, the CFO is set to 10 MHz and the IQ gain and phase mismatch are set to 1 dB and 5 degrees respectively. For the ideal Gaussian channel, no CFO or IQ mismatch is assumed. We see that the performance gap under practical impairments is about 2 dB. Fig. 7 shows the simulated LDPC decoding performance with 16QAM for 100000 LDPC blocks under the ideal Gaussian channel. Since each block has 1944 x 3/4 = 1458 bits, the total number of bits tested is 100000 x 1458 = 145800000. From these results we see that at SNR = 14 dB, the BER is lower than 10^-7. Since each 16QAM symbol carries 4 bits, the equivalent Eb/N0 can be calculated as 14 − 6 = 8 dB. Considering 5 dB demodulation loss (including all the implementation impairments such as I/Q imbalance, CFO, timing error, phase noise, etc.), the system will be able to achieve 10^-7 BER at Eb/N0 = 13 dB, which satisfies the targeted performance requirement.
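The SNR to Eb/N0 conversion used above can be checked numerically. The sketch mirrors the arithmetic in the text, which subtracts only the bits-per-symbol term; the function name is illustrative.

```python
import math


def ebn0_db(snr_db, bits_per_symbol):
    """Convert per-symbol SNR to Eb/N0 as done in the text."""
    return snr_db - 10 * math.log10(bits_per_symbol)


SNR_DB = 14.0                 # simulated operating point
DEMOD_LOSS_DB = 5.0           # implementation loss budget from the text
BITS_TESTED = 100000 * 1458   # LDPC blocks times user bits per block

if __name__ == "__main__":
    ideal = ebn0_db(SNR_DB, 4)           # 16QAM carries 4 bits per symbol
    required = ideal + DEMOD_LOSS_DB
    print(f"ideal Eb/N0    : {ideal:.2f} dB")
    print(f"with 5 dB loss : {required:.2f} dB")
    print(f"bits tested    : {BITS_TESTED}")
```

The exact value of 10·log10(4) is about 6.02 dB, so the 8 dB and 13 dB figures in the text are rounded.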
B. FPGA Implementation
The baseband DSP platforms adopt FPGA devices produced by Xilinx. The Xilinx Virtex 7 is currently the most powerful device, and hence the Virtex 7-690t, which has sufficient FPGA resources, is selected. The signal processing blocks for the main data streams, such as LDPC encoding, 16QAM mapping, transmit filter, receiver filter, 16QAM demapping, LDPC decoding, and the A/D and D/A modules, are implemented. The resource usage of the completed modules for some typical cells in the FPGA, including LUTs, slice registers, block RAMs and multipliers, is shown in Table I. Of the 3600 multipliers in the device, about 2700 are used. The receiver filter module alone uses 1344 multipliers. Considering the limited number of multipliers in the FPGA and the high speed system clock, timing constraints are essential: the timing requirement is met by allocating the multipliers effectively and, at the same time, adding constraints for specific modules such as the autocorrelation, channel estimation and LDPC decoder core. Fig. 8 shows the routing result for the completed modules. The yellow cells for the receiver filter are constrained to the left half of the device, because the receiver filter uses a large number of multipliers but few other cells. Therefore, some other modules which do not use multipliers can be placed in the same area as the receiver filter. In this way, it is much easier to meet the timing for the whole system. Fig. 9 shows the design timing summary report.
The predicted FPGA usage for channel estimation and equalization is about 40000 slice LUTs, 50000 slice registers, 90 block RAMs and 1300 multipliers. Thus, the total FPGA usage is given in Table II. The Virtex7-690t device comprises 433200 slice LUTs, 866400 slice registers, 1470 block RAMs and 3600 multipliers [8]. One baseband digital signal processing platform will use less than 50% of the total slice LUTs, slice registers, and block RAMs. Only the multiplier usage may reach up to 75% of the total available multipliers.
VI. CONCLUSION
In this paper, the design and implementation of a 20 Gbps digital modem for high speed wireless backhaul applications have been presented. In addition to achieving the 20 Gbps data rate, the digital modem also deals with practical impairments such as very large CFO and significant IQ mismatch. The requirements, architecture and signal processing of the system are described to show that the system can be implemented with currently available FPGA technology. The simulation results show that the design satisfies the targeted system performance requirements. The critical modules have been implemented in the FPGA through optimized algorithms. The total FPGA usage is reasonable for achieving system clock timing. Future work includes reducing the signal processing latency to meet the high speed and low latency requirements at the same time for various wireless backhaul applications.
