We develop a real-time discrete multi-tone (DMT) transceiver based on field programmable gate array (FPGA) chips for an underwater visible light communication (UVLC) system using a single-chip silicon-substrate light-emitting diode (LED). The on-chip resource usage is analyzed and discussed. To improve the bit error rate (BER) performance, a novel channel estimation technique is proposed and investigated. It combines inter-symbol frequency averaging (Inter-SFA) and intra-symbol frequency averaging (Intra-SFA). The real-time DMT transceiver is experimentally verified in a silicon-substrate blue LED-based UVLC system with a 1.2 m underwater link. Using the enhanced channel estimation, a real-time DMT signal with a gross bit rate of 2.34 Gbit/s is transmitted over the 1.2 m underwater link with a BER of 3.5 × 10^−3. Moreover, multiple-symbol interleaved Reed-Solomon (RS) codes are employed to further improve the BER performance. With multiple-symbol interleaved RS (255, 191) codes, the real-time measured post-forward-error-correction (post-FEC) BER of the DMT-UVLC system improves by more than six orders of magnitude. Error-free transmission (BER below 1 × 10^−9) is observed in our real-time experiment. Furthermore, underwater transmission of 1.485 Gbit/s 720p high-definition video is successfully demonstrated. To the best of our knowledge, this is the first demonstration of a real-time LED-based UVLC system with DMT modulation beyond 1 Gbit/s.
Introduction
Visible light communication (VLC) is an emerging optical wireless communication technology that has been extensively studied in recent years [1]-[8]. Compared with conventional radio frequency (RF) wireless communication, VLC offers several attractive advantages, such as potentially high, license-free bandwidth, no interference with RF bands, and higher security [9]. Besides indoor and outdoor wireless access, VLC has potential applications in scenarios such as aircraft, hospitals, and underwater communication. Two light sources are commonly used in VLC systems: light-emitting diodes (LEDs) and laser diodes (LDs). LD-based VLC provides a large modulation bandwidth and readily supports long-distance, high-data-rate transmission. However, it requires precise alignment between the LD and the photodiode (PD). In [10], Lee LDs with 2 m free-space transmission with a bit-and-power loading technique in [11]. Compared with LDs, LEDs are very cost-effective and can serve both communication and illumination. Their main drawback is the limited modulation bandwidth of LED-based VLC systems. Therefore, many advanced modulation formats have been investigated to achieve higher spectral efficiency (SE), such as:
- carrierless amplitude and phase modulation (CAP) [12], [13];
- pulse amplitude modulation (PAM) [14];
- orthogonal frequency-division multiplexing/discrete multi-tone (OFDM/DMT) [15]-[17].

Pre/post-equalization and wavelength-division multiplexing (WDM) techniques are also used to further increase the data rate. In 2013, F. M. Wu et al. experimentally demonstrated a 3.22 Gbit/s WDM-CAP-VLC system with a single RGB LED over 0.25 m of air transmission [12]. Y. Wang et al. reported an 8 Gbit/s RGBY LED-based WDM-VLC system using high-order CAP modulation and a hybrid post-equalization technique [13]. N. Chi et al. presented an RGB-LED-based WDM-VLC system using PAM-8 modulation with phase-shifted Manchester coding. It achieved 3.375 Gbit/s over 1 m of indoor free-space transmission [14]. More recently, LED-based WDM-VLC systems with OFDM/DMT modulation beyond 10 Gbit/s have been reported by researchers at the University of Oxford in 2018 [15], Fudan University in 2019 [16], and the University of Edinburgh in 2019 [17].
Underwater VLC (UVLC) is now of great interest to the scientific community, industry, and the military. It has attracted much research attention following free-space VLC. Acoustic and RF links are the two other means of underwater communication. Acoustic communication is a mature approach for long-distance transmission, but it offers a low data rate (< 500 kbit/s) and high latency [18]. RF communication can raise the data rate to tens of Mbit/s. However, it suffers from high loss because of the high conductivity of seawater: RF attenuation in seawater at 1 MHz has been reported to exceed 20 dB/m [19]. Thus, the transmission distance is limited (< 10 m), and large antennas are required. In contrast, water attenuation is relatively low in the blue-green spectral window. This makes blue or green LDs and LEDs well suited to high-speed UVLC over moderate ranges. Using a blue LD and a post nonlinear equalization technique, a 16.6 Gbit/s DMT-based UVLC system with adaptive bit-power loading was demonstrated over a 5 m water link with a BER below 3.8 × 10^−3 [20]. Compared with LDs, LEDs relax the alignment requirement with the PD. They can also be easily and cost-effectively integrated into LED arrays delivering hundreds of watts for underwater illumination. Furthermore, LED arrays with multiple-input multiple-output (MIMO) techniques may enable higher data rates and longer transmission distances. These advantages make LED-based UVLC a promising solution for high-speed underwater communication. In 2019, J. Shi et al. demonstrated a 14.6 Gbit/s discrete Fourier transform spread OFDM-UVLC system. It used a silicon-substrate common-anode RGBCY LED lamp consisting of six LED chips [21]. In the same period, the same research group presented 15.17 Gbit/s bit-loading-enabled DMT transmission through a 1.2 m underwater link with an LED lamp [22]. To the best of our knowledge, this is the highest data rate reported to date for LED-based UVLC systems.
Absorption, scattering, and diffraction in water increase the power loss of a UVLC system. This comes in addition to the limited modulation bandwidth and the nonlinearity arising from the LED and amplification circuits. These effects also induce more nonlinearity than in free-space VLC systems and result in severe inter-symbol interference (ISI) [23]. DMT is a special type of OFDM with lower implementation complexity than conventional OFDM. It achieves high SE with high-order QAM formats and is robust against ISI and power fading. Recently, DMT-based UVLC has been extensively studied using offline approaches [20]-[23]. In these studies, the DSP algorithms are realized with floating-point operations in computer software. Such algorithms may not work as intended in practice, or may even be impractical to implement in real hardware. It is therefore important to fully verify the feasibility of the DSP algorithms through real-time experiments. However, hardware implementations and demonstrations of real-time LED-based UVLC systems are seldom reported in the literature.
In this paper, a 2.34 Gbit/s real-time DMT transceiver is implemented on field programmable gate array (FPGA) chips for a blue silicon-substrate LED-based UVLC system. Accurate channel estimation is achieved by cascading inter-symbol frequency averaging (Inter-SFA) and intra-symbol frequency averaging (Intra-SFA). Moreover, multiple-symbol interleaved Reed-Solomon (RS) codes are employed to further improve the BER performance. Both the number of training symbols and the number of Intra-SFA taps in the hybrid channel estimation are optimized. Real-time BER values are measured after underwater transmission. Finally, transmission of a 720p high-definition (HD) video over a 1.2 m water link is successfully demonstrated using the real-time DMT-based UVLC system. The main contributions of this work are:
(1) a hybrid channel estimation method is proposed for DMT-based UVLC systems to obtain accurate channel estimates;
(2) a simple multiple-symbol interleaved RS coding scheme is applied to improve the BER performance;
(3) a real-time DMT transceiver is designed and implemented with FPGAs.

The DSP algorithms are experimentally verified in a real-time UVLC transmission system based on a single blue LED, with a data rate beyond 1 Gbit/s, for the first time.
The rest of the paper is organized as follows. Section 2 describes the architecture of the real-time DMT transceiver in detail and analyzes its on-chip resource usage. Section 3 describes the experimental setup of the real-time UVLC system with video transmission. Section 4 presents and discusses both offline and real-time results. Conclusions are drawn in Section 5.
Real-Time DMT Transceiver Architecture and Its FPGA Implementation
To design a suitable real-time DMT transceiver architecture, the DMT frame structure and parameters must first be determined. In this work, a DMT frame consists of:
- one 512-point synchronization pattern (SP) for timing synchronization;
- 16 training symbols (TSs) for channel estimation;
- 255 RS-coded DMT symbols.

The time-domain DMT frame structure is shown in Fig. 1, and the detailed parameters of the baseband DMT transceiver are given in Table 1.
As Table 1 shows, the net bit rate reaches 1.58 Gbit/s after excluding all overheads (e.g., SP, TS, CP, and FEC overhead). This is sufficient to carry a 1.485 Gbit/s 720p video signal after error correction, and the remaining capacity is left unused. The SP and TSs are encoded with different binary phase-shift keying (BPSK) symbols in the frequency domain. The SP has no CP, and neither the SP nor the TSs are FEC-coded. Their other parameters are the same as those of the data-carrying DMT symbols.
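A rough rate budget follows from the frame parameters. The sketch below rests on three assumptions inferred from the text rather than taken from Table 1:
- the 80 data SCs each carry 6 bits (64-QAM);
- the TSs use the same 16-sample CP as the data symbols, while the SP has none;
- the RS (255, 191) code rate applies to all 255 data symbols.

Under these assumptions, the sketch reproduces the 2.34 Gbit/s gross rate (480 bits every 512 samples at 2.5 GS/s). It gives a net rate of about 1.60 Gbit/s, slightly above the 1.58 Gbit/s listed in Table 1, which may include overheads not modeled here.

```python
# Rough throughput budget for the DMT frame (assumed parameters, see lead-in).
FS_DAC = 2.5e9                 # DAC sample rate (S/s)
N_FFT = 512                    # IFFT size
N_CP = 16                      # cyclic prefix length per TS/data symbol (samples)
N_SP = 512                     # synchronization pattern length (no CP)
N_TS = 16                      # training symbols per frame
N_DATA = 255                   # RS-coded data symbols per frame
BITS_PER_SYMBOL = 80 * 6       # 80 data SCs x 6 bits (64-QAM) = 480 bits
RS_N, RS_K = 255, 191          # RS (255, 191) code

def rates():
    """Return (gross bit rate, frame length in samples, net bit rate)."""
    # Gross rate: data bits per IFFT block, ignoring CP, TS, SP and FEC overhead
    gross = BITS_PER_SYMBOL * FS_DAC / N_FFT
    # One frame = SP + (TS + data symbols), each TS/data symbol with its CP
    frame_samples = N_SP + (N_TS + N_DATA) * (N_FFT + N_CP)
    frame_time = frame_samples / FS_DAC
    # Information bits per frame after removing the RS parity overhead
    info_bits = N_DATA * BITS_PER_SYMBOL * RS_K // RS_N
    net = info_bits / frame_time
    return gross, frame_samples, net

gross, frame_samples, net = rates()
print(f"gross = {gross / 1e9:.3f} Gbit/s, frame = {frame_samples} samples, "
      f"net = {net / 1e9:.3f} Gbit/s")
```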
DMT Transmitter
Based on the time-domain structure and parameters described above, the real-time DMT transmitter structure and its DSP flow are illustrated in Fig. 2. There are two data sources: a pseudo-random binary sequence (PRBS) for BER testing and video data for video transmission. The data source is selected by an on-board dual in-line package (DIP) switch. The offline-generated PRBS is stored in a read-only memory (ROM) on the FPGA chip. The PRBS ROM module outputs 160-bit parallel raw data at a rate close to the net bit rate. These data are fed into a first-in first-out (FIFO) buffer (FIFO #1). For video transmission, a 1.485 Gbit/s high-definition serial digital interface (HD-SDI) video signal is received by a gigabit transceiver (GTX) IP core with a 148.5 MHz reference clock. Note that the GTX operates at 74.25 MHz and outputs 20 bits in parallel per clock cycle. After data combination, 160-bit parallel data are fed into the second FIFO (FIFO #2) every 8 clock cycles. The GTX in the receiver cannot recover the clock from the received data bits. Therefore, the 74.25 MHz clock recovered by the transmitter GTX is configured as an output.
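The 20-bit to 160-bit data combination can be sketched as a simple gearbox that packs eight consecutive GTX words into one FIFO word. The bit ordering is a hypothetical choice, because the text does not specify it: here the first GTX word lands in the least significant bits. The sketch also checks that 74.25 MHz × 20 bits matches the 1.485 Gbit/s HD-SDI rate.

```python
GTX_WIDTH = 20                               # bits per GTX clock cycle
FIFO_WIDTH = 160                             # bits per FIFO #2 write
WORDS_PER_BEAT = FIFO_WIDTH // GTX_WIDTH     # 8 GTX cycles per 160-bit word

def combine(words20):
    """Pack consecutive 20-bit GTX words into 160-bit FIFO words.

    Hypothetical ordering: the earliest word occupies the least significant bits.
    A trailing group of fewer than 8 words is not emitted.
    """
    out = []
    acc, count = 0, 0
    for w in words20:
        if not 0 <= w < (1 << GTX_WIDTH):
            raise ValueError("GTX word must fit in 20 bits")
        acc |= w << (GTX_WIDTH * count)
        count += 1
        if count == WORDS_PER_BEAT:
            out.append(acc)
            acc, count = 0, 0
    return out

# Parallel GTX interface rate: 74.25 MHz x 20 bits per cycle
line_rate = 74.25e6 * GTX_WIDTH
```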
When the data length of FIFO #1 or FIFO #2 reaches 191, the data are continuously read out to 20 parallel RS (255, 191) encoders. Because of the limited bandwidth of the LED-based UVLC system, the error distribution fluctuates strongly across subcarriers (SCs). This may degrade the FEC decoding performance at the receiver. Therefore, a simple multiple-symbol interleaving scheme is employed; its principle is described in our previous work [24]. The interleaved RS codes are combined into 480-bit words every 3 clock cycles (156.25 MHz) and stored in the third FIFO (FIFO #3). Once the data length of FIFO #3 reaches 255, the Control FSM module first generates the sixteen TSs. The 480-bit data are then read out for QAM mapping. However, in the video transmission test, the 20-bit data received by the GTX sometimes kept a constant pattern (not all zeros or all ones) for several hundred clock cycles or more. This leads to a high peak-to-average power ratio (PAPR) of the DMT signal. The resulting increase in clipping noise and nonlinear distortion may deteriorate the video transmission performance. A simple exclusive OR (XOR) operation is therefore used to scramble the data and solve this issue. The 480-bit data are then mapped onto 80 Gray-coded 64-QAM complex-valued symbols in a 16-bit fixed-point representation. The first SC (DC) is set to zero. Two low-frequency SCs are not used because of their low signal-to-noise ratio (SNR). The remaining 173 SCs at the high-frequency bins are filled with zeros for oversampling. Hermitian symmetry is imposed so that the DMT signal is real-valued. The mapped symbols are sent, 16 in parallel, to a 512-point partial-parallel inverse fast Fourier transform (IFFT) module that performs the orthogonal modulation. Digital clipping is applied to the real-valued IFFT outputs to reduce the PAPR as well as the digital-to-analog converter (DAC) quantization noise.
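The effect of the XOR scrambling can be illustrated in a few lines. Suppose a constant 20-bit GTX word is repeated 24 times to fill a 480-bit symbol. The 80 six-bit QAM labels then repeat with a period of 10 SCs (60 bits is a multiple of 20), so the frequency-domain symbols are periodic, which tends to concentrate the time-domain signal into peaks. XOR-ing with a fixed pseudo-random 480-bit word breaks this periodicity, and applying the same XOR at the receiver restores the data. The mask, its seed, and the LSB-first label ordering are hypothetical; the actual XOR pattern is not given in the text.

```python
import random

SYMBOL_BITS = 480   # bits per DMT symbol (80 SCs x 6 bits)
LABEL_BITS = 6      # bits per 64-QAM label
GTX_BITS = 20       # width of one GTX word

def make_mask(seed=2020):
    """Hypothetical fixed 480-bit XOR word, generated pseudo-randomly from a seed."""
    return random.Random(seed).getrandbits(SYMBOL_BITS)

def xor_scramble(word, mask):
    """Scramble (TX) or descramble (RX) one 480-bit word; XOR is its own inverse."""
    return word ^ mask

def qam_labels(word):
    """Split a 480-bit word into 80 six-bit QAM labels, LSB first."""
    return [(word >> (LABEL_BITS * i)) & 0x3F for i in range(SYMBOL_BITS // LABEL_BITS)]

def repeat_pattern(pattern20):
    """Fill a 480-bit word with 24 copies of one constant 20-bit GTX word."""
    word = 0
    for i in range(SYMBOL_BITS // GTX_BITS):
        word |= pattern20 << (GTX_BITS * i)
    return word

def has_period(labels, period):
    """True if the label sequence repeats with the given period across SCs."""
    return all(labels[i] == labels[i + period] for i in range(len(labels) - period))

raw = repeat_pattern(0x5A3C1)          # constant GTX pattern, not all zeros/ones
mask = make_mask()
tx = xor_scramble(raw, mask)
print(has_period(qam_labels(raw), 10), has_period(qam_labels(tx), 10))
```

A fixed mask still maps identical input symbols to identical output symbols. What it removes is the periodic structure across SCs within each symbol, which is what drives the PAPR up.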
The clipped data are scaled to 14 bits to match the 14-bit vertical resolution of the DAC. A 16-point cyclic prefix (CP) is appended to avoid ISI. Under the control of the Control FSM module, the synchronization pattern ROM sends the SP to the DAC interface module ahead of the 16 TSs and 255 data-carrying DMT symbols. The DAC interface module performs signed-to-unsigned conversion, data rearrangement, and parallel-to-serial conversion. In addition, this module generates the 156.25 MHz FPGA clock from the 625 MHz clock provided by the DAC chip. The on-board 200 MHz clock is used to initialize the DAC via the serial peripheral interface (SPI) bus. The 2.5 GS/s samples on the FPGA are then fed through a low-voltage differential signaling (LVDS) interface into the DAC, which is clocked by an external 2.5 GHz clock source.
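The mapping, Hermitian-symmetry, IFFT, clipping, 14-bit scaling, and CP steps described above can be sketched in floating point with NumPy. Several details are assumptions rather than features of the FPGA design:
- the two unused low-frequency SCs are taken to be bins 1 and 2, so data occupy bins 3 to 82;
- bits are mapped MSB first, with three bits per axis;
- the clipping level of 3 × RMS is hypothetical;
- a full 512-point `np.fft.ifft` stands in for the 16-path partial-parallel fixed-point IFFT.

```python
import numpy as np

N_FFT = 512          # IFFT size at 2.5 GS/s
N_CP = 16            # cyclic prefix length (samples)
FIRST_DATA_SC = 3    # assumption: bin 0 (DC) zero, bins 1-2 unused
N_DATA_SC = 80       # bins 3..82 carry data; bins 83..255 (173 SCs) and 256 stay zero
DAC_BITS = 14

# Gray-coded 8-level axis: level index i (level 2*i - 7) carries Gray label i ^ (i >> 1)
GRAY_TO_LEVEL = np.empty(8)
for i in range(8):
    GRAY_TO_LEVEL[i ^ (i >> 1)] = 2 * i - 7

def map_64qam(bits):
    """480 bits -> 80 unit-power Gray 64-QAM symbols (3 I bits then 3 Q bits, MSB first)."""
    b = np.asarray(bits, dtype=np.int64).reshape(-1, 6)
    weights = np.array([4, 2, 1])
    i_lab, q_lab = b[:, :3] @ weights, b[:, 3:] @ weights
    return (GRAY_TO_LEVEL[i_lab] + 1j * GRAY_TO_LEVEL[q_lab]) / np.sqrt(42)

def hermitian_spectrum(bits):
    """Place the 80 QAM symbols on their SCs and mirror them as X[N-k] = conj(X[k])."""
    X = np.zeros(N_FFT, dtype=complex)
    X[FIRST_DATA_SC:FIRST_DATA_SC + N_DATA_SC] = map_64qam(bits)
    k = np.arange(1, N_FFT // 2)
    X[N_FFT - k] = np.conj(X[k])
    return X

def dmt_symbol(bits, clip_ratio=3.0):
    """One time-domain DMT symbol: IFFT, clipping, 14-bit scaling, CP insertion."""
    x = np.fft.ifft(hermitian_spectrum(bits)).real   # imaginary part is ~0 by symmetry
    limit = clip_ratio * np.sqrt(np.mean(x ** 2))    # hypothetical clipping level
    x = np.clip(x, -limit, limit)
    full_scale = 2 ** (DAC_BITS - 1) - 1              # 8191 for a signed 14-bit DAC
    q = np.rint(x / limit * full_scale).astype(np.int16)
    return np.concatenate([q[-N_CP:], q])             # prepend the cyclic prefix

rng = np.random.default_rng(0)
symbol = dmt_symbol(rng.integers(0, 2, 480))          # 528 signed 14-bit samples
```

Clipping relative to the RMS level fixes the fraction of full scale spent on rare peaks. Lowering `clip_ratio` trades more clipping noise for less quantization noise, which is the balance the clipping step targets.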
Besides data source selection, the DIP switch also controls the RS encoding, symbol interleaving, and scrambling modules. This allows us to investigate the effects of RS coding and symbol interleaving on the real-time BER performance. The scrambling function is used only to improve video transmission performance.
DMT Receiver
Fig. 3 shows the DMT receiver architecture and the corresponding DSP flow. After the UVLC link, the received baseband DMT signal is first sampled by a 1.25 GSa/s 10-bit analog-to-digital converter (ADC) clocked by a 10 MHz reference clock. To avoid sampling frequency offset (SFO) and enable video transmission, this reference clock is derived from the 2.5 GHz clock source at the transmitter. SFO compensation methods have been studied in our previous works [25], [26]. The converted samples are sent to the receiver FPGA via an LVDS interface. Besides, a 625 MHz clock is also provided to the FPGA to generate 156.25 [...] fixed-point outputs are used for demodulation. Intra-SFA with a single TS, built on the conventional least-squares (LS) estimate, has been proposed and studied in fiber-optic transmission systems [27], [28]. However, the estimates on the edge subcarriers may be inaccurate because fewer taps are available for averaging there. In this paper, we employ inter-SFA over multiple TSs combined with intra-SFA to obtain an accurate channel estimate. The equalized data of each DMT symbol are demapped into 480-bit binary data. In the video transmission case, the demapped bits are descrambled with 480-bit XOR operations. Before symbol de-interleaving, each 480-bit word is split into three 160-bit parts and stored in the fourth FIFO (FIFO #4). When the length of data stored in FIFO #4 reaches 255, the 160-bit data are continuously read out to the multiple-symbol de-interleaving module. The de-interleaved data are decoded by 20 parallel RS decoders, combined into 480-bit words, and stored in FIFO #5. If PRBS data are sent from the transmitter, the errors are counted over 65,536 DMT frames. Otherwise, the GTX, clocked at 74.25 MHz, recovers the 1.485 Gbit/s HD-SDI video signal. As in the transmitter, an on-board DIP switch controls the functions of the corresponding modules. Its settings must be consistent with those of the transmitter.
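The equalization and demapping steps at the receiver can be sketched as follows. The text does not name the equalizer, so a one-tap zero-forcing division by the channel estimate is assumed here as the usual choice for DMT. The Gray 64-QAM labeling (three bits per axis, MSB first) mirrors an assumed transmitter mapping, which is included so the block runs on its own.

```python
import numpy as np

_GRAY = np.array([i ^ (i >> 1) for i in range(8)])   # Gray label of level index i
_LEVEL_OF_LABEL = np.empty(8)
_LEVEL_OF_LABEL[_GRAY] = 2 * np.arange(8) - 7         # label -> amplitude level
_BIT_SHIFTS = np.array([2, 1, 0])                      # MSB first

def map_64qam(bits):
    """Assumed reference mapper: 6 bits -> one unit-power Gray 64-QAM symbol."""
    b = np.asarray(bits, dtype=np.int64).reshape(-1, 6)
    i_lab = b[:, :3] @ (1 << _BIT_SHIFTS)
    q_lab = b[:, 3:] @ (1 << _BIT_SHIFTS)
    return (_LEVEL_OF_LABEL[i_lab] + 1j * _LEVEL_OF_LABEL[q_lab]) / np.sqrt(42)

def equalize(Y, H):
    """One-tap zero-forcing equalizer (assumption): divide each SC by its estimate."""
    return np.asarray(Y) / np.asarray(H)

def demap_64qam(symbols):
    """Hard decision: nearest amplitude level per axis, then Gray label -> 3 bits."""
    s = np.asarray(symbols) * np.sqrt(42)
    out = []
    for axis in (s.real, s.imag):
        idx = np.clip(np.rint((axis + 7) / 2), 0, 7).astype(np.int64)
        labels = _GRAY[idx]
        out.append((labels[:, None] >> _BIT_SHIFTS) & 1)
    return np.hstack(out).reshape(-1)                  # 480 bits for 80 SCs
```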
FPGA Implementation
In this work, the DMT transceiver is designed for FPGAs in the Verilog hardware description language (HDL). The DMT transmitter is implemented on a Xilinx ML605 evaluation board with a single XC6VLX240T-1FF1156 FPGA chip. The receiver is implemented on a Xilinx VC707 evaluation board with an XC7VX485T-2FFG1761 FPGA chip. Table 2 shows the on-chip resource usage of the DMT transceiver reported by the Xilinx ISE software tool.
Each Virtex-6 or 7-series FPGA slice contains four look-up tables (LUTs) and eight registers (flip-flops), which can implement both combinational and sequential circuits. LUTs are mainly used for logic operations. Some of them have additional functionality and can be configured as distributed random access memory (RAM) or shift registers. In sequential circuits, it is often desirable to store intermediate results, such as LUT outputs, in registers. Moreover, LUTs combined with the dedicated carry logic can implement arithmetic operations (e.g., addition and subtraction) efficiently [29]. Large on-chip memories, i.e., block RAMs and FIFOs, provide 18 Kb and 36 Kb of storage per block. Implementing such memories with LUTs would be expensive. The dedicated hard DSP48E1 blocks are used to implement multiplications efficiently on the FPGA.
As Table 2 shows, the receiver uses more than twice as many slices as the transmitter. This is because most of the key DSP algorithms, such as timing synchronization, channel estimation, and equalization, are located in the receiver.

In the transmitter:
- 4% of the slice registers and 9% of the slice LUTs implement the 20 RS encoders;
- the 512-point 16-path parallel IFFT module consumes 8% of the slice registers, 11% of the slice LUTs, and 140 DSP48E1s;
- 28 block RAMs are used for data storage in the PRBS ROM and the three FIFOs.

Note that not all LUTs and registers in each slice are used, in order to meet timing requirements. Thus, there is no fixed proportional relationship between the number of occupied slices and the number of slice registers or slice LUTs.

In the receiver:
- only 2,521 (0.4%) slice registers and 2,997 (1%) slice LUTs are needed to implement the low-complexity timing synchronization;
- 128 block RAMs are used to reduce the latency of the timing metric calculation;
- the 256-point 8-path parallel FFT module consumes 2% of the slice registers, 3% of the slice LUTs, and 60 DSP48E1s;
- channel estimation and the RS decoders are the most resource-consuming modules, together using up to 24% of the slice registers and 38% of the slice LUTs;
- forty-eight DSP48E1s are employed for channel equalization.
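The percentages in Table 2 are relative to device totals. As a quick cross-check for the receiver FPGA, the sketch below assumes 75,900 slices for the XC7VX485T, a figure from the Xilinx 7-series product tables that is not stated in the text. It then applies the four-LUT, eight-register slice structure described above. The result reproduces the 0.4% and 1% figures quoted for timing synchronization.

```python
# Receiver FPGA totals derived from the slice count (assumed: 75,900 slices)
# and the 7-series slice structure (4 LUTs and 8 flip-flops per slice).
SLICES = 75_900
TOTAL = {"slice registers": SLICES * 8, "slice LUTs": SLICES * 4}

def utilization_percent(used, total):
    """Resource usage as a percentage of the device total."""
    return 100.0 * used / total

timing_sync = {"slice registers": 2_521, "slice LUTs": 2_997}
for name, used in timing_sync.items():
    pct = utilization_percent(used, TOTAL[name])
    print(f"{name}: {used} / {TOTAL[name]} = {pct:.2f}%")
```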
Experimental Setup
The block diagram and a photograph of the experimental setup are presented in Fig. 4. A silicon-substrate blue LED-based UVLC transmission platform is established to investigate the BER performance and to demonstrate HD video transmission. Both FPGA chips are configured before the measurements. At the transmitter, a laptop serves as the video source and outputs a 720p video signal via its HDMI interface. An HDMI-to-SDI converter then converts the uncompressed video signal into a 1.485 Gbit/s non-return-to-zero (NRZ) on-off keying (OOK) signal. A cable equalizer with differential outputs equalizes the OOK signal and forwards it to the transmitter FPGA. Either the received video signal or the PRBS data stored in the FPGA is selected via the on-board DIP switch. The real-time DMT transmitter modulates the selected bitstream and generates the real-valued signal for the 2.5 GSa/s DAC. A low-pass filter (LPF) with a 3-dB bandwidth of ∼1 GHz is employed to suppress the images of the converted signal. Because of the limited modulation bandwidth of the LED, a bridge-T pre-equalization circuit with a bandwidth of ∼500 MHz pre-compensates the power fading on the high-frequency SCs. An electrical amplifier (EA) with a gain of 25 dB amplifies the pre-compensated signal. A bias-tee then combines it with a 120 mA LED bias current to drive the blue LED chip [30]. A lens placed after the LED collimates the light into parallel rays for underwater transmission. A 1.2 m water tank is used to emulate the underwater link.
At the receiver, a plano-convex lens with a focal length of 20 cm and a diameter of 10 cm focuses the light. An aperture placed in front of a PIN diode regulates the received optical power. The distance between the lens and the PIN diode is 19.53 cm. The differential receiver consists of a PIN diode (Hamamatsu S10784) for signal detection and a low-power transimpedance amplifier (TIA, Maxim MAX5665) for signal amplification. Two EAs further amplify the received differential signal. A digital storage oscilloscope (DSO, Agilent DSO54855A) captures one of the amplified signals for observation. The other is sampled by the 1.25 GSa/s 10-bit ADC and sent to the FPGA for real-time DMT demodulation. After demodulation, errors are counted for the PRBS data, or the video signal is recovered.
In addition, the Xilinx ChipScope Pro tools capture the ADC samples received by the FPGA and the error counts recorded in real time. These data are uploaded to the laptop via the JTAG interface for offline processing. Note that Table 2 does not include the on-chip resources used to capture these internal signals.
Results and Discussion

Offline Measurements
Channel estimation is first studied offline to identify the optimal number of TSs and intra-SFA taps for the real-time DMT-based UVLC system, starting from the LS method. Fig. 5(a) plots the amplitude responses of the 16 TSs, normalized to their mean, as a function of the SC index. The SCs at the high-frequency edge suffer more than 15 dB of power fading compared with the SCs around DC. In addition, the amplitude fluctuations on some SCs exceed 2 dB, and high-order QAM formats such as 64-QAM are very sensitive to such fluctuations. To improve the accuracy of channel estimation, we combined intra-SFA with frequency-domain averaging over multiple TSs, which we call inter-SFA, and investigated the result. As shown in Fig. 5(b), the measured BER improves with inter-SFA as the number of TSs increases. However, the improvement becomes marginal beyond 16 TSs in our measurements. Employing intra-SFA with a suitable number of taps improves the BER further. With the optimal 11 intra-SFA taps, the BER for the single-TS case is reduced to 6.0 × 10^−3, but this value may exceed the BER threshold of RS (255, 191). Combining inter-SFA with 16 TSs and 5-tap intra-SFA to enhance the LS-based channel estimation reduces the BER to 3.6 × 10^−3. Fig. 6 shows the corresponding results. They indicate that the hybrid channel estimation (inter-SFA plus intra-SFA) effectively suppresses noise and nonlinear interference and achieves a more accurate estimate. An improvement of more than 3 dB in error vector magnitude (EVM) can be observed in Figs. 6(b) and 6(e).
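The hybrid channel estimation can be sketched in three steps: LS estimation per TS, inter-SFA across TSs, and intra-SFA across neighboring SCs. The window rule is an assumption. It uses a centred moving average that is truncated at the band edges, consistent with the remark that edge SCs are averaged over fewer taps. The BPSK training values, the synthetic channel, and the noise level used to exercise it are hypothetical. They serve only to show the reduction in estimation error.

```python
import numpy as np

def ls_estimate(Y_ts, X_ts):
    """Per-TS least-squares estimate H_ls[m, k] = Y[m, k] / X[m, k], shape (M, K)."""
    return Y_ts / X_ts

def inter_sfa(H_ls):
    """Inter-symbol averaging: mean over the M training symbols for each SC."""
    return H_ls.mean(axis=0)

def intra_sfa(H, taps):
    """Intra-symbol frequency averaging over a centred window of `taps` SCs.

    Near the band edges the window is truncated, so fewer SCs are averaged.
    """
    half = taps // 2
    K = len(H)
    out = np.empty_like(H)
    for k in range(K):
        lo, hi = max(0, k - half), min(K, k + half + 1)
        out[k] = H[lo:hi].mean()
    return out

def hybrid_estimate(Y_ts, X_ts, taps=5):
    """Cascade: LS per TS -> inter-SFA across TSs -> intra-SFA across SCs."""
    return intra_sfa(inter_sfa(ls_estimate(Y_ts, X_ts)), taps)

# Hypothetical smooth channel over 80 SCs, 16 BPSK TSs, additive complex noise
rng = np.random.default_rng(3)
K, M = 80, 16
k = np.arange(K)
H_true = np.exp(-1.7 * k / (K - 1)) * np.exp(-1j * 2.0 * k / (K - 1))
X_ts = rng.choice([-1.0, 1.0], size=(M, K))
noise = 0.1 * (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
Y_ts = H_true * X_ts + noise
mse_single = np.mean(np.abs(ls_estimate(Y_ts[:1], X_ts[:1])[0] - H_true) ** 2)
mse_hybrid = np.mean(np.abs(hybrid_estimate(Y_ts, X_ts, taps=5) - H_true) ** 2)
```

On interior SCs, 16 TSs and 5 taps average up to 80 independent noise samples. The cost is some bias where the channel changes quickly across SCs, which is why fewer intra-SFA taps suit the multi-TS case.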
Transmission Performance of the Real-Time DMT-Based UVLC System
After 1.2 m underwater transmission, errors are counted over 65,536 DMT frames containing a total of 65,536 × 191 × 480 = 6,008,340,480 bits. The Xilinx ChipScope Pro tool captures the error counts both before and after FEC decoding, and Fig. 7 shows the corresponding captures. The Frame Count and Symbol Count range over 0∼65,535 and 0∼191, respectively. The errors in each DMT symbol are also counted for observation, and the last line of Figs. 7(a) and 7(b) shows the total errors. The pre-FEC error count is 22,092,288, corresponding to a BER of 3.7 × 10^−3, which agrees well with the offline BER. No errors are observed after FEC decoding. Note that this measurement used symbol interleaving and inter-SFA combined with intra-SFA channel estimation. We also investigate the effects of symbol interleaving and intra-SFA on the BER performance with inter-SFA. Table 3 shows the real-time measured BER values. Before FEC decoding, the BER is similar with and without symbol interleaving, regardless of whether intra-SFA is enabled. Without intra-SFA, symbol interleaving improves the post-FEC BER by more than three orders of magnitude. The intra-SFA-enhanced channel estimation improves the BER performance further. As a result, we successfully demonstrate real-time error-free (BER < 1 × 10^−9) data transmission at 1.58 Gbit/s.
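The BER figures follow directly from the counters. The second function adds a standard statistical check that the text does not perform. When zero errors are observed in N bits, the BER is below −ln(1 − c)/N with confidence c, which is about 3/N at 95%. The 6.0 × 10^9 error-free bits therefore support the claim of BER < 1 × 10^−9.

```python
import math

FRAMES = 65_536
INFO_BITS_PER_FRAME = 191 * 480            # information bits counted per frame
total_bits = FRAMES * INFO_BITS_PER_FRAME  # bits observed in the measurement
pre_fec_errors = 22_092_288
pre_fec_ber = pre_fec_errors / total_bits

def zero_error_ber_bound(n_bits, confidence=0.95):
    """Upper confidence bound on BER when no errors occur in n_bits (Poisson model)."""
    return -math.log(1.0 - confidence) / n_bits

print(f"total bits = {total_bits}, pre-FEC BER = {pre_fec_ber:.2e}, "
      f"post-FEC bound = {zero_error_ber_bound(total_bits):.1e}")
```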
With the symbol-interleaved RS codes and the hybrid channel estimation method, we successfully demonstrate 1.485 Gbit/s 720p high-definition video transmission as an application-level proof of concept. A recording of the video transmission demonstration has been uploaded to a video-sharing site and can be viewed at https://youtu.be/E-ycjT5xdT0.
Conclusions
In this work, we designed and implemented a 2.34 Gbit/s real-time DMT transceiver based on FPGA chips for a blue LED-based UVLC transmission system. The detailed structure of the DMT transceiver was described, and its on-chip resource usage after implementation was discussed. We studied LS-based channel estimation enhanced by inter-SFA combined with intra-SFA to improve estimation accuracy. Moreover, we applied a multiple-symbol interleaving technique to further improve the BER performance of the real-time system. The real-time experimental results showed that the hybrid channel estimation and symbol-interleaved RS coding significantly improve the post-FEC BER performance. We achieved error-free real-time transmission of a DMT signal with a net data rate of 1.58 Gbit/s over a 1.2 m underwater link. To further verify the feasibility of the real-time DMT system, we also successfully demonstrated 720p HD video transmission over this real-time UVLC system.
