Abstract-This paper addresses the problem of achieving high bandwidth in a Data Link Layer (DLL) design for OFDM-based Visible Light Communication (VLC) broadcast systems. It describes the implementation of efficient DLL and Forward Error Correction (FEC) modules in a Xilinx FPGA. The proposed DLL aims to provide adequate means to fragment and route both high data-rate (HDR) and moderate data-rate (MDR) service requests while maintaining a continuous transmission flow. The FEC modules aim to provide sufficient error correction capability with reasonable computation overhead. Another goal was to develop these modules under a globally asynchronous, locally synchronous paradigm, ensuring high modularity and performance.
I. INTRODUCTION
VLC (Visible Light Communication) is an emerging field in optical wireless communications, where white light emitting diodes (LEDs) can be used simultaneously for illumination and data communications. Since 2011, VLC technology has gained momentum, supported by the release of the IEEE 802.15.7 draft standard, which defines the Physical and Medium Access Control (MAC) layers and supports multiple topologies with data rates up to 96 Mb/s for indoor and outdoor applications [1].
VLC systems have a myriad of potential applications. Most work has focused on scenarios where LED technology is already used for lighting, so that communications can support new value-added services. Some of these services may include the optical wireless broadcasting of multimedia streams, following the paradigm of digital fountains [2]. This approach can obviate the up-link issue in illumination-compatible VLC systems, as the typical retransmissions of TCP-based systems are not necessary. In this paper, we describe the FPGA implementation of DLL and FEC modules compatible with broadcast services. FPGAs are a natural choice of technology, as they can efficiently implement the DSP algorithms required in the physical layer (such as FEC and modulation) and simultaneously provide overall system integration and flexibility by allowing the designer to implement higher communication layers, such as the DLL, in embedded processors. This paper is organized as follows. Section II presents the VLC system's requirements and the design of the DLL and FEC modules. Sections III and IV describe their FPGA implementations and present the main performance results, respectively. Conclusions are given in Section V.
II. SYSTEM REQUIREMENTS

A. Modular Architecture
A key characteristic of the proposed system is its asynchronous architecture, where modules communicate through elastic buffers. Using this architecture, all blocks are independent, making their design and testing simpler. It also eases the task of adding or removing blocks, as there is no need to guarantee a synchronous communication path between them. Finally, this approach can maximize the system's performance due to: a) shorter critical paths, which are now confined to synchronous domains; b) reduced complexity and power consumption associated with distributing a low-skew, high-frequency clock to the entire system, since each synchronous block can operate at its minimum required speed; and c) reduced routing complexity and simpler floorplanning. Unfortunately, higher performance comes at the price of higher complexity, as the designer must implement additional control modules to guarantee compliance with the asynchronous protocol. Nevertheless, we believe that the performance gains clearly justify the effort. The system's asynchronous architecture is shown in Fig. 1. The processing unit processes data until the buffer reaches the almost full state and resumes processing when signaled that the buffer fill level has dropped below a certain threshold. The controller block keeps track of the buffer's fill level and signals the processing unit in both cases. This paper describes the implementation and performance of the DLL and FEC blocks. A description of the other transceiver blocks (modulator and demodulator) can be found in [3].
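The stall and resume behavior of the buffer controller can be illustrated with a minimal Python sketch. It is a behavioral model only, not the hardware implementation; the class name, the capacity and the two threshold values are assumptions chosen for illustration.

```python
from collections import deque


class ElasticBuffer:
    """Behavioral model of an elastic buffer with hysteresis.

    All names and threshold values are illustrative assumptions.
    """

    def __init__(self, capacity, almost_full, resume_below):
        assert 0 <= resume_below < almost_full <= capacity
        self.capacity = capacity
        self.almost_full = almost_full
        self.resume_below = resume_below
        self.fifo = deque()
        # Signal from the controller to the upstream processing unit.
        self.producer_enabled = True

    def _update_controller(self):
        level = len(self.fifo)
        if level >= self.almost_full:
            self.producer_enabled = False   # stall the processing unit
        elif level < self.resume_below:
            self.producer_enabled = True    # resume the processing unit

    def push(self, item):
        if len(self.fifo) >= self.capacity:
            raise OverflowError("elastic buffer overflow")
        self.fifo.append(item)
        self._update_controller()

    def pop(self):
        item = self.fifo.popleft()
        self._update_controller()
        return item


buf = ElasticBuffer(capacity=8, almost_full=6, resume_below=3)

produced = 0
while buf.producer_enabled:      # upstream block writes until stalled
    buf.push(produced)
    produced += 1

drained = 0
while not buf.producer_enabled:  # downstream block reads until resume
    buf.pop()
    drained += 1

print(produced, drained, len(buf.fifo))
```

The gap between the two thresholds provides hysteresis, so the processing unit is not toggled on every single read or write.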
B. DLL Architecture
Two types of value-added services were considered: an MDR service to furnish adequate means for control and management, advertising and infotainment; and an HDR service for video broadcast. The broadcast nature of this VLC system requires a DLL to arbitrate access among these services while enabling flow control and transmission reliability. With this goal in mind, the OMEGA Sequence Control concept [4] was adapted and introduced in the proposed DLL frame shown in Fig. 2. The payload and header sizes are 200 and 8 bytes, respectively.
The Sequence Control is intended to fragment both services into the same frame. It is composed of a More Fragmentation flag, a Fragment Number and a Request Size for each type of service. The fragmentation flag is set if the frame's data field holds data that is part of a larger fragmented packet. The fragment number is set to the number of the fragment within the frame and is zero in the first or only fragment. Request size is set to the fragment byte size of the service request. The Protocol Version field was inserted to accommodate future versions and optimizations of the proposed DLL, and is currently set to 1.
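As a rough illustration of how the three Sequence Control fields could be filled for one service, the Python sketch below splits a request into per-frame fragments. The function name, the dictionary layout and the `share` parameter (payload bytes available to that service in one frame) are hypothetical; the actual header bit layout is the one defined in Fig. 2 and is not reproduced here.

```python
def fragment(request: bytes, share: int):
    """Split one service request into per-frame fragments.

    `share` is the number of payload bytes granted to this service in a
    frame (hypothetical parameter, chosen by the caller).
    """
    if share <= 0:
        raise ValueError("share must be positive")
    chunks = [request[i:i + share] for i in range(0, len(request), share)]
    if not chunks:               # an empty request still yields one fragment
        chunks = [b""]
    fragments = []
    for number, chunk in enumerate(chunks):
        fragments.append({
            # Set while further fragments of the same request follow.
            "more_fragments": number < len(chunks) - 1,
            # Zero in the first or only fragment.
            "fragment_number": number,
            # Byte size of the fragment carried in this frame.
            "request_size": len(chunk),
            "data": chunk,
        })
    return fragments


frags = fragment(bytes(250), share=192)
for f in frags:
    print(f["fragment_number"], f["more_fragments"], f["request_size"])
```

A request that fits in its share produces a single fragment with the flag cleared and fragment number zero.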
The broadcast nature of this system makes source and destination addresses unnecessary. Nevertheless, a Source Address field was included in the frame with localization and user handover warnings in mind. Because the project uses OFDM, start and stop beacons were not considered: OFDM already provides frame synchronization, as it identifies at the receiving end where the DLL frames start and delivers the useful data to the upper layers.
In order to accommodate all the DLL frame parameters, the DLL emitter was structured into five blocks, as depicted in Fig. 3: i) Operations Controller - responsible for managing the remaining blocks; ii) Admission Control - controls the admission of data; iii) Fragmentation Control - fragments the data into frames; iv) Header Coder - responsible for header calculation; and v) Link Management - controls the outgoing data flow. All of these blocks, with the exception of Fragmentation Control, are connected to a shared memory where the frame is formed before exiting the DLL. On the DLL receiver side, header interpretation is performed by blocks that revert these operations.
C. FEC Architecture
FEC codes play an important role in communication systems. They make them less error prone, which directly translates into a system performance increase. In VLC systems, two types of errors are usually considered: i) random errors, which occur unpredictably with a certain probability; and ii) burst errors, which are characterized by a sequence of consecutive errors.
Both Reed-Solomon (RS) and Convolutional codes were initially considered due to their popularity in communication systems. However, as VLC systems are characterized by both random and burst errors, RS codes were selected for implementation. They present higher error correction capability as well as higher efficiency for moderate Signal-to-Noise ratios (Eb/No>5.8dB), which are expected in these systems. RS efficiency depends on the ratio between the number of data symbols to be transmitted (K) and the total number of transmitted symbols (N), as shown in (2). The N-K parity check symbols allow the code to correct up to t = 0.5*(N-K) symbols and to detect up to N-K erroneous symbols. Interleaving is essential to increase the RS burst error correction capability. Both Convolutional Interleaving and Block Interleaving were examined in this study. In the receiver, after the Block Deinterleaver, burst errors are spread over different codewords, resulting in fewer errors per codeword. Convolutional Interleaving was not adopted since it can scatter errors beyond the RS correction capability.
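The effect of block interleaving on a burst can be checked with a small Python sketch. It assumes the textbook write-by-rows, read-by-columns block interleaver with one RS codeword per row (10 codewords of 255 symbols); the read/write order of a particular IP core may differ, and the burst position and length are arbitrary illustrative values.

```python
def interleave(symbols, rows, cols):
    """Write row by row, read column by column."""
    assert len(symbols) == rows * cols
    return [symbols[r * cols + c] for c in range(cols) for r in range(rows)]


def deinterleave(symbols, rows, cols):
    """Inverse of interleave(): restore row-major order."""
    assert len(symbols) == rows * cols
    return [symbols[c * rows + r] for r in range(rows) for c in range(cols)]


N, K, ROWS = 255, 213, 10
t = (N - K) // 2  # correctable symbols per codeword

# Tag every symbol with (codeword index, position inside the codeword).
tx = [(cw, pos) for cw in range(ROWS) for pos in range(N)]
channel = interleave(tx, ROWS, N)

# Hypothetical burst: 100 consecutive corrupted symbols on the channel.
burst = range(500, 600)
hit = [channel[i] for i in burst]

errors_per_codeword = [sum(1 for cw, _ in hit if cw == k) for k in range(ROWS)]
print(t, errors_per_codeword)
```

Under these assumptions the 100-symbol burst, which would exceed t if it fell inside a single codeword, is spread so that no codeword receives more than t erroneous symbols.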
To further improve the system's reliability, a 32-bit Cyclic Redundancy Check (CRC) was also implemented, enabling validation of the received decoded data.
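For reference, a CRC-32 check of this kind can be sketched in Python with `zlib.crc32`, which implements the CRC-32 used in IEEE 802.3. The helper names and the byte order of the appended field are assumptions made for this example only.

```python
import zlib


def append_crc(frame: bytes) -> bytes:
    """Append a 4-byte CRC-32 (byte order is an assumption of this sketch)."""
    return frame + zlib.crc32(frame).to_bytes(4, "little")


def check_crc(data: bytes) -> bool:
    """Recompute the CRC over the body and compare it with the received field."""
    body, received = data[:-4], data[-4:]
    return zlib.crc32(body).to_bytes(4, "little") == received


frame = bytes(range(208))        # stand-in for a 208-byte DLL frame
protected = append_crc(frame)

corrupted = bytearray(protected)
corrupted[10] ^= 0x01            # flip a single bit

print(len(protected), check_crc(protected), check_crc(bytes(corrupted)))
```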
D. DLL Frame Size and FEC Code Selection
For the envisioned application, the frame size must be selected to maximize both the DLL and FEC frame efficiencies. The former depends on the ratio between payload (P) and frame (F) size, both in Bytes, as shown in (1), given that F includes header and payload.
DLL Frame Efficiency = P/F (1)
FEC efficiency depends on the selected code and its design parameters.
RS code efficiency = K/N (2)
To find adequate values for these parameters, other video broadcast architectures, namely the Digital Video Broadcasting-Terrestrial (DVB-T) [5], Digital Video Broadcasting-Satellite (DVB-S) [6] and Advanced Television Systems Committee (ATSC) [7] standards, were analyzed. Also, a fixed frame size is necessary to ensure the lighting functionality. Thus, the fixed-size container format MPEG Transport Stream (MPEG-TS) [8] was chosen instead of a dynamic one such as Generic Stream Encapsulation (GSE). In the MPEG-TS standard, the frame is 188 Bytes long, including a minimum 4 Byte header. Thus, the K field of the RS code should be at least 200 Bytes: 188 Bytes from the MPEG-TS frame, 8 Bytes from the DLL header and 4 Bytes for the CRC-32 field.
With this constraint in mind, as well as the goal of a 60-minute error-free broadcast of Ultra HD video streams (~25Mbps) [9], the K parameter was selected as a compromise between Frame Error Rate and system efficiency. System efficiency is given by the combined RS and DLL efficiencies, as shown in (3).
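To give a feeling for how demanding the 60-minute goal is, the following Python sketch counts the frames sent in one hour. It assumes, for illustration only, that the stream runs at exactly 25 Mb/s and that each frame carries 200 payload bytes of video; the resulting bound is an order-of-magnitude indication, not a figure taken from the measurements.

```python
BIT_RATE_BPS = 25_000_000     # assumed Ultra HD stream rate
DURATION_S = 60 * 60          # 60 minutes
PAYLOAD_BYTES = 200           # assumed video bytes per frame

# Number of frames needed to carry the stream for one hour.
frames_per_hour = BIT_RATE_BPS * DURATION_S // (8 * PAYLOAD_BYTES)

# On average fewer than one frame error per hour requires FER below this.
fer_bound = 1 / frames_per_hour

print(frames_per_hour, fer_bound)
```

Under these assumptions roughly 56 million frames are transmitted per hour, so the Frame Error Rate has to stay in the 1e-8 range or below.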
Total efficiency= P/N (3)
In light of these constraints, the selected configuration was an RS(255,213) code and a DLL frame size of 208 Bytes, which provides a payload size of 200 Bytes and a system efficiency of 78.4%. This frame size was also shown to be a good choice considering the address alignment needs of the Central Direct Memory Access (CDMA), as will be explained in section III.A.
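The efficiency figures for the selected parameters can be reproduced with a few lines of Python. The sketch simply evaluates the three ratios for P = 200, F = 208, K = 213 and N = 255; the variable names are ours.

```python
from fractions import Fraction

P, F = 200, 208      # payload and DLL frame size, in Bytes
K, N = 213, 255      # RS(255,213) data and total symbols

dll_efficiency = Fraction(P, F)      # payload over frame size
rs_efficiency = Fraction(K, N)       # data symbols over total symbols
total_efficiency = Fraction(P, N)    # payload over transmitted symbols

for name, value in [("DLL", dll_efficiency),
                    ("RS", rs_efficiency),
                    ("Total", total_efficiency)]:
    print(f"{name}: {float(value) * 100:.1f}%")
```

This gives about 96.2% for the DLL frame, 83.5% for the RS code and 78.4% overall.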
III. FPGA IMPLEMENTATION
The Spartan-6 FPGA SP605 Evaluation Kit was used for the implementation and performance evaluation of both the DLL and FEC modules presented here. However, these modules can be easily ported to higher-performance FPGAs, as they make use of features also supported in those devices. Moreover, there are compatible development boards (with similar interfaces and external memory resources) with Virtex or Kintex FPGAs, such as the ML605 or KC705 boards. Regarding software tools, the DLL module was implemented using the Xilinx Embedded Development Kit (EDK), while the FEC module was implemented using System Generator.
A. DLL Module
The DLL module was implemented in MicroBlaze, a 32-bit Reduced Instruction Set Computer (RISC) Harvard architecture soft processor core optimized for implementation in Xilinx FPGAs. MicroBlaze supports three different DMA engine IPs, but the most adequate for this module is AXI CDMA, due to the required asynchronous architecture (shown in Fig. 1). It provides high-bandwidth direct memory access between a memory-mapped source address and a memory-mapped destination address using the AXI4 protocol. AXI CDMA handles the data transfer between the DLL and its adjacent blocks, offloading the CPU so that it is used only for the control and sequencing tasks of the DLL. While data transfer is achieved with the AXI4 protocol, the initialization, status and management registers are accessed through an AXI4-Lite interface, suitable for the MicroBlaze microprocessor.
Figure 4 - Microprocessor configuration
As mentioned in section II, DLL frames have to be sent periodically due to lighting requirements and to fulfill the buffer fill level criteria of the project's asynchronous architecture. Thus, an AXI Timer and an Interrupt Controller were included as processor peripherals, both with AXI4-Lite interfaces. Processor frequency was set to its maximum (100MHz), Local Memory Size to 32kB, Instruction Cache to 16kB and Data Cache to 16kB, and default peripherals were removed with the exception of the RS232 UART (required for the user interface) and DDR3. A system assembly overview is shown in Fig. 4.
The DLL functionalities were implemented in C++ code, structured in four classes, as shown in Fig. 5: DLLclass - contains the DLL functions; DLLFclass - includes the DLL frame parameters; FIFOclass - accommodates all functions for higher layer requests; and FPGAclass - has all the required driver Application Programming Interface (API) functions for the CDMA, Timer and Interrupt Controller. This code division has several advantages: i) it provides cleaner and less error prone programming; ii) it enables accurate code debugging, as every phase of the processing can be tested independently; iii) it allows data encapsulation, i.e., the implementation details of a class are kept hidden from the remaining ones; and iv) it increases code portability (for use on other platforms).
It is also important to mention that the DLL functions were organized in such a way that each CDMA transfer is preceded and followed by a processing function, allowing transfers to occur while frame processing is ongoing. This allows the system to use simple DMA transfers, which are faster than other options because of their software and hardware simplicity. A simple DMA transfer only needs the source buffer address, the destination buffer address and the transfer length. Although only one transfer can be submitted to the hardware at a time, this is sufficient for this application because one DMA transfer is enough to send the whole frame.
The CDMA data transfer width and DDR3 data port settings were also configured for best performance. Possible data transfer widths are determined by the Xilinx Memory Interface Generator (MIG), which is responsible for generating memory interfaces for Xilinx FPGAs and accepts only 32, 64 and 128 bits as port configuration values. None of these configurations presents data alignment issues, given that the DLL frame size is 208 Bytes. By analyzing the TransferCDMA function time during preliminary DLL tests, we observed that the 64-bit configuration outperforms the others, as shown in Table I. Larger widths were expected to result in higher efficiency, but MIG and AXI bus clock limitations in the SP605 Evaluation Kit restrict the performance of the 128-bit configuration.
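The alignment claim is easy to verify: a 208-Byte frame is a whole number of bus words for each of the three port widths. The short Python check below is only an arithmetic illustration; the variable names are ours.

```python
FRAME_BYTES = 208

# Map each port width (bits) to (bus words per frame, leftover bytes).
words_per_frame = {}
for width_bits in (32, 64, 128):
    word_bytes = width_bits // 8
    words_per_frame[width_bits] = divmod(FRAME_BYTES, word_bytes)

for width_bits, (words, leftover) in words_per_frame.items():
    print(f"{width_bits}-bit port: {words} words, {leftover} leftover bytes")
```

The frame maps to 52, 26 and 13 full words for the 32-, 64- and 128-bit configurations respectively, with no partial word in any case.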
B. FEC Module
As stated above, the FEC module is composed of CRC-32, RS(255,213) and Interleaver blocks. The transmitter blocks are shown in Fig. 6. Similar blocks, which implement the complementary functions, were also developed for the receiver but are not shown here for the sake of conciseness.
Figure 5 - DLL Class Scheme
Each block was implemented according to the asynchronous architecture described in section II. Thus, each block includes a processing block, an elastic buffer and the respective control state machines. The buffer level controller provides an almost empty (AE) signal to the following block along with a read enable (RE) signal to the previous block, to guarantee a constant mean data flow between blocks. The processing block of the CRC-32 was implemented in VHDL using [10], following the IEEE 802.3 Ethernet standard, and incorporated in System Generator through a Black Box component.
The other blocks made use of existing Xilinx IP cores, namely Interleaver 7.1 and Reed Solomon Encoder 8.0. Since RS(255,213) is not a standard code, the encoder's parameters had to be configured as follows: fixed number of check symbols and block length; 8 bits per symbol; unitary scaling factor; field polynomial equal to 285 (decimal); N=255; and K=213. The interleaver was also configured with 8 bits per symbol, 10 rows and 255 columns, to match the RS(255,K) symbol size. The size of the output FIFO was set to 8kB to provide enough storage for three complete Interleaver matrices (30 RS code words). Furthermore, the buffer's full level threshold was adjusted to guarantee that the Interleaver processor can output a full matrix at once without overflow.
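Two of these settings can be sanity-checked with a short Python sketch: the decimal field polynomial 285 expanded into its polynomial terms, and the number of 10 x 255 interleaver matrices that fit in the output FIFO. The sketch assumes that 8kB means 8192 Bytes and one Byte per symbol.

```python
FIELD_POLY = 285                      # decimal value given to the encoder

# Exponents of the non-zero terms, highest first.
exponents = [i for i in range(FIELD_POLY.bit_length() - 1, -1, -1)
             if (FIELD_POLY >> i) & 1]

ROWS, COLS = 10, 255                  # interleaver matrix, one Byte per symbol
FIFO_BYTES = 8 * 1024                 # assumption: 8kB = 8192 Bytes

matrix_bytes = ROWS * COLS
matrices_that_fit = FIFO_BYTES // matrix_bytes

print(exponents, matrix_bytes, matrices_that_fit)
```

The value 285 corresponds to x^8 + x^4 + x^3 + x^2 + 1, and three matrices (7650 Bytes, i.e. 30 code words) fit in the FIFO while a fourth would not.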
IV. PERFORMANCE RESULTS
This section describes the performance evaluation setup and results for both the DLL and FEC modules implemented in the SP605 board. Post-synthesis resource utilization data for both modules is shown in Table II. Note that the absolute utilization values are of limited significance, as the target device is a low-density Spartan-6 (XC6SLX45T).
A. DLL Module
The processor timer was used to evaluate DLL efficiency in terms of throughput. Xilinx SDK profiling tools were also used, but with less reliable results. These tools are helpful to identify bottlenecks that might occur in the developed code (due to the interaction between functions executed within the programmable logic and functions executed on the processor), but their results are only estimates [11]. Several tests were conducted for profiling the CDMA function, considering a 64-bit width DLL, fixed profiling bin sizes (8 and 16 bytes) and different profiling frequencies (from 2 to 3.5MHz). As expected, all profiling results are higher than the 8μs measured with the timer (Table III). This is due to the intrusive nature of software profiling, which requires the program to be periodically interrupted to sample its program counter location and store the profile information in memory.
To measure the execution time of the DLL functions for all plausible fragmentation cases, four scenarios were established: i) no service fragmentation; ii) 250 Bytes HDR request with fragmentation - first frame with 192 bytes and second with 8 bytes; iii) 250 Bytes MDR request with fragmentation - similar to the previous one; and iv) fragmentation of both services - 250 Bytes of HDR request and 18 Bytes of MDR request, which results in a first frame with 192+8 bytes and a second with 58+8 bytes. The computation time of each DLL function (as depicted in Fig. 3) was measured five times and the average of the five samples was computed. The DLL Emitter and Receiver computation times (TDLL) were then obtained by summing the respective functions' times. Finally, throughput (RDLL) was computed using (4). The results measured in hardware (after place & route) are shown in Table III for each considered scenario.
RDLL = 8 × P / TDLL (4)
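Equation (4) can be expressed as a one-line Python helper. The computation time used below is a hypothetical placeholder, not a value from Table III; it only shows how the throughput scales with payload size when TDLL stays constant.

```python
def dll_throughput_bps(payload_bytes: int, t_dll_s: float) -> float:
    """Throughput in bit/s: payload bits divided by DLL computation time."""
    return 8 * payload_bytes / t_dll_s


T_DLL_S = 20e-6   # hypothetical computation time, for illustration only

# Throughput for a few payload sizes at constant TDLL.
rates = {p: dll_throughput_bps(p, T_DLL_S) for p in (100, 200, 400)}

for payload, rate in rates.items():
    print(f"P = {payload} Bytes -> {rate / 1e6:.1f} Mb/s")
```

With a constant computation time, doubling the payload doubles the throughput.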
As can be seen in Fig. 7, the throughput grows linearly with frame size. Increasing the frame size adds no computing delay, which shows the potential to extend future versions of the proposed DLL to larger frame sizes. However, the proposed DLL frame size was chosen with the HDR encapsulation in mind, and an MPEG-TS frame has 188 Bytes. DLL frame sizes larger than 208 Bytes would therefore not bring any advantage for this case study. Larger frame sizes are only needed if the MDR service requires them; otherwise, most of the payload would be left empty. Thus, as mentioned, part of this study was to establish a compromise between the required payload size for both services and system efficiency. Besides this, smaller frame sizes reduce latency and make the system less error prone.
B. FEC Module
To evaluate the FEC module performance, we resorted to hardware co-simulation. Due to the system's complexity, a single simulation in the Simulink environment would take several hours or even days. Hardware co-simulation is a System Generator feature that makes use of the FPGA hardware to run the model, with inputs and outputs exchanged with System Generator via JTAG.
To prevent JTAG from becoming a simulation bottleneck, the performance evaluation modules were also implemented in hardware: data generator, error generator and error rate calculation, all with configurable parameters. This way, only configuration parameters and error rate statistics are sent via JTAG. In addition, the hardware co-simulation models used asynchronous Shared FIFOs, which have separate read and write clocks. Using these FIFOs, the FPGA clock and the System Generator sample time can be completely different, allowing the hardware implementation to run at the FPGA clock speed while System Generator only takes a few values from the FIFO block at a much slower rate.
In order to obtain the system transmission rate, an accumulator was added to the System Generator model. This block counts the number of clock cycles (Nclk) and transmitted frames (Nframes) and stores them in a shared FIFO. The throughput (RFEC) can be calculated using (5), where P is the payload size and Tclk = 20ns is the FPGA clock period. From this equation, and using the Nframes and Nclk data obtained through co-simulation, the FEC module was shown to be able to transmit up to 65.3Mbps of data.
RFEC = 8 × P × Nframes / (Nclk × Tclk) (5)
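The throughput computation from the accumulator counts can be sketched in Python as follows. The frame and clock cycle counts in the example are hypothetical values chosen for illustration, not the counts obtained in co-simulation.

```python
def fec_throughput_bps(payload_bytes: int, n_frames: int, n_clk: int,
                       t_clk_s: float = 20e-9) -> float:
    """Payload bits transmitted divided by the elapsed time Nclk * Tclk."""
    return 8 * payload_bytes * n_frames / (n_clk * t_clk_s)


# Hypothetical counts: 1000 frames observed over 1,250,000 clock cycles.
rate = fec_throughput_bps(payload_bytes=200, n_frames=1000, n_clk=1_250_000)

print(f"{rate / 1e6:.1f} Mb/s")
```

With these placeholder counts the elapsed time is 25 ms and the result is 64 Mb/s.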
Frame Error Rate (FER) performance was also evaluated by varying the Error Probability value in the error generator block. We compared the results with the expected FER, considering that it is equal to the Block Error Rate (BLER) parameter (one codeword per frame). FER is related to the probability of a symbol error within a code word and is given by (6). Results are presented in Fig. 8, showing good agreement between expected and measured performance.
This work has also shown the benefits of the proposed asynchronous architecture, despite the added complexity. Finally, it is expected to pave the way for the development of a future standard for data packaging and request management within the VLC broadcast research community.
