Introduction
The GBT firmware code published by the GBT-FPGA group in March 2010 was designed to allow a quick and simple instantiation of the GBT protocol in both Xilinx and Altera FPGAs. The main structure of the protocol was kept as simple as possible so that users can adapt and modify the code according to their needs. The study presented below first analyzes the GBT firmware data-flow in terms of latency, excluding the platform-dependent transceiver. Next we present latency optimizations, which were also analyzed in terms of resource utilization. Finally we present the results of our tests in a real FPGA environment, carried out to verify error-free operation.
GBT firmware data-flow
Transmitter. Within the GBT-logic, data is shifted along a chain from one component to the next (figure 1). The data first passes through a scrambling unit to provide DC-balanced frames, and then through a Reed-Solomon encoder. The frame structure of the data here corresponds to the SLHC frame [1]. To improve the error correction capability, an interleaving stage was added after the encoder. This encoding scheme allows up to 16 consecutive corrupted bits to be recovered at the receiver stage [6]. The clock domain crossing from 40MHz to 120MHz is realized by a dual port memory module. An internal control unit makes sure that this multiplexing unit operates properly with the 120-bit and 40-bit data widths. In the last stage the data is serialized using a hard-IP transceiver inside the FPGA, resulting in an overall line rate of 4.8Gbps.
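The relation between the two clock domains and the line rate can be checked with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are ours, and the 120-bit frame width is inferred from the three 40-bit words per frame mentioned in the text.

```python
# Illustrative arithmetic only; names are hypothetical, not taken from the firmware.
FRAME_BITS = 120              # assumed frame width: three 40-bit words
WORD_BITS = 40                # width of the transceiver-side word
FRAME_CLOCK_HZ = 40_000_000   # 40MHz frame clock
WORD_CLOCK_HZ = 120_000_000   # 120MHz word clock


def line_rate_bps(bits_per_cycle: int, clock_hz: int) -> int:
    """Return the throughput in bits per second for a given width and clock."""
    return bits_per_cycle * clock_hz


# Both sides of the clock domain crossing must carry the same throughput.
frame_side_bps = line_rate_bps(FRAME_BITS, FRAME_CLOCK_HZ)
word_side_bps = line_rate_bps(WORD_BITS, WORD_CLOCK_HZ)

# The same relation with a 125MHz word clock gives 5Gbps.
scaled_bps = line_rate_bps(WORD_BITS, 125_000_000)

print(frame_side_bps, word_side_bps, scaled_bps)
```

Both sides give 4.8Gbps, which is why the multiplexing stage only has to change the word width and clock rate without buffering more than one frame.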
Receiver. A received data stream is first deserialized using the hard-IP transceiver. The result is a stream of 40-bit words at a frequency of 120MHz. The clock for the receiving part of the GBT-logic can be recovered from the incoming data stream and then used to clock the rest of the design. Since the Virtex 6 transceiver does not include an alignment block compatible with the SLHC frame, the alignment must be done in separate logic blocks, the frame alignment and pattern search units. These deliver 40 bits of parallel data with the header always at the beginning of one of the three words. The rest of the receiver logic chain works in the same way as its counterparts in the transmitter.
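The idea behind a pattern search for frame alignment can be sketched in software. This is a simplified bit-level model, not the firmware implementation: the 4-bit header value, the function names and the number of frames checked are assumptions chosen for illustration.

```python
import random

FRAME_BITS = 120
HEADER = [0, 1, 0, 1]  # hypothetical header pattern, chosen for illustration


def make_stream(n_frames, skew, seed=1):
    """Build a bit stream of framed random payloads preceded by `skew` stray bits."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(skew)]
    for _ in range(n_frames):
        payload = [rng.randint(0, 1) for _ in range(FRAME_BITS - len(HEADER))]
        bits += HEADER + payload
    return bits


def find_frame_offset(bits, n_check=16):
    """Return the first bit offset at which the header repeats every FRAME_BITS
    for n_check consecutive frames, or None if no such offset exists."""
    for offset in range(FRAME_BITS):
        if all(
            bits[offset + k * FRAME_BITS: offset + k * FRAME_BITS + len(HEADER)] == HEADER
            for k in range(n_check)
        ):
            return offset
    return None


stream = make_stream(n_frames=20, skew=37)
offset = find_frame_offset(stream)
print(offset)
```

Requiring the header in several consecutive frames makes a false lock on random payload bits very unlikely. A hardware implementation would typically shift the word boundary step by step instead of searching a stored buffer.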
Latency optimization
In this study, our aim was to minimize the latency contribution from every single component within the chain. The latency was first estimated by looking at event statements embedded in the VHDL code, and later confirmed by simulation. Analyzing the source code distributed with the Starter Kit resulted in a latency of approximately nine 40MHz cycles for the transmitting chain and eight 40MHz cycles for the receiving chain. Figure 1 shows the estimated latency of the different sections. We arrived at two different optimizations. Optimization 1 (Opt. 1) uses the same components as the Starter Kit but optimizes them for latency, while Optimization 2 (Opt. 2) replaces some components entirely to bring the latency to the lowest possible limit.
Transmitter. Starting at the beginning of the transmission chain, the code of the scrambling unit was optimized for latency, resulting in one 40MHz cycle. The largest latency (seven 40MHz cycles) comes from multiplexing the data from 40MHz to 120MHz (Mux). As described previously, this is realized with a dual port memory module, where optimizing the corresponding control unit improves the timing performance significantly, leading to two 40MHz cycles (Opt. 1). To decrease this latency to one 40MHz cycle, the Mux was replaced with a register controlled by a finite state machine working at 120MHz (Opt. 2).
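A behavioural model may help to picture the register-based Mux of Opt. 2. The sketch below is written in Python rather than VHDL and is not the published code: the class name, the three-state counter and the choice to send the most significant word first are all assumptions.

```python
WORD_BITS = 40
WORD_MASK = (1 << WORD_BITS) - 1


class RegisterMux:
    """Hypothetical model: one 120-bit register read out by a 3-state counter."""

    def __init__(self):
        self.state = 0      # counts 0, 1, 2 at the 120MHz word clock
        self.frame_reg = 0  # holds the current 120-bit frame

    def tick(self, frame_in):
        """Advance one 120MHz cycle and return the 40-bit word to transmit.

        frame_in is sampled only in state 0, i.e. once per 40MHz frame period.
        """
        if self.state == 0:
            self.frame_reg = frame_in
        shift = WORD_BITS * (2 - self.state)  # assumed order: most significant word first
        word = (self.frame_reg >> shift) & WORD_MASK
        self.state = (self.state + 1) % 3
        return word


def demux(words):
    """Reassemble three 40-bit words into one 120-bit frame (same assumed order)."""
    frame = 0
    for word in words:
        frame = (frame << WORD_BITS) | word
    return frame


frame = 0xAAAAAAAAAA_BBBBBBBBBB_CCCCCCCCCC
mux = RegisterMux()
words = [mux.tick(frame) for _ in range(3)]
print([hex(w) for w in words], demux(words) == frame)
```

Because the frame is held in a plain register and no memory addressing is involved, the only delay in this model is the register stage itself. The model says nothing about the timing closure of a real design.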
Receiver. Starting from the deserializer, the data passes through the frame alignment and pattern search circuits. Although the pattern search circuit is necessary to ensure alignment, there is no need for the data to pass completely through the circuit. The structure of the data flow was therefore modified as shown in figure 2. Further along the receiving chain, the demultiplexing from 120MHz to 40MHz (Demux) can be optimized in the same way as the previously described Mux, leading to a latency of two 40MHz cycles (Opt. 1) or one 40MHz cycle (Opt. 2).
Summary. The latency for the modified GBT protocol code, excluding the serializer and deserializer, is roughly seven 40MHz cycles for Optimization 1. In detail, three 40MHz cycles are contributed by the modified GBT transmitter code and three 40MHz cycles plus one 120MHz cycle by the modified GBT receiver code. This optimization does not give the shortest latency but should save logic resources since fewer registers are used. This is investigated in the next section. The same estimation for Optimization 2 results in roughly five 40MHz cycles. The modified GBT transmitter code contributes two 40MHz cycles and the modified GBT receiver code contributes two 40MHz cycles plus one 120MHz cycle. Apart from the low latency, this optimization is also platform independent, since it does not use vendor-specific hard-IP cores.
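The budget above can be added up explicitly. The following sketch only restates the figures from the text in a common unit; the function and variable names are ours. One 120MHz cycle is counted as one third of a 40MHz cycle, and one 40MHz cycle lasts 25 ns.

```python
from fractions import Fraction

CYCLE_40MHZ_NS = Fraction(25)  # period of the 40MHz clock in nanoseconds


def total_cycles(tx_cycles, rx_cycles, rx_extra_120mhz_cycles):
    """Sum transmitter and receiver latency in units of 40MHz cycles."""
    return tx_cycles + rx_cycles + Fraction(rx_extra_120mhz_cycles, 3)


opt1_cycles = total_cycles(tx_cycles=3, rx_cycles=3, rx_extra_120mhz_cycles=1)
opt2_cycles = total_cycles(tx_cycles=2, rx_cycles=2, rx_extra_120mhz_cycles=1)

opt1_ns = opt1_cycles * CYCLE_40MHZ_NS
opt2_ns = opt2_cycles * CYCLE_40MHZ_NS

print(float(opt1_cycles), float(opt1_ns))
print(float(opt2_cycles), float(opt2_ns))
```

The sums come to 19/3 and 13/3 cycles, i.e. about 6.3 and 4.3 cycles of 40MHz, which correspond to the "roughly seven" and "roughly five" cycles quoted when rounded up to whole cycles.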
Utilization studies
It is important to note that changing the GBT protocol code does not significantly increase the resource utilization. The results of this study are shown in figure 3. As a reference, the previously described Virtex 6 with 301,440 slice registers and 150,720 slice look-up-tables is used. On the transmitting side, the logic utilization is not increased. Instead, the optimized code which uses the memory modules strongly decreases the number of slice registers and slightly reduces the number of slice look-up-tables. Replacing the memories with registers increases the number of slice registers, almost to the original value. The number of look-up-tables is also slightly increased compared to the original code. In summary, the optimization does not significantly increase the utilization and actually decreases it for Optimization 1. The utilization of the receiving side gives a slightly different picture. For the first optimization with dual port memory, the number of slice registers is slightly decreased, whereas the number of slice look-up-tables is slightly increased. In the final optimization without memory modules, the number of slice registers increases as expected, while the number of slice look-up-tables decreases. This means that Optimization 1 provides more balance with respect to the number of complete slices used.
Test design and measurements
Different optimized designs were tested for reliability. The test system consisted of a Pseudo Random Number Generator (PRNG) to generate a test pattern, a GBT-transmitter, a GBT-receiver, one Gigabit Transceiver and a comparator, all contained in a Virtex 6 FPGA. The generated pseudo random pattern was passed both to the GBT-protocol logic and to the comparator. From the GBT-protocol logic, the signal was transferred to the transceiver. The output of the Gigabit Transceiver was then looped back externally to its input using coaxial cables with a length of 10 cm. After receiving the data, the GBT-protocol logic aligned automatically, reconstructed the received data and sent it to the comparator, where the generated and received data were compared. The latency of the whole test chain was determined by observing the time difference between the two data streams.
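The comparison of the generated and the received stream can be illustrated with a small software model. It is a sketch under stated assumptions and not the test firmware: the whole chain is replaced by a pure delay of eight frame cycles (an illustrative value), Python's `random` module stands in for the PRNG, and `find_delay` is a hypothetical helper.

```python
import random


def find_delay(sent, received, window=32, max_delay=64):
    """Return the smallest shift (in frame cycles) at which the first `window`
    sent frames reappear in the received stream, or None if none is found."""
    reference = sent[:window]
    for delay in range(max_delay + 1):
        if received[delay:delay + window] == reference:
            return delay
    return None


rng = random.Random(2010)                    # stand-in for the hardware PRNG
sent = [rng.getrandbits(120) for _ in range(200)]

CHAIN_DELAY = 8                              # illustrative delay of the modelled chain
received = [0] * CHAIN_DELAY + sent          # chain modelled as a pure delay line

measured_delay = find_delay(sent, received)
errors = sum(
    1 for s, r in zip(sent, received[measured_delay:]) if s != r
)
print(measured_delay, errors)
```

Once the delay is known, the comparator only has to check the two streams word by word at that fixed offset, and every mismatch counts as a transmission error.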
During the test, the transmission line rate was set to 5Gbps rather than 4.8Gbps. This was necessary due to clocking restrictions of the Xilinx ML605 development board, which was used as one clock source. The frequencies therefore had to be adapted to 41.667MHz and 125MHz. To provide a test system close to a real setup, the clock sources of the transmitter and receiver were completely separated using two independent external oscillators. One of these was situated on the ML605 board which was used for the test system, while the other was situated on a Xilinx ML507 evaluation board [7]. This reference clock signal was used to supply the serializer (light green area in figure 4). The 125MHz clock signal for the deserializer came from an on-board source (grey region in figure 4), and was only used within the hard-IP receiver of the FPGA. A further clock signal with a frequency of 250MHz was recovered from the incoming data stream. From this, 41.667MHz and 125MHz clock signals were synthesized to control the receiving part of the GBT protocol (light yellow region in figure 4). The measurements done with unmodified code and non-optimized transceivers showed a total latency of twenty-one 41.667MHz clock cycles. After optimization the latency of the test design was eight clock cycles for Optimization 1 and six clock cycles for Optimization 2. This latency includes the serialization and deserialization by the hard-IP transceiver, which was determined in simulation to be six 125MHz cycles. Taking this into account, the results were consistent with the latency estimated for the two different optimizations. Each optimization was tested for more than two days without a single error. This corresponds to an upper limit on the BER of 2×10⁻¹⁵ at a confidence level of 90%.
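The order of magnitude of the quoted BER limit can be reproduced with the standard bound for an error-free test, BER < -ln(1 - CL) / N, where N is the number of transmitted bits. We assume here that this is the relation behind the quoted figure, since the text does not state the formula. The function name is ours, and exactly two days is used as the duration although the text only says "more than two days".

```python
import math


def ber_upper_limit(line_rate_bps, duration_s, confidence):
    """Upper limit on the bit error rate after an error-free test.

    Uses the zero-error bound BER < -ln(1 - CL) / N with N transmitted bits.
    """
    n_bits = line_rate_bps * duration_s
    return -math.log(1.0 - confidence) / n_bits


TWO_DAYS_S = 2 * 24 * 3600  # assumed duration; the actual test ran somewhat longer

limit = ber_upper_limit(line_rate_bps=5e9, duration_s=TWO_DAYS_S, confidence=0.90)
print(f"{limit:.2e}")
```

For exactly two days at 5Gbps the bound is about 2.7×10⁻¹⁵. It decreases in proportion to the test duration, so a run of somewhat more than two days brings it close to the quoted value.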
