During the next major shutdown (2019-2020), the ATLAS experiment at the Large Hadron Collider (LHC) will adopt the Front-End LInk eXchange (FELIX) system as the interface between the data acquisition, detector control, timing, trigger, and control (TTC) systems and new or updated trigger and detector front-end electronics. FELIX will function as a router between custom serial links from front-end application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs) to data collection and processing components via a commodity switched network. Links may aggregate many slower links or be a single high-bandwidth link. FELIX will also forward the LHC bunch-crossing clock, fixed latency trigger accepts and resets received from the TTC system to front-end electronics. The FELIX system uses commodity server technology in combination with FPGA-based Peripheral Component Interconnect Express (PCIe) I/O cards. The FELIX host servers will run a software routing platform serving data to network clients. Commercial off-the-shelf (COTS) servers connected to FELIX systems via the same network will run the new Software Readout Driver (SW ROD) infrastructure for event fragment building and buffering, with support for detector or trigger specific data processing, and will serve the data upon request to the ATLAS High-Level Trigger for Event Building and Selection. This paper will cover the design and status of FELIX, results of early performance testing, and integration tests with several ATLAS front-ends.
The author is with the Brookhaven National Laboratory, Upton, NY 11973-5000 USA (e-mail: weihaowu@bnl.gov).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNS.2019.2913617
I. INTRODUCTION
THE Large Hadron Collider (LHC) will undergo a series of significant upgrades in the next 10 years, which increase both collision energy and peak luminosity. As one of the four major experiments, the ATLAS experiment will also follow the same upgrade steps [1]. The Front End LInk eXchange (FELIX) is a new detector readout component being developed as part of the ATLAS upgrade effort [2]. FELIX is designed to act as a data router, receiving packets from detector front-end electronics and sending them to programmable peers on a commodity high-bandwidth network. In the ATLAS Run 3 upgrade, FELIX will be used by the Liquid Argon (LAr) Calorimeter, Level-1 Calorimeter (L1Calo) trigger system, BIS 7/8 and the New Small Wheel (NSW) muon detectors, as shown in Fig. 1 [3], [4]. In the ATLAS Run 4 upgrade, the FELIX approach will be used to interface with all ATLAS detector and trigger systems. FELIX brings multiple improvements in both performance and maintenance of the full data acquisition (DAQ) chain. Since the FELIX system maximizes the use of commodity hardware, the DAQ system can reduce its reliance on custom hardware. Furthermore, additional commercial off-the-shelf (COTS) components can be easily connected to resize the FELIX infrastructure as needed. The FELIX system implements a switched network architecture that makes the DAQ system easier to maintain and more scalable for future upgrades [5]. The FELIX architecture meets the following requirements.
1) It should be detector independent.
2) It must support the CERN standard gigabit transceiver (GBT) protocol with all its configuration options to connect to front-end units having radiation hardness concerns [6].
3) It must distribute timing, trigger, and control (TTC) signals via fixed latency optical links.
4) It must route data from different GBTx E-links to configurable network endpoints. E-links are low bandwidth (80-320 Mb/s) serial electrical links that are aggregated into a single high speed (4.8 Gb/s) GBT optical link.
5) For the ATLAS Run 4 upgrade, FELIX should also support fast calibration operations for front-end units, by implementing a mechanism to send control commands and distribute data packets simultaneously at high throughput, with a synchronization mechanism that does not involve network traffic.
In this paper, we introduce the FELIX hardware platform in Section II, the firmware design in Section III and software features in Section IV. The status of integration activities with several ATLAS front-end units is described in Section V.
II. FELIX INTERFACE CARD
The FELIX hardware platform has been developed for the final implementation in the ATLAS Run 3 upgrade. It is a standard height Peripheral Component Interconnect express (PCIe) Gen3 card. The latest version is named the FLX-712, as shown in Fig. 2. It is based on a Xilinx Kintex UltraScale field-programmable gate array (FPGA) (XCKU115-FLVF-1924) [7] capable of supporting 48 bidirectional high-speed optical links via on-board MiniPOD transceivers, with a 16-lane PCIe Gen3 interface. In comparison to the previous version (FLX-711), the FLX-712 no longer hosts the unnecessary DDR4 Small Outline Dual In-Line Memory Module connectors [8]. This eases Printed Circuit Board routing and also makes the board shorter. Since the FPGA has two Super Logic Regions (SLRs), two 8-lane PCIe endpoints are implemented in separate SLRs to achieve a balanced placement and routing, which allows more channels to be serviced and eases timing closure. Fig. 3 shows the functional block diagram of the FLX-712. Since the Xilinx UltraScale FPGA supports, at most, 8-lane PCI Express, a PCIe switch (PEX8732) [9] is used to connect two 8-lane endpoints to the 16-lane PCIe slot. This approach ensures that it is possible to achieve the required nominal bandwidth of 128 Gb/s. There are 4 transmitter MiniPODs and 4 receiver MiniPODs on board; each one has 12 high-speed Rx or Tx links connected to FPGA GTH transceivers [11]. Each of these 48 optical links can run at up to 14 Gb/s, a limit set by the MiniPODs. An onboard jitter cleaner chip (Si5345) is used to provide a low-jitter reference clock, at an integer multiple of the bunch-crossing (BC) clock, for the GTH transceivers. All of the hardware features of FLX-712 have been successfully verified. To test the PCIe interface, two Wupper direct memory access (DMA) engines were implemented in the FPGA, which are discussed in Section III-F. Counter patterns were then used to test the throughput to the host server.
The total measured throughput of these two 8-lane PCIe Gen3 endpoints can be up to 101.7 Gb/s, in agreement with the PCIe specification. To test the optical links, the Xilinx IBERT IP [10] was used to perform bit error rate (BER) and eye diagram tests at line rates of 12.8 and 9.6 Gb/s [11]. The results show that the BER is smaller than 10^-15 for all of the 48 optical links.
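To put these two figures in perspective, the short Python sketch below estimates how long an error-free run must last to support a BER limit of 10^-15, and what payload rate a 16-lane PCIe Gen3 link offers after line encoding. It assumes the standard zero-error confidence bound N = -ln(1 - CL)/BER; the 95% confidence level and the function names are illustrative choices, not taken from the tests described here. The PCIe figure uses the Gen3 values of 8 GT/s per lane and 128b/130b encoding.

```python
import math


def error_free_bits(ber_limit, confidence=0.95):
    """Bits that must be received without error to claim BER < ber_limit."""
    return -math.log(1.0 - confidence) / ber_limit


def hours_needed(ber_limit, line_rate_gbps, confidence=0.95):
    """Duration of an error-free run at the given line rate, in hours."""
    seconds = error_free_bits(ber_limit, confidence) / (line_rate_gbps * 1e9)
    return seconds / 3600.0


def pcie_gen3_payload_gbps(lanes):
    """Raw PCIe Gen3 rate after 128b/130b encoding (8 GT/s per lane)."""
    return lanes * 8.0 * 128 / 130


if __name__ == "__main__":
    for rate in (12.8, 9.6):
        print(f"{rate} Gb/s: about {hours_needed(1e-15, rate):.0f} h error-free")
    usable = pcie_gen3_payload_gbps(16)
    print(f"x16 Gen3 after encoding: {usable:.1f} Gb/s")
    print(f"measured / encoded rate: {101.7 / usable:.2f}")
```

Under these assumptions a single link needs roughly 65 h at 12.8 Gb/s, or roughly 87 h at 9.6 Gb/s, of error-free running. The remaining gap between the measured 101.7 Gb/s and the encoded link rate is plausibly due to packet header and protocol overhead.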
In order to make the update of the FELIX firmware easier, an onboard parallel flash can store four different firmware bitfiles in separate partitions. These stored bitfiles can be updated and verified by FELIX software tools via the PCIe interface. A microcontroller (ATMEGA324A) [12] is used to control the reconfiguration of the FPGA from selectable bitfiles stored in the flash memory. Software tools in the host server communicate with the microcontroller via the System Management Bus. The microcontroller reads the status of onboard switches and uses it as the I2C slave address, which can be used as the board ID. The flash partition selection can be controlled by the FPGA, the microcontroller, and by jumpers. FPGA firmware has the highest priority and the jumpers have the lowest priority.
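The priority order for flash partition selection can be illustrated with a small Python sketch. The function name, the use of `None` to mean "no request from this source", and the 0-3 partition numbering are assumptions made for illustration only; the actual selection logic on the card is not described at this level of detail.

```python
from typing import Optional

NUM_PARTITIONS = 4  # the flash stores four firmware bitfiles


def select_partition(fpga_request: Optional[int],
                     mcu_request: Optional[int],
                     jumper_setting: int) -> int:
    """Pick the flash partition: FPGA first, then microcontroller, then jumpers.

    A request of None means that source expresses no preference (assumption).
    """
    for choice in (fpga_request, mcu_request, jumper_setting):
        if choice is not None:
            if not 0 <= choice < NUM_PARTITIONS:
                raise ValueError(f"invalid partition {choice}")
            return choice
    raise ValueError("no partition selected")


if __name__ == "__main__":
    print(select_partition(None, None, 1))  # jumpers decide
    print(select_partition(None, 2, 1))     # microcontroller overrides jumpers
    print(select_partition(3, 2, 1))        # FPGA firmware overrides both
```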
A mezzanine card has been developed to receive TTC information. It is connected to the FLX-712 via a Samtec SEARAY connector, as shown in Fig. 4. It can be populated to interface to the LHC legacy TTC, TTC-Passive Optical Network, or White Rabbit systems. In the configuration for the legacy TTC system, an onboard clock and data recovery application-specific integrated circuit (ASIC) (ADN2814) is used to recover the 160-MHz LHC TTC clock and data.
The FPGA is equipped with a standard axial fan heat-sink. It is estimated that the whole FLX-712 card will consume less than 64 W. The air flow in the server should be sufficient and no additional cooling appears to be required (the card consumes less power than a GPU). The temperature of a FLX-712 card installed in a host server can also be checked with software tools. For a project with 46 links of about 5 Gb/s, 
III. FELIX FIRMWARE
The FELIX firmware supports two modes: GBT mode and FULL mode. GBT mode uses the GBT architecture and protocol developed by CERN, providing a bidirectional high-speed (4.8 Gb/s) radiation-hard optical link [6]. FULL mode uses a customized lightweight protocol for the data path from the front-ends to FELIX, providing a higher maximum payload at a line rate of 9.6 Gb/s. As FULL mode uses 8b/10b encoding, a maximum user payload of 7.68 Gb/s can be achieved. The main functional blocks of the FELIX firmware, shown in Fig. 5, consist of a GBT wrapper, Central Router, PCIe DMA engine, and other modules. Two sets of firmware modules are instantiated in the top-level design to have a balanced structure and to ease FPGA net routing.
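The payload figures for the two modes follow from simple arithmetic, sketched below in Python. The FULL mode value applies the 8b/10b ratio to the 9.6 Gb/s line rate. The GBT value assumes the five 16-bit E-groups per frame described in Section III-D and a nominal 40-MHz frame rate (the actual bunch-crossing clock is about 40.08 MHz); the function names are illustrative.

```python
def full_mode_payload_gbps(line_rate_gbps=9.6):
    """User payload of an 8b/10b encoded link: 8 data bits per 10 line bits."""
    return line_rate_gbps * 8 / 10


def gbt_frame_payload_gbps(e_groups=5, bits_per_group=16, frame_rate_mhz=40.0):
    """E-group payload of a GBT link in frame-encoding mode."""
    return e_groups * bits_per_group * frame_rate_mhz / 1000.0


if __name__ == "__main__":
    print(f"FULL mode payload: {full_mode_payload_gbps():.2f} Gb/s")
    print(f"GBT E-group payload: {gbt_frame_payload_gbps():.2f} Gb/s")
```

Under these assumptions the E-group payload of a 4.8-Gb/s GBT link is 3.2 Gb/s, compared with 7.68 Gb/s for a FULL mode link.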
A. TTC Decoder
In addition to routing front-end data streams, FELIX also distributes TTC information to front-end electronics from the TTC system. The TTC decoder firmware module is based on the TTC firmware from the CERN GLIB project [13]. It receives the clock and serial TTC data from a TTC optical fiber via a clock and data recovery chip (ADN2814). The serial TTC data contain two interleaved data streams: the A-channel, reserved for the Level-1 Accept, and the B-channel, which carries other commands such as bunch counter reset (BCR). The A- and B-channels are interleaved bit by bit, and the B-channel is further encoded by a Hamming code. The correct alignment of the 40.08-MHz LHC bunch-crossing clock must be deduced from the A- and B-channel streams. A state machine is used to sample these two data streams with the 160.32 MHz recovered clock from the ADN2814. It also separates the A-channel and B-channel information, extracts broadcast commands from the B-channel data stream, and provides a 40.08-MHz clock aligned to the bunch crossing clock. This clock is used to choose the correct phase of a 40.08-MHz clock generated by a Xilinx clock management module [Mixed Mode Clock Manager (MMCM)] in the FPGA from the 160.32-MHz recovered clock. The architecture of the TTC decoder is shown in Fig. 6.
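The bit-by-bit separation of the two channels can be pictured with the following Python sketch. It is a software illustration only, not the firmware state machine: it assumes the serial stream has already been reduced to one A-channel bit and one B-channel bit per bunch crossing, and that the alignment (here the hypothetical `phase` parameter) is already known. Deducing that alignment, and decoding the Hamming-protected B-channel frames, are left out.

```python
def deinterleave(bits, phase=0):
    """Split an interleaved bit list into (a_channel, b_channel).

    bits  -- sequence of 0/1 values, A and B bits alternating
    phase -- index of the first A-channel bit (0 or 1); assumed known here
    """
    if phase not in (0, 1):
        raise ValueError("phase must be 0 or 1")
    aligned = list(bits[phase:])
    usable = len(aligned) - len(aligned) % 2  # drop a trailing unpaired bit
    a_channel = aligned[0:usable:2]
    b_channel = aligned[1:usable:2]
    return a_channel, b_channel


if __name__ == "__main__":
    stream = [1, 0, 0, 1, 0, 1]
    print(deinterleave(stream, phase=0))
    print(deinterleave(stream, phase=1))
```

Choosing the wrong phase swaps the roles of the two channels, which is why the decoder has to establish the alignment from the streams themselves before the outputs can be trusted.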
B. Clock Distribution
The 40.08-MHz TTC clock generated by the MMCM shown in Fig. 6 is distributed via a dedicated clock net to the rest of the FPGA fabric. Due to the low jitter requirement of the high-speed GTH transceivers, their reference clock is provided by the on-board jitter cleaner (Si5345) that multiplies the frequency and cleans the jitter. Fig. 7 shows which clock signals are generated and how they are used. For test purposes, it is also possible to use a local oscillator as the master clock.
C. GBT Wrapper
The FELIX GBT wrapper is based on the CERN GBT-FPGA firmware with several performance improvements [14]. It is used to interface with front-end electronics via optical links, as shown in Fig. 5 (left). It encapsulates the forward error correction (FEC) encoder/decoder, a scrambler/descrambler, and a gearbox architecture. To decrease the latency, the frequency of the FEC encoder/decoder and scrambler/descrambler clock domain was increased to 240 MHz [15]. The GBT protocol supports GBT frame-encoding mode and wide-bus mode [14]. The wide-bus mode is not radiation tolerant, as the FEC encoder and decoder are sacrificed in the to-host direction in favor of a higher user payload. In order to allow choosing between the GBT frame-encoding mode and wide-bus mode at runtime, two multiplexers are added: one for the FEC encoder and the other for the FEC decoder. A finite-state machine (FSM) is implemented for automatic alignment of the GBT RX data stream. The registers of this GBT wrapper are mapped to the PCIe interface to allow software tools to control and monitor its status.
D. Central Router
The Central Router shown in Fig. 5 routes and formats the data streams between the GBT wrapper and the PCIe DMA engine. It handles two data path directions independently. On the GBT side, it implements a data manager for each link supporting GBT frame-encoding data. On the PCIe engine side, there is a FIFO with a 256-bit wide port. For each GBT data frame, there are five E-groups in each direction. Each E-group transfers 16 bits of data that consist of several E-links at 40 MHz. E-links are low-bandwidth data streams with four possible data widths of 2, 4, 8, and 16 bits, corresponding to data rates of 80, 160, 320, and 640 Mb/s. The 640-Mb/s data rate uses two adjacent 320 Mb/s lanes since it is not supported directly by the GBTx ASIC [6] . The Central Router processes these E-link data streams separately with header and trailer information added.
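The relation between E-link width and data rate, and the way one 16-bit E-group can be divided among E-links, can be sketched in Python as follows. The MSB-first bit ordering and the function names are assumptions made for illustration; the firmware's actual bit assignment is not specified here. The nominal 40-MHz frame rate is used, matching the rates quoted above.

```python
FRAME_RATE_MHZ = 40           # one GBT frame per bunch crossing (nominal)
VALID_WIDTHS = (2, 4, 8, 16)  # E-link widths in bits per frame


def elink_rate_mbps(width_bits):
    """Data rate of an E-link that carries width_bits per 40-MHz frame."""
    if width_bits not in VALID_WIDTHS:
        raise ValueError(f"unsupported E-link width {width_bits}")
    return width_bits * FRAME_RATE_MHZ


def split_egroup(word16, widths):
    """Split one 16-bit E-group word into per-E-link values, MSB first."""
    if any(w not in VALID_WIDTHS for w in widths) or sum(widths) > 16:
        raise ValueError("E-link widths must be 2, 4, 8 or 16 and fit in 16 bits")
    values = []
    shift = 16
    for width in widths:
        shift -= width
        values.append((word16 >> shift) & ((1 << width) - 1))
    return values


if __name__ == "__main__":
    print([elink_rate_mbps(w) for w in VALID_WIDTHS])
    print([hex(v) for v in split_egroup(0xABCD, [8, 4, 4])])
```

In this example one E-group carries an 8-bit E-link (320 Mb/s) and two 4-bit E-links (160 Mb/s each).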
E. BUSY and Flow Control
FELIX supports both a BUSY and a flow control architecture, as shown in Fig. 8 . The assertion of a BUSY signal is a request to the Central Trigger Processor (CTP) to stop generating Level-1 Accept triggers which eventually stops the data flow. Components in the data flow both upstream and downstream from FELIX can send busy-on and busy-off requests. Because BUSY assertion forces ATLAS dead time, its use should be limited to stopless recovery, start of run or emergency situations when buffers are almost full. Each FLX-712 card is capable of asserting BUSY via a LEMO connector on its panel. Flow control, on the other hand, recognizes that the congestion is likely only temporary and, assuming the data source has sufficient buffers, transmission can be paused without harm or data loss. FELIX can issue XON and XOFF flow control signals to its input links when its buffers become full. FULL mode uplinks, typically driven by FPGAs with buffers, may also support flow control. GBT mode uplinks, typically driven by front-end ASICs with small derandomizer buffers, so far do not handle flow control.
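The difference between pausing a link with XOFF and resuming it with XON can be illustrated with a small watermark model in Python. This is a sketch under stated assumptions: the class name, the two threshold levels, and the hysteresis between them are hypothetical and are not taken from the FELIX firmware, which is only described as issuing XOFF when its buffers become full.

```python
from typing import Optional


class FlowControl:
    """Toy XON/XOFF generator driven by a buffer fill level."""

    def __init__(self, xoff_level: int, xon_level: int):
        if xon_level >= xoff_level:
            raise ValueError("xon_level must be below xoff_level")
        self.xoff_level = xoff_level  # pause the source at or above this fill
        self.xon_level = xon_level    # resume the source at or below this fill
        self.paused = False

    def update(self, fill: int) -> Optional[str]:
        """Return 'XOFF' or 'XON' when the state changes, otherwise None."""
        if not self.paused and fill >= self.xoff_level:
            self.paused = True
            return "XOFF"
        if self.paused and fill <= self.xon_level:
            self.paused = False
            return "XON"
        return None


if __name__ == "__main__":
    fc = FlowControl(xoff_level=900, xon_level=300)
    for fill in (100, 950, 600, 250, 400):
        print(fill, fc.update(fill))
```

Separating the two thresholds avoids rapid toggling when the fill level hovers near a single limit. It works only if the data source can buffer data while paused, which is the condition stated above for FULL mode uplinks.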
F. PCIe Wupper
PCIe firmware, called Wupper, was designed to provide a simple DMA interface for the Xilinx PCIe Gen3 hard block [16]. It transfers data between a 256-bit wide user logic FIFO and the host server memory, according to the addresses specified in DMA descriptors. Up to eight descriptors can be queued to be processed sequentially. Since the Xilinx PCIe Gen3 hard block only supports a maximum of eight lanes, the FPGA implements two 8-lane PCIe endpoints with separate DMA engines. For each 8-lane PCIe Gen3 endpoint, the theoretical maximum throughput is 64 Gb/s. For the FLX-712, the 16-lane PCIe interface can support an effective data rate of more than 100 Gb/s. Eight DMA descriptors, with an address, a read/write flag, the transfer size (number of 32-bit words), and an enable line, are mapped as normal PCIe memory or input/output (IO) registers. The block diagram of the Wupper design is shown in Fig. 9. Its functional blocks can be categorized into two groups: DMA control and DMA write/read. The DMA control parses and monitors received descriptors. It also makes the descriptor status available to software via the PCIe interface. Depending on the address range of the descriptor, the pointer to the current address is handled by DMA control and incremented every time a transaction layer packet (TLP) completes. DMA control can handle a circular buffer DMA if this is requested by the descriptor. DMA control also contains a register map, with addresses of the descriptors, status registers, and external registers for the user space register map. The DMA write/read blocks process the data streams for both directions. If the received descriptor is a to-host descriptor, the payload data are read from the user logic FIFO and added after the header information. If the descriptor is a from-host descriptor, the header of received data is removed and the length is checked; then the payload is shifted into the FIFO.
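The descriptor fields and the pointer handling described above can be modeled in a few lines of Python. This is a software illustration, not the Wupper register layout: the class and field names are hypothetical, and disabling a non-circular descriptor once its range is exhausted is an assumption about the behavior.

```python
from dataclasses import dataclass


@dataclass
class DmaDescriptor:
    address: int            # start address in host memory
    size_words: int         # transfer size in 32-bit words
    to_host: bool           # read/write flag: True for the to-host direction
    circular: bool = False  # wrap around instead of stopping at the end
    enabled: bool = True


class DmaPointer:
    """Tracks the current address of one descriptor as TLPs complete."""

    def __init__(self, descriptor: DmaDescriptor):
        self.descriptor = descriptor
        self.offset = 0  # bytes transferred within the descriptor range

    def on_tlp_complete(self, tlp_bytes: int) -> int:
        """Advance the pointer by one completed TLP; return the new address."""
        size_bytes = self.descriptor.size_words * 4
        self.offset += tlp_bytes
        if self.offset >= size_bytes:
            if self.descriptor.circular:
                self.offset %= size_bytes         # wrap to the buffer start
            else:
                self.offset = size_bytes
                self.descriptor.enabled = False   # one-shot transfer finished
        return self.descriptor.address + self.offset


if __name__ == "__main__":
    ring = DmaPointer(DmaDescriptor(0x1000, 64, to_host=True, circular=True))
    print(hex(ring.on_tlp_complete(128)), hex(ring.on_tlp_complete(128)))
```

With a circular descriptor the pointer returns to the start address after the last word, so software can keep reading from the same buffer without re-arming the DMA for each transfer.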
IV. FELIX SOFTWARE
The FELIX software suite has different layers: for example, low-level software tools, test software, and production software. The test software is for development purposes, and the production software tools are released to users for applications. These software tools are developed in C/C++ and run on CentOS 7 or Scientific Linux CERN 6 systems. Access to the FELIX hardware level is controlled via two device drivers in the form of kernel modules: flx and cmem_rcc. The flx driver is a conventional character driver for PCIe interface cards. Its main function is to provide virtual addresses for the registers of an FLX-712 card that can be used directly by user processes to access the hardware. This design avoids the overhead of a context switch per IO transaction and is therefore essential for better performance. The cmem_rcc driver, from the ATLAS trigger and TDAQ project, allows the application software to allocate large buffers of contiguous memory. For use with FELIX, it has been tested for buffers of up to 16 GB.
FELIX supports dynamic configuration of E-links on GBT links, such as E-link width, encoding of an E-link's data, and whether an E-link is disabled or enabled. Such a configuration should match the configuration of the front-end GBT links. Though low-level tools can be used for this E-links configuration, Elink Configurator is a graphical tool developed to offer a user-friendly interface for creating and modifying a configuration, as shown in Fig. 10 . It displays a graphical representation of the division of E-links in 16 bits of the GBT frame (so-called E-group) for both to-host and from-host directions. The user can enable or disable any 16-bit E-group and define its E-links as needed. The Elink Configurator tool is also capable of saving the configuration to a local file and loading the configuration from a previously saved file. It supports the two modes in which the GBT links can be used, i.e., GBT frame-encoding and wide-bus modes, and also supports FULL mode links.
The Felixcore application handles the data between the FLX-712 card and a dedicated library called NetIO. Its functional architecture is shown in Fig. 11 . It does not perform any content analysis or manipulation of the data, other than that which is needed for decoding and transport. The DMA engine transfers a data stream into a contiguous circular buffer which is allocated using the cmem_rcc driver in the memory of the host server. Continuous DMA enables data transfer at full speed and does not require the DMA to be reset for each transfer. Data blocks retrieved from the circular buffer are inspected for integrity while extracting the E-link identifier and sequence number. The block is then copied to a selected worker thread based on the E-link identifier. The worker threads recombine the data stream for each E-link if any splitting for transport is required. Once the data reconstruction is complete, blocks are appended with a FELIX header and published to the network through NetIO.
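The dispatch and recombination steps can be sketched in Python as follows. The block layout (an E-link identifier, a payload, and a flag marking the last piece of a chunk) and the modulo mapping from E-link to worker are assumptions made for illustration; the source states only that the worker is selected based on the E-link identifier. Sequence number checks and the FELIX header are omitted.

```python
from collections import defaultdict
from typing import Optional


def worker_for(elink_id: int, n_workers: int) -> int:
    """Map an E-link to a fixed worker so its blocks stay in order."""
    return elink_id % n_workers


class Reassembler:
    """Recombines split payloads per E-link inside one worker."""

    def __init__(self):
        self.partial = defaultdict(bytearray)

    def push(self, elink_id: int, payload: bytes, last: bool) -> Optional[bytes]:
        """Add one block; return the complete chunk when its last piece arrives."""
        self.partial[elink_id] += payload
        if not last:
            return None
        return bytes(self.partial.pop(elink_id))


if __name__ == "__main__":
    worker = Reassembler()
    print(worker_for(5, n_workers=4))
    print(worker.push(5, b"ab", last=False))
    print(worker.push(5, b"cd", last=True))
```

Sending every block of a given E-link to the same worker means the recombination needs no locking between workers, at the cost of an uneven load if a few E-links dominate the traffic.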
NetIO is implemented as a generic message-based networking library that is tuned for typical use cases in DAQ systems. It offers four different communication modes: low-latency point-to-point communication, high-throughput point-to-point communication, low-latency publish/subscribe communication, and high-throughput publish/subscribe communication. NetIO has a backend system to support different network technologies and Application Program Interfaces (APIs). At this time, two different backends exist. The first backend uses Portable Operating System Interface sockets to establish reliable connections to endpoints, which is used for Transmission Control Protocol/IP connections in Ethernet networks. The second backend uses Infiniband, with an implementation using the Fabric Library [17]. Libfabric is a network API that is provided by the OpenFabrics Working Group.
Fig. 12. Software ROD performance. One Software ROD is able to handle input from eight FELIX cards with event rate above 100 kHz.
A number of benchmarks have been carried out to evaluate the performance of the Felixcore application and NetIO. These tests were run with one host server acting as the FELIX system and another host as the data receiver. A 40-GbE connection was available between the hosts. In the GBT mode performance test, two FLX-712 cards were used to support 48 GBT links. The FLX cards were configured for the most demanding workload for the ATLAS Run 3 upgrade, with 8 E-links per GBT link and a chunk size of 40 Bytes. The system is comfortably able to transfer the full load at above the ATLAS L1 Accept rate of 100 kHz. Benchmarking for the FULL mode case also indicates that it can handle a 100 kHz event rate with a data rate above 5.7 GB/s.
A Software ROD (ReadOut Driver) is an application running on a commodity server that receives data from one or more FELIX systems and performs flexible data aggregation and formatting tasks. Incoming data packets associated with a given ATLAS event are automatically logically aggregated into a larger event fragment for further processing. The data are finally formatted to match common ATLAS specifications, as produced by the existing readout system, for consumption by the High-Level Trigger (HLT) on request. Benchmarks for the current aggregation algorithms, including realistic simulation of the cost of subdetector processing and HLT request handling, were carried out with simulated input data from multiple FELIX cards, each with 192 E-links and realistic packet sizes. The test results are shown in Fig. 12. The algorithm is able to handle input from multiple FELIX cards. The performance correlates with host CPU speed and number of cores. One Software ROD processes input from up to eight FELIX cards at event rates greater than 100 kHz. The 1%, 50%, and 100% in the plot refer to the fraction of the events arriving at the software ROD which the HLT then samples.
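The offered load in the GBT mode benchmark can be estimated with the short Python calculation below. It assumes that every E-link sends exactly one 40-Byte chunk per Level-1 Accept and ignores block, FELIX, and network headers, so it is a lower bound on the traffic rather than a measured value; the function name is illustrative.

```python
def offered_load_gbps(links, elinks_per_link, chunk_bytes, rate_khz):
    """Payload rate in Gb/s if each E-link sends one chunk per trigger."""
    bytes_per_second = links * elinks_per_link * chunk_bytes * rate_khz * 1e3
    return bytes_per_second * 8 / 1e9


if __name__ == "__main__":
    load = offered_load_gbps(links=48, elinks_per_link=8,
                             chunk_bytes=40, rate_khz=100)
    print(f"payload load: {load:.2f} Gb/s over 384 E-links")
    print(f"share of a 40-GbE link: {load / 40:.0%}")
```

Under these assumptions the payload is about 12.3 Gb/s, roughly a third of the 40-GbE link. The many small chunks make the per-message handling cost, rather than raw bandwidth, the more likely limit in this configuration.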
V. INTEGRATION TESTS WITH DIFFERENT FRONT ENDS
For the upcoming ATLAS Run 3 upgrade in 2019, FELIX will be implemented to interface with several detector front-ends, such as the Muon Spectrometer's NSW, Liquid Argon (LAr) Calorimeter Trigger Digitizer Board (LTDB), and the L1Calo trigger system [3] , [18] . For the Run 4 upgrade of High-Luminosity LHC (HL-LHC), the plan is to adopt FELIX to interface with all detector front-ends.
A. Integration Test With New Small Wheel Front-Ends
In the NSW integration tests, FELIX successfully distributed TTC information to front-end electronics, including the bunch crossing clock and L1A trigger signal. The bidirectional dataflow toward front-ends has been demonstrated. FELIX can also trigger a front-end test pulse from a test application and successfully configure ASICs and FPGAs via the GBT-SCA's General Purpose IO, I2C, Serial Peripheral Interface, and Joint Test Action Group interfaces [19]. Other highlights include the ability to read out Analog-to-Digital Converter (ADC) monitoring data and configure the GBTx on the Level-1 Data Driver Card board [20]. Taken together, these tests provide a robust demonstration of the functionality of the IC and SCA links in the GBT frame [6].
B. Integration Test With Liquid Argon Calorimeter LTDB
In the LAr Calorimeter Run 3 upgrade, the LAr Trigger Digitizer Board (LTDB) digitizes input analog signals and transmits them to the back-end system [4]. There are five GBTx and five GBT-SCA chips on the LTDB prototype. Five GBT links in total from FELIX are connected to the LTDB. Part of the connection scheme (one GBT link) is shown in Fig. 13. GBT-SCA chips are used to control the power rails and I2C buses as well as perform on-board temperature measurement [19]. Besides the interface to External Control links with the GBT-SCA chip, each GBTx on the LTDB provides the recovered 40-MHz TTC clock from a FELIX GBT link to the ASICs of the NEVIS ADC and serializers LOCx2, and also sends the BCR signal to the LOCx2 ASIC [21], [22]. In this integration test, FELIX can perform configuration and monitoring, as well as TTC information distribution successfully.
C. Integration Test With gFEX
The Global Feature Extractor (gFEX) is one of several modules that will be deployed in the L1Calo trigger system in the ATLAS Run 3 upgrade [23]. In the integration test of gFEX and FELIX, gFEX needs to recover the TTC clock from a FELIX GBT link at 4.8 Gb/s, and also receive TTC signals such as Level-1 trigger Accept and BCR signals. As for the to-host path, gFEX needs to send data to FELIX using FULL mode optical links at 9.6 Gb/s. A block diagram of the test setup is shown in Fig. 14. The test results show that gFEX recovers a stable TTC clock and receives the TTC information correctly. The latency of TTC signal transmission (from TTC system to gFEX through FELIX) is fixed and does not change under conditions such as transceiver reset, fiber reconnection, TTC system power cycling, and FELIX and gFEX power cycling. The FULL mode links from gFEX to FELIX have been tested with the pseudorandom bit sequence (PRBS)-31 data pattern. No error was observed and the BER is smaller than 10^-15.
VI. CONCLUSION
FELIX is a readout system that interfaces custom links from front-end electronics to standard commercial networks in the ATLAS upgrade. FELIX also distributes the LHC bunch-crossing clock, trigger accepts, and resets received from the TTC system to detector front-ends through fixed latency optical links. It supports the CERN standard 4.8-Gb/s GBT protocol and a customized lightweight FULL mode which has a higher throughput of 9.6 Gb/s. The results of integration and performance tests with ATLAS front-end systems to date indicate that FELIX is on course to be ready for deployment in 2019.
