Index Terms: ATLAS, detector control system, field-programmable gate array (FPGA), pixel detector, read-out driver, trigger and data acquisition (TDAQ).
I. INTRODUCTION

In the next few years, the Large Hadron Collider (LHC) at CERN will extend its investigation of the fundamental structure of matter by undergoing a series of major upgrades [1]. All the experiments located on the LHC ring (such as the ATLAS experiment) will be upgraded as well, in order to reach the performance required by the physics challenges foreseen for the future. Two key factors affect all the detectors: the increase in instantaneous luminosity, which corresponds to an increase in the number of simultaneous collisions (pile-up) and hence in the event size, and the increase in trigger rate, which will be on the order of 1 MHz, ten times higher than the current rate of ≈100 kHz. The combination of these two factors constitutes a major challenge for the electronic readout systems, since it directly affects the total throughput, i.e., the amount of data transmitted per unit time. Hence, all the readout systems will have to provide a higher total bandwidth to cope with the increased data throughput.
The purpose of this paper is to introduce a new readout card, called PIxel detector high Luminosity UPgrade board (πLUP), developed jointly by the University and INFN of Bologna as a proposed readout upgrade system for the ATLAS experiment. It was designed as a natural upgrade of the current ATLAS pixel detector readout chain [2], as will be discussed in Section II. The πLUP is a Peripheral Component Interconnect Express (PCIe) card featuring two field-programmable gate arrays (FPGAs) connected in a master-slave architecture.
Apart from the PCIe interface, the πLUP card features a wide variety of I/O connectors: two Universal Asynchronous Receiver-Transmitter (UART) ports, one 1-Gbps Ethernet port, one 10-Gbps Ethernet port, one small form-factor pluggable (SFP+) connector, and three FPGA mezzanine card (FMC) connectors. This wide choice of I/O interfaces gives the πLUP card great versatility and makes it well suited to act as a general-purpose readout board. In fact, although it was designed to fulfill a specific task, it can be used to interface several types of front-end chips or electronic systems.
The first two prototypes of the πLUP (version 1.0) were produced in 2016. Most of the I/O connectors and the internal functionalities were successfully tested. However, some small patches were required and the shape of the board had to be revised to properly fit one of the FMC connectors. Those revisions led to the fabrication of four new boards (version 1.1) in 2018; the two versions are shown in Fig. 1.
In this paper, a technical overview of the main components of the πLUP board is presented (Section III), together with its possible applications in different projects (Section V) and the results obtained (Section VI).
II. πLUP DESIGN
The πLUP board is a readout board based on two FPGAs connected in a master-slave architecture. Its design was conceived as a natural upgrade of the current ATLAS pixel detector data acquisition (DAQ) system, which is mainly composed of two electronic cards: the Back of Crate (BOC) [3], which is responsible for handling the control interface to the detector and the data from the detector, and the readout driver (ROD) [4], which is responsible for data processing and packaging. The ROD and BOC boards feature fifth generation (Virtex-5 [6]) and sixth generation (Spartan-6 [5]) Xilinx FPGAs. They are connected together through a Versa Module Eurocard (VME) crate and together provide a total bandwidth of 5.12 Gbps. The πLUP board, on the other hand, abandons the VME connector in favor of an eight-lane PCIe bus. By exploiting more recent technologies, it also merges both the I/O and the data processing capabilities in a single board, as shown in Fig. 2, and can provide a total bandwidth of 80 + 32 Gbps. Mirroring the ROD structure, the πLUP features two FPGAs in a master-slave architecture. Both FPGAs are from the Xilinx seventh generation: the master FPGA is a Zynq-7 [7] and the slave is a Kintex-7 [8]. The Zynq-7 includes an embedded dual-core Advanced Reduced Instruction Set Computer (RISC) Machine (ARM) processor, which fulfills the same role as the PowerPC processor embedded in the ROD master FPGA. A board architecture implementing multiple FPGAs connected together by an internal bus proved to be an excellent choice during past ATLAS pixel detector DAQ operations. In fact, the master FPGA is used to program the slave FPGA, which solves many problems related to physical access to the boards and provides greater system control capabilities. Moreover, the embedded processor constantly monitors the status of the FPGAs and of the interfaced front-end modules, taking immediate action in case of misbehavior (e.g., reconfiguring a module that stops responding).
0018-9499 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
The design of the πLUP was greatly influenced by those considerations and its structure replicates the ROD-BOC architecture. Table I summarizes the main differences between the ROD-BOC and the πLUP boards.
III. πLUP BOARD OVERVIEW
The πLUP card is a 16-layer PCIe board capable of interfacing with several other boards or front ends and of processing data at high speed. Fig. 3 shows the main components on the board.
The πLUP features two Xilinx 7 series FPGAs arranged in a master-slave architecture and connected together by a bus, the Kintex-Zynq bus (KZbus), composed of five single-ended and 20 differential lines. A Zynq XC7Z020-1CLG484C, which embeds a physical dual-core ARM Cortex-A9 processor, is the master FPGA and is in charge of controlling the data flow and status of the slave FPGA, a Kintex XC7K325T-2FFG900C. The Kintex device handles all the high-speed I/O communications through 16 internal physical transceivers (GTx) [10] that can run at up to 12.5 Gbps (although they are used at a maximum speed of 10 Gbps). The maximum I/O bandwidth of the board can be calculated as the bandwidth of the eight-lane PCIe Gen 2 bus (32 Gbps) plus the maximum bandwidth of the other eight transceivers (80 Gbps).
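The bandwidth figure quoted above can be reproduced with a few lines of arithmetic. The following is only an illustrative sketch: it assumes the standard PCIe Gen 2 parameters (5 GT/s per lane with 8b/10b encoding, which the text does not state explicitly) and treats the remaining eight GTx transceivers at their 10-Gbps line rate, without subtracting any protocol overhead. All names are hypothetical.

```python
# Illustrative I/O bandwidth budget; constant and function names are hypothetical.
PCIE_GEN2_RATE_GTPS = 5.0      # assumption: PCIe Gen 2 signalling rate per lane
PCIE_GEN2_EFFICIENCY = 8 / 10  # assumption: 8b/10b line encoding
PCIE_LANES = 8                 # from the text: eight-lane PCIe bus
GTX_TOTAL = 16                 # from the text: 16 GTx transceivers
GTX_USED_RATE_GBPS = 10.0      # from the text: maximum speed actually used


def pcie_bandwidth_gbps(lanes=PCIE_LANES):
    """Usable PCIe bandwidth after line encoding, in Gbps."""
    return lanes * PCIE_GEN2_RATE_GTPS * PCIE_GEN2_EFFICIENCY


def other_links_bandwidth_gbps(pcie_lanes=PCIE_LANES):
    """Line-rate bandwidth of the transceivers not used by PCIe, in Gbps."""
    return (GTX_TOTAL - pcie_lanes) * GTX_USED_RATE_GBPS


def total_io_bandwidth_gbps():
    """Sum of the PCIe and non-PCIe contributions, in Gbps."""
    return pcie_bandwidth_gbps() + other_links_bandwidth_gbps()


if __name__ == "__main__":
    print(f"PCIe x8 Gen 2:     {pcie_bandwidth_gbps():.0f} Gbps")
    print(f"Other transceivers: {other_links_bandwidth_gbps():.0f} Gbps")
    print(f"Total:              {total_io_bandwidth_gbps():.0f} Gbps")
```

Note that the two contributions are not strictly comparable: the PCIe term already accounts for line encoding, whereas the transceiver term is a raw line rate.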
A. Clock Distribution
Several clock sources are present on the board. The Zynq-7 FPGA is associated with three main clock sources. A 200-MHz system clock is provided by a SiTime SiT9102, a differential-output programmable oscillator providing 10-ppm frequency stability with subpicosecond phase jitter; a configurable user clock is provided by a Silicon Labs Si570, a low-jitter oscillator that supports frequencies between 10 and 1400 MHz; and the processing system (PS) clock is provided by a 50-ppm 33.33-MHz oscillator. The Kintex-7 device features another 200-MHz SiT9102 system clock and a programmable Si570 user clock, as well as other clock inputs required by the GTx transceivers as reference clocks [10]. Some of those reference clocks are generated on the πLUP itself, while others must be provided from the outside. The two sources provided by the πLUP are a 125-MHz Ethernet reference clock, provided by the combination of a 25-MHz crystal oscillator and an Integrated Device Technology 844021I-01 crystal oscillator interface, and a programmable reference clock, provided by a Silicon Labs Si5326 jitter cleaner. The external reference clocks must be provided through the PCIe connector, the low pin count (LPC) and high pin count (HPC) FMC connectors, or the SubMiniature version A (SMA) connectors. Table II shows how the reference clocks are associated with the different GTx transceivers.
IV. SOFTWARE ARCHITECTURE
The two FPGAs present on the board are intended to be used in a master-slave configuration, with the Zynq, or more precisely its ARM-based PS, controlling any peripheral in the system and acting as the main interface to the user. A diagram of the setup is shown in Fig. 4. Inside the Zynq, the PS communicates with the FPGA through the Advanced Microcontroller Bus Architecture (AMBA) AXI protocol. This channel is extended to the Kintex by the AXI Chip2Chip (C2C) IP core [9] offered by Xilinx. This core transparently bridges a 32-bit AXI bus to the slave device, so that any peripheral present in the Kintex can be addressed from the ARM as if it were directly implemented in the Zynq. The physical interface is quite flexible and can be adapted to a limited pin count; in this case, the communication employs 20 differential lines operating at 200-MHz double data rate (DDR) (9 data bits plus a clock for each direction). On startup, the Chip2Chip core automatically performs a deskewing self-calibration and is then immediately ready to use. In any configuration, the C2C master exposes a single AXI slave port and the C2C slave a single AXI master port, so the bridge is not exactly symmetrical, but this is not a limitation in this design. Four interrupt ports for each direction are also present, and the C2C channel multiplexer assigns them higher priority than AXI data. The Zynq PS runs an embedded Linux distribution generated with the Xilinx PetaLinux tools, providing a high-level interface to any functionality present on the board [including network services such as a Secure Shell (SSH) server]. During boot, the Linux image can be loaded from the on-board flash chip or downloaded from a remote server with the Trivial File Transfer Protocol (TFTP). Generally, most AXI cores offered by Xilinx also ship with a driver, often already supported through the Linux device tree.
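As a rough cross-check of the physical interface described above, the line count and the raw signalling rate of the Chip2Chip link can be derived from the quoted figures. This is a minimal sketch with hypothetical names; it computes raw line rates only and makes no claim about the effective AXI payload rate, which depends on protocol overhead not discussed in the text.

```python
# Raw Chip2Chip link figures derived from the numbers quoted in the text.
# Names are hypothetical; protocol overhead is deliberately ignored.
DATA_LINES_PER_DIRECTION = 9   # from the text: 9 data bits per direction
CLOCK_LINES_PER_DIRECTION = 1  # from the text: one clock per direction
LINK_CLOCK_MHZ = 200.0         # from the text: 200-MHz DDR
TRANSFERS_PER_CLOCK = 2        # double data rate: both clock edges carry data


def differential_lines():
    """Total number of differential lines used for both directions."""
    return 2 * (DATA_LINES_PER_DIRECTION + CLOCK_LINES_PER_DIRECTION)


def raw_rate_per_line_mbps():
    """Raw signalling rate of one data line, in Mbps."""
    return LINK_CLOCK_MHZ * TRANSFERS_PER_CLOCK


def raw_rate_per_direction_mbps():
    """Raw signalling rate summed over the data lines of one direction."""
    return DATA_LINES_PER_DIRECTION * raw_rate_per_line_mbps()


if __name__ == "__main__":
    print(f"Differential lines:     {differential_lines()}")
    print(f"Raw rate per line:      {raw_rate_per_line_mbps():.0f} Mbps")
    print(f"Raw rate per direction: {raw_rate_per_direction_mbps():.0f} Mbps")
```

The line count agrees with the 20 differential lines of the KZbus used for this interface.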
For custom-made cores without an AXI interface, a control interface is offered by an AXI-addressable register block, which is directly accessed from Linux user space using the generic Userspace I/O (UIO) driver. The UIO driver greatly simplifies development, since no custom kernel module is required, and it fits very well with the goal of offering a higher level interface to the functionalities implemented in the FPGA. Other off-chip devices, such as the I2C-programmable Si570 clock generator, the Si5326 phase-locked loop (PLL), and the bus multiplexer, can also be directly controlled from Linux by means of an AXI-based I2C controller. The kernel already includes drivers for the bus multiplexer and the Si570; the former transparently manages the multiplexer, so that the kernel is simply presented with a number of buses that can be directly accessed. In this application, the Si5326 is programmed by custom user-space software that calculates the required values of its internal registers and writes them with a simple file access to the character device representing the muxed bus associated with the device.
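The user-space register access described above can be sketched as follows. This is a hedged illustration, not the software used on the board: the class name, the 4096-byte map size, the register offsets, and the device path are all hypothetical. On a target system the path would be a UIO device node; here a temporary file stands in for it so that the example runs anywhere.

```python
import mmap
import os
import struct
import tempfile

CONTROL_REG = 0x00  # hypothetical register offsets, for illustration only
STATUS_REG = 0x04


class RegisterBlock:
    """Memory-mapped view of a register block exposed as a mappable file."""

    def __init__(self, device_path, map_size=4096):
        # Unbuffered binary read/write access, then a shared mapping of it.
        self._file = open(device_path, "r+b", buffering=0)
        self._mem = mmap.mmap(self._file.fileno(), map_size)

    def read32(self, offset):
        """Return the little-endian 32-bit word stored at `offset`."""
        return struct.unpack_from("<I", self._mem, offset)[0]

    def write32(self, offset, value):
        """Store `value` as a little-endian 32-bit word at `offset`."""
        struct.pack_into("<I", self._mem, offset, value & 0xFFFFFFFF)

    def close(self):
        self._mem.close()
        self._file.close()


if __name__ == "__main__":
    # Stand-in for the device node: a zero-filled 4096-byte temporary file.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(bytes(4096))
        path = handle.name
    regs = RegisterBlock(path)
    regs.write32(CONTROL_REG, 0x1)
    print(hex(regs.read32(CONTROL_REG)), hex(regs.read32(STATUS_REG)))
    regs.close()
    os.remove(path)
```

Real hardware registers may require strictly aligned 32-bit accesses and may have side effects on read, neither of which this sketch guarantees or models.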
V. APPLICATIONS FOR THE πLUP BOARD
As already stated in Section I, the πLUP board was designed to fulfill a specific task, i.e., the readout upgrade for the next-generation ATLAS pixel detector, merging both I/O connections and data processing in a single board. Nevertheless, the πLUP features a wide variety of I/O connectors, including three FMC connectors, which makes the board highly versatile and able to interface with many other electronic devices and front-end chips. The choice of two FPGAs connected in a master-slave configuration provides enough processing capability to perform high-level control operations on the board (Zynq-7 ARM cores) and to handle I/O communications through several different protocols (Kintex-7), while keeping the price relatively low.
The three main possible applications for the πLUP board are as follows.
• Readout Control System: The πLUP can be used to directly interface a front-end device, performing data processing, data transfer to the PC via the PCIe bus, online on-chip histogramming, and system control. The maximum bandwidth in this scenario is limited by the PCIe data transfer rate, i.e., 32 Gb/s for the eight-lane Gen 2 PCIe bus.
• Data Generator/Front-End Emulator: The πLUP can be used to generate/emulate data to be sent to other systems, for example, to validate a reconstruction or data processing algorithm. The maximum bandwidth, in this case, is 80 Gb/s, which is the maximum speed of the eight GTx transceivers not used in the PCIe bus.
• Bridge Between Two Different Systems: The πLUP can be used as a bridge to connect two different readout systems that use different protocols or different communication physical layers. The maximum bandwidth in this scenario is highly influenced by the interfaced systems.
A. Interface With Felix
A first proof of the many possibilities of the πLUP board came from an integration test with Felix boards from the Felix Project [11]. The πLUP was connected through an optical fiber to a mini-Felix card (Xilinx VC709 evaluation board [12]) and to an FLX-712 card [11]. The test showed that the boards were able to establish communication via both the gigabit transceiver (GBT, 4.8 Gb/s) [14] and the custom Felix full mode (9.6 Gb/s) protocols. For both configurations, the πLUP used the channel PLL (CPLL) of the transceivers to recover the clock from the incoming data stream; the clock was then cleaned using the Si5326 jitter cleaner on the board and propagated to the quad PLL (QPLL) for the GTx transmitters, creating a synchronous data acquisition system. Fig. 5 shows the clock distribution of the πLUP board.
Using a Faster Technology FM-S14 FMC HPC mezzanine card [13] (shown in Fig. 6), which provides four additional SFP+ connectors, four link connections were simultaneously established between the πLUP and the Felix cards, resulting in a total throughput of 19.2 Gbps in the GBT configuration and 38.4 Gbps in the full mode configuration. Both configurations were tested for about 1 h and no errors were found, demonstrating the reliability of the connections.
B. Interface With the RD53A Front-End Chip
This section shows an example of a real application of the πLUP board, used in collaboration with the Felix Project to interface the Felix card to the new-generation pixel front-end chip RD53A [15].
The need to use the πLUP board as an interface system arises from the physical and protocol incompatibilities between the Felix card and the RD53A chip. The former communicates via optical fibers through either the 4.8-Gbps GBT or the 9.6-Gbps full mode protocol, while the latter, currently bonded on a custom-designed printed circuit board (PCB) called single chip card (SCC), communicates via Display Port (DP) connectors through a 160-Mbps E-link (input) and a four-lane 1.28-Gbps Aurora 64/66 protocol (output).
The role of the πLUP is hence to act as a bridge between these two systems, handling both the Felix-to-RD53A data path (downlink) and the RD53A-to-Felix path (uplink). This is done through different firmware blocks, as shown in Fig. 7. The GBT_FPGA block decodes the GBT-formatted data from Felix containing the configuration commands for the RD53A chip and also synchronizes to the Felix clock, recovering it from the data stream. Both the configuration commands and the clock are then propagated to the TTC Encoder firmware block, which is in charge of converting the commands to an RD53A-compatible format and of encapsulating them in a single 160-Mbps serial line, connected to one of the data lanes of the DP connector.
Concurrently, the πLUP receives and decodes Aurora 64/66 data from the RD53A chip, coming from the other four data lanes of the DP connector. The data from those four 1.28-Gbps lanes (a total throughput of 5.12 Gbps) are then passed to the Protocol Converter firmware block, which merges them into a single full mode stream that is transmitted to Felix via the optical connection.
Although the πLUP does not include a DP connector in its design, this problem can be solved with FMC cards. In particular, two custom FMC mezzanines were developed to be used as an interface to the RD53A: the single chip FMC (SCF), an HPC FMC mezzanine featuring two DP connectors, and the multichip FMC (MCF), an LPC FMC mezzanine featuring four mini-DP connectors.
The maximum throughput for the πLUP can be obtained by using both the MCF LPC mezzanine and the FM-S14 HPC mezzanine (featuring four SFP+ connectors and shown in Fig. 6). In this configuration, shown in Fig. 8, a Felix card can interface four RD53A chips, resulting in a total throughput of 4 × 5.12 Gbps = 20.48 Gbps.
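The uplink budget described in this section can be checked with a short calculation. The sketch below uses only the line rates quoted in the text; the assumption that each RD53A chip is mapped onto its own full mode link is an illustrative one, and encoding overheads on both the Aurora and full mode sides are ignored. All names are hypothetical.

```python
# Uplink budget for the RD53A-to-Felix path; names are hypothetical.
AURORA_LANE_RATE_GBPS = 1.28    # from the text: per-lane Aurora line rate
AURORA_LANES_PER_CHIP = 4       # from the text: four output lanes per chip
FULL_MODE_LINK_RATE_GBPS = 9.6  # from the text: Felix full mode line rate
CHIPS_PER_BOARD = 4             # from the text: MCF plus FM-S14 configuration


def chip_uplink_gbps():
    """Aggregate output line rate of a single RD53A chip."""
    return AURORA_LANES_PER_CHIP * AURORA_LANE_RATE_GBPS


def board_uplink_gbps(chips=CHIPS_PER_BOARD):
    """Aggregate uplink line rate handled by one board."""
    return chips * chip_uplink_gbps()


def full_mode_link_occupancy():
    """Fraction of one full mode link used by one chip.

    Assumes one chip per full mode link and compares raw line rates only.
    """
    return chip_uplink_gbps() / FULL_MODE_LINK_RATE_GBPS


if __name__ == "__main__":
    print(f"Per chip:  {chip_uplink_gbps():.2f} Gbps")
    print(f"Per board: {board_uplink_gbps():.2f} Gbps")
    print(f"Full mode link occupancy: {full_mode_link_occupancy():.0%}")
```

Under these assumptions a single chip occupies a little over half of the line rate of one full mode link.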
VI. TEST RESULTS
To evaluate and monitor the performance of the GTx transceivers on the πLUP board, the LogiCORE IP Integrated Bit Error Ratio Test (IBERT) core for 7 series FPGAs [17] was used. This IP core generates the eye diagrams and calculates the open area and bit error rate (BER) for different I/O interfaces, which were connected in loopback mode. To test the four transceivers on the FMC HPC connector, a Faster Technology FM-S14 mezzanine card was used. This mezzanine card, shown in Fig. 6, implements four SFP+ connectors and two IDT ICS8N4Q001 programmable reference clocks.
The BER and eye diagram scans were performed at 5 and 10 Gbps; the two speeds were chosen to be slightly higher than the line rates of the protocols foreseen in operation, i.e., GBT (4.8 Gbps) and full mode (9.6 Gbps). The tests were performed using a 31-bit pseudorandom binary sequence (PRBS) and requiring a BER ≤ $10^{-9}$. The BER test was then continued until the measured error rate reached ≤ $10^{-14}$. Fig. 9 shows the eye diagrams of the tests run at 5 Gbps and Fig. 10 shows the results at 10 Gbps; Table III shows the open area results.
As shown in Fig. 10, all the connections showed good results, apart from the SMA one, which has a BER = $10^{-10}$. The transmission errors are most likely due to a malfunction of the SMA connection, since the number of errors was observed to depend on the physical arrangement of the SMA cable.
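To illustrate the principle of a PRBS-based error check, the sketch below generates a pseudorandom bit stream in software and counts the mismatches against a received copy. It assumes that the 31-bit PRBS refers to the standard PRBS-31 pattern with generator polynomial $x^{31} + x^{28} + 1$; the seed, the function names, and the software approach itself are illustrative only, since on the board the generation and checking are performed in hardware by the IBERT core.

```python
def prbs31_bits(count, seed=0x7FFFFFFF):
    """Yield `count` bits of a PRBS-31 sequence (x^31 + x^28 + 1).

    The 31-bit state holds the most recent bits; the seed is a hypothetical
    choice and only has to be nonzero.
    """
    state = seed & 0x7FFFFFFF
    if state == 0:
        raise ValueError("seed must be nonzero")
    for _ in range(count):
        # New bit is the XOR of the bits emitted 31 and 28 steps earlier.
        new_bit = ((state >> 30) ^ (state >> 27)) & 1
        state = ((state << 1) | new_bit) & 0x7FFFFFFF
        yield new_bit


def count_bit_errors(received, reference):
    """Count positions where the received bits differ from the reference."""
    return sum(1 for r, e in zip(received, reference) if r != e)


def bit_error_ratio(received, reference):
    """Ratio of mismatching bits to compared bits."""
    compared = min(len(received), len(reference))
    return count_bit_errors(received, reference) / compared


if __name__ == "__main__":
    reference = list(prbs31_bits(1000))
    received = list(reference)
    received[123] ^= 1  # inject a single bit flip to emulate a link error
    print(count_bit_errors(received, reference), bit_error_ratio(received, reference))
```

A hardware checker works on the same recurrence, typically by synchronizing its own generator to the incoming stream rather than by storing a reference copy.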
A. PCI Express
The workflow to validate and measure the performance of the PCIe Gen 2 bus on the πLUP required the design of custom firmware for the Kintex-7 FPGA and the development of custom Linux drivers allowing read and write operations from and to the RAM on the board, which was plugged into one of the PCIe slots of a Linux PC. The test consisted of using the PCIe bus to write and read the 2-GB DDR3 RAM associated with the Kintex device, measuring BER and speed. The firmware was designed entirely with the Vivado IP Integrator, as shown in Fig. 11. It is composed of the Xilinx direct memory access (DMA)/Bridge Subsystem for PCIe (XDMA), a Memory Interface Generator (MIG), and the other support logic needed to correctly connect these two blocks.
The XDMA is an IP block that implements a high-performance, configurable scatter-gather DMA for use with PCIe Gen2.1 and Gen3.x, and it can also be configured as a bridge between the PCI Express and AXI memory spaces. The master side of this block issues read and write requests on the PCIe bus, and the core enables the user to perform direct memory transfers, both Host to Card (H2C) and Card to Host (C2H). The MIG IP core provides the controller and physical layer for interfacing 7 series FPGAs to DDR3 memory.
The custom drivers required to perform the test were developed for the Ubuntu 16.04 Linux operating system. The test showed a peak user payload rate of 3.5 GB/s when using 2-MB buffers, and a BER ≤ $10^{-14}$, corresponding to 24 TB of data transferred without errors.
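The PCIe figures above can be put in context with a short calculation relating the measured payload rate to the nominal link bandwidth and the transferred data volume to the quoted BER. This is an illustrative sketch: it assumes decimal units (1 GB = $10^9$ bytes, 1 TB = $10^{12}$ bytes), which the text does not specify, and all names are hypothetical.

```python
# Cross-check of the PCIe test figures; names are hypothetical and
# decimal units (1 GB = 1e9 bytes, 1 TB = 1e12 bytes) are assumed.
PCIE_NOMINAL_GBPS = 32.0          # from the text: eight-lane Gen 2 bandwidth
MEASURED_PEAK_GBYTES_PER_S = 3.5  # from the text: peak user payload rate
DATA_TRANSFERRED_TBYTES = 24.0    # from the text: error-free data volume


def link_efficiency():
    """Measured payload rate as a fraction of the nominal link bandwidth."""
    return MEASURED_PEAK_GBYTES_PER_S * 8.0 / PCIE_NOMINAL_GBPS


def bits_transferred():
    """Number of bits contained in the transferred data volume."""
    return DATA_TRANSFERRED_TBYTES * 1e12 * 8.0


def ber_scale():
    """One error in the whole transfer, expressed as a bit error ratio."""
    return 1.0 / bits_transferred()


def minimum_transfer_hours():
    """Time to move the full data volume at the peak rate, in hours."""
    return DATA_TRANSFERRED_TBYTES * 1e12 / (MEASURED_PEAK_GBYTES_PER_S * 1e9) / 3600.0


if __name__ == "__main__":
    print(f"Link efficiency:   {link_efficiency():.1%}")
    print(f"Bits transferred:  {bits_transferred():.2e}")
    print(f"One-error BER:     {ber_scale():.1e}")
    print(f"Minimum test time: {minimum_transfer_hours():.1f} h")
```

With these assumptions, a single error over the whole transfer would correspond to a ratio of about $5 \times 10^{-15}$, consistent with the quoted BER ≤ $10^{-14}$.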
VII. CONCLUSION
This paper presented the πLUP readout board designed in Bologna, focusing on its components and the results achieved. The technological choices and solutions adopted address many of the challenges presented by the upgrade of the LHC. Hence, the πLUP is well suited to be used as an upgrade card for the readout system of one of the main LHC experiments, while at the same time maintaining high flexibility and the potential to be used for several other applications, some of which were discussed in Section V. The results achieved and the relatively low cost make the board an interesting candidate for a high-speed readout system.
