ABSTRACT: This paper describes the latest, full-functionality revision of the high-speed data link board developed for the Phase-2 upgrade of the ATLAS hadronic Tile Calorimeter. The link board design is highly redundant, with digital functionality implemented in two Xilinx Kintex-7 FPGAs and two Molex QSFP+ electro-optic modules whose uplinks run at 10 Gbps. The FPGAs are remotely configured through two radiation-hard CERN GBTx deserialisers, which also provide the LHC-synchronous system clock. The redundant design eliminates virtually all single-point error modes, and a combination of triple-mode redundancy (TMR) with internal and external scrubbing will provide adequate protection against radiation-induced errors. The small portion of the FPGA design that cannot be protected by TMR will nevertheless be the dominant source of radiation-induced errors.
The ATLAS Tile Calorimeter
The ATLAS hadronic Tile Calorimeter (TileCal) [1] is a barrel-shaped detector situated outside of the liquid Argon (LAr) electromagnetic calorimeter (Fig. 1). It is divided longitudinally into four partitions, each comprising 64 wedge-shaped modules, for a total of 256 modules. The active volume consists of interleaved layers of steel and plastic scintillator tiles, which are read out by wavelength-shifting fibers to photomultipliers (PMTs). The PMTs and on-detector electronics are assembled together into modular "drawers" at the base of each calorimeter module. The drawers can be inserted and extracted for maintenance.
The TileCal upgrade
The Phase-II upgrade of ATLAS [2] is currently expected to take place from 2023 to 2025. By this time, ATLAS will have been operational for 15 years. Given that many of the systems were produced more than five years before startup, many of the system components in TileCal will be reaching their end of life. Another motivation for upgrading the TileCal electronics is that the increased luminosity of the upgraded LHC will also mean a proportional increase in minimum-bias collisions per bunch crossing. These higher backgrounds will make it increasingly difficult for the existing trigger system to reduce the readout rate to manageable levels while preserving interesting physics. The upgraded TileCal will read out all data directly to the counting room over high-speed, low-latency links, making it possible to provide higher-resolution tower-sum data to the upgraded Level-1 trigger, with the same calibration as the data read out for offline analysis. Higher luminosities will also expose the electronics to increased radiation levels that can damage and/or upset the system components. Reliability will be a major concern, as will minimizing the extent of data loss if a failure occurs. Maintenance will also become more challenging, as more care will need to be taken during maintenance stops to limit the exposure of personnel to activated detector materials. By increasing redundancy and adopting radiation-tolerant solutions, the number of necessary interventions during the runs can be reduced. We have also decided to partition the electronics and drawer mechanics into smaller, independent units called "mini-drawers", which are one quarter of the size of the current full drawers. This simplifies maintenance by allowing fast replacement of individual mini-drawers in the calorimeter, and allowing the failed electronics to be repaired in dedicated laboratories after they have had time to cool down.
Finally, experience with existing detector systems has shown the value of flexible design. Using FPGA-based electronics will allow the new system to be continually improved without changing the hardware. FPGAs were an option considered in the original system, but the vulnerability to radiation damage was considered too high at the time. Modern FPGAs have much smaller feature sizes, and are therefore much less vulnerable in this respect.
Figure 2. Organization of the current (top) and upgraded (bottom) tile electronics.
The Phase II architecture
The partitioning and organization of the current and upgraded electronics are illustrated in Figure 2. The drawer architecture of the current system comprises two mechanical units that share a common low-voltage power supply (LVPS), as well as a common readout interface board that receives and distributes trigger, commands and clocks from the TTC system, and reads out data via two redundant fiber links. Failure of the LVPS or readout interface board will affect the entire module, comprising 45 or 32 PMTs, corresponding to 0.4% or 0.3% of the Tile Calorimeter. In contrast, complete failure of an entire mini-drawer represents only a 0.1% loss of coverage. Since each calorimeter cell is read out by two PMTs, all main components of the upgraded TileCal mini-drawer electronics, including the digitizing Main Board, its Link Daughter Board and the High Voltage Power Supply, are divided lengthwise, each side serving 6 PMTs corresponding to one side of 6 calorimeter cells. If one side of a board fails, the cells can still be read out by the PMTs serving the opposite side, leading to negligible loss of coverage. Three different concepts are currently being evaluated for the Front-End Boards (FEB) that are located in the PMT blocks. All three provide signal conditioning, charge-injection pulses, and slow integrators for calibration, and two of them also perform digitization. The three FEB versions are: a modified version of the current FEB ("3-in-1") based on discrete components [3], the FATALIC based on a dedicated ASIC [4], and the QIE based on a charge-integrating ASIC [5]. The final choice will be determined by careful evaluations in test benches and test beam campaigns. Each FEB type is supported by a matching Main Board (MB) [3], which merges the data from all FEBs in the mini-drawer and adds additional services not included in the FEB.
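The coverage figures quoted above can be cross-checked with a short calculation. The sketch below is illustrative only: the total of 9900 PMTs is an assumption (taken from the approximate number of individual PMT high-voltage cables mentioned for the off-detector HVPS option), and the function name is hypothetical.

```python
# Assumed total number of PMTs in TileCal (approximate; one HV cable per PMT).
TOTAL_PMTS = 9900


def coverage_loss_percent(pmts_lost, total=TOTAL_PMTS):
    """Percentage of PMT channels lost when `pmts_lost` PMTs stop reading out."""
    return 100.0 * pmts_lost / total


# Failure scenarios described in the text.
losses = {
    "current drawer, 45 PMTs": coverage_loss_percent(45),
    "current drawer, 32 PMTs": coverage_loss_percent(32),
    "mini-drawer, 2 x 6 PMTs": coverage_loss_percent(12),
}

for name, loss in losses.items():
    print(f"{name}: {loss:.2f}% of channels")
```

Under this assumed PMT count, the results agree with the quoted 0.4%, 0.3% and 0.1% to within rounding.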
All MB versions share an identical physical interface with the Link Daughter Board (LDB), comprising a 400-pin FMC connector and a separate power connection. Functional differences between the different FEBs and MBs are accommodated through different firmware versions in the LDB FPGAs. The upgraded LVPS has a redundant three-stage architecture. It receives 200V DC from power supplies in the counting room, and sends 10V DC to the Mini-Drawers. There, point-of-load regulators produce the required voltages on each circuit board. Each half of the MBs and the LDBs receives power from a separate, dedicated 10V brick. If one of the bricks connected with a Mini-Drawer fails, a diode bridge will switch to the other brick. Two different HVPS solutions are being evaluated, both of which use 800V supplies located off-detector in the counting room. One solution uses electronics in the Mini-Drawer to set and supervise the PMT HV levels, and also allows local disabling of individual PMTs. The second solution does the same, but is located in the counting room and supplies the individual PMTs via around 9900 cables. The off-detector readout electronics are called Pre-Processors or sRODs (superRODs) [6]. These combine the functionalities of the current TileCal digitizers, the PreProcessor modules in the current Level-1 calorimeter trigger, and a part of the current ReadOut Driver (ROD) functions. The sRODs send Detector Control (DCS) and run control data via 4.8 Gbps optical links to the LDBs on the mini-drawers. The LDBs send readout and DCS data to the sRODs over optical links running at 9.6 Gbps. Cell energies are directly extracted from the detector data stream and summed into trigger towers, which are then sent to the calorimeter hardware trigger (L0Calo in Phase-2), while the full data is temporarily stored in pipeline memories. Upon a Level-0 accept, the pipelined data are read out.
The Link Daughter Boards
The Link Daughter Board [7] is the communications node of the Mini-Drawer. The LDB has undergone several revisions, adapting to evolving specifications and component choices. The latest version is revision 4, which fully satisfies all functional and performance requirements for the TileCal upgrade. That said, the ambition level has increased over the course of multiple LDB prototypes, and we anticipate the need for a further iteration using newer FPGAs with improved radiation tolerance. The first LDB was based on the Virtex-6 FPGA family, but Kintex-7 was chosen for the following prototypes once it became available. Alternative link technologies have been evaluated as well, where the two major choices were Small Form-factor Pluggable (SFP) based or Avago ribbon fiber (12x) PPOD based links. The first LDB revision used two SFPs for the downlinks and a PPOD for the uplink, but in the second revision this was changed to one QSFP+ and a PPOD to allow comparison between the two technologies. From the third version onward the PPOD was dropped and two QSFPs were used, giving a symmetric solution. The latest version (revision 4) fixes some problems found in version 3 and was redesigned to better serve all three FEB types, with further reduced power supply noise, increased current and voltage monitoring capabilities and a revised clock distribution network. This also required redefining the FMC connector and introducing a dedicated power connector.
Both QSFPs are connected to both FPGAs (Fig. 3), and read out data to the counting room via ~70 m fibers. There, a patch panel connects one of the two QSFP outputs to the preprocessor. Thus two of the four downlinks are connected with the FPGA on the same side, one directly, the other via a CERN-developed radiation-tolerant GBTX protocol chip [8]. The GBTX is responsible for extracting the system clock and for transmitting configuration information to the FPGAs via JTAG. Detector Control System (DCS) and run control data are transmitted via the other channel using the GBT protocol. The other two downlinks are connected with the FPGA on the opposite side. The four uplinks transmit information at 9.6 Gbps, which is within an excluded timing range for the Kintex-7 transceivers. Inserting dummy words can bring the rate up to 10.24 Gbps, which is inside the allowed range. For redundancy, two identical CRC-protected data streams are transmitted from each FPGA. We plan to use a modulator based on a 4x10 Gbps radiation-tolerant QSFP+ module from Molex with a Bit Error Rate of around $10^{-18}$. An issue with this solution is that the PIC processor in charge of initialization is not sufficiently radiation tolerant. There are several ways to solve this [9]. A Flash memory is included for fast reconfiguration of each FPGA. Normally the FPGAs are configured via these on-board Flash memories after receiving a remote reset. A second alternative is to configure remotely from the sROD (either completely or partially) via the downlink. If corrupted, the Flash memory can also be remotely rewritten via the optical downlinks. If a temporary or permanent link failure occurs, this will be detected by the CRC error checking, and the system will rely on the other stream. Simultaneous CRC errors on both links will be interpreted as a corruption of the configuration of the corresponding FPGA.
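The fallback between the two duplicated, CRC-protected streams can be sketched as follows. This is a minimal illustration rather than the LDB or sROD firmware logic: the frame layout (payload followed by a 32-bit CRC), the use of CRC-32 from Python's `zlib`, and all function names are assumptions made for the example.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 32-bit CRC to the payload (assumed frame layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Return the payload if the CRC matches, otherwise None."""
    if len(frame) < 4:
        return None
    payload, crc = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == crc:
        return payload
    return None


def select_stream(frame_a: bytes, frame_b: bytes):
    """Use the first stream with a valid CRC.

    Returns (stream_name, payload). If both streams fail the check, returns
    ("NONE", None), which the receiver would treat as a likely configuration
    corruption of the sending FPGA.
    """
    for name, frame in (("A", frame_a), ("B", frame_b)):
        payload = check_frame(frame)
        if payload is not None:
            return name, payload
    return "NONE", None


def flip_first_bit(frame: bytes) -> bytes:
    """Simulate a single-bit transmission error in a frame."""
    corrupted = bytearray(frame)
    corrupted[0] ^= 0x01
    return bytes(corrupted)
```

With one stream corrupted, `select_stream` falls back to the other; with both corrupted it reports that no valid stream remains, which corresponds to the case where reconfiguration of the FPGA would be requested.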
Finally, as mentioned above, if one side of a mini-drawer fails completely, the opposite side can still read out all of the cells.
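Returning to the uplink rate discussed earlier in this section, the amount of padding needed to move from 9.6 Gbps to 10.24 Gbps follows from the ratio of the two rates. The sketch below assumes that dummy words have the same width as payload words; the variable names are illustrative.

```python
from fractions import Fraction

# Rates quoted in the text, kept as exact fractions to avoid rounding.
PAYLOAD_RATE_GBPS = Fraction(96, 10)    # 9.6 Gbps of useful data
LINE_RATE_GBPS = Fraction(1024, 100)    # 10.24 Gbps on the fiber

# Line-rate increase factor and the share of transmitted words that are dummies.
rate_ratio = LINE_RATE_GBPS / PAYLOAD_RATE_GBPS
dummy_fraction = 1 - PAYLOAD_RATE_GBPS / LINE_RATE_GBPS

# Number of payload words sent for each dummy word.
payload_words_per_dummy = (1 - dummy_fraction) / dummy_fraction

print(f"line/payload ratio: {rate_ratio}")
print(f"dummy share of transmitted words: {dummy_fraction}")
print(f"payload words per dummy word: {payload_words_per_dummy}")
```

Under the equal-width assumption, the ratio is exactly 16/15, so one dummy word per 15 payload words is sufficient.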
To correct Single Event Upsets (SEUs) and Multi Event Upsets (MEUs) we use the internal scrubbing process, in which the configuration memory is continuously scanned for single-bit and adjacent double-bit errors. In addition, a specially designed IP block handles most of the remaining double-bit errors. Multi-bit errors can also be corrected by partial or full remote reconfiguration. Finally, we envision a continuous supervision of the content of the Flash memory, where the content is rewritten if errors are encountered.
PCB
The LDB version 4 stack-up was increased by two layers to a 16 layer PCB. All six signal layers are adjacent to ground planes on both sides, and the high-speed signals are routed in the inner layers. The power planes are grouped together in the middle of the stack.
The power distribution network eliminates problems with power drops. To decrease the jitter of the FPGA internal clock nets to the lowest possible value, additional power supply filtering was implemented down to an almost analog level. This was especially relevant for the 1V core supply to the transceivers. Much effort was spent on routing related signals to the same I/O banks. This will improve timing compared to LDB version 3, and reduce possible alignment problems.
The Upgrade Demonstrator
The final TileCal upgrade will include only digital readout of calorimeter data to the sRODs. However, it has proved feasible to produce a so-called hybrid Upgrade Demonstrator (Fig. 4) that can operate in the current system while also allowing the upgraded functionality to be tested. To do so we used a modified 3-in-1 FEB that can deliver an analog trigger signal as well as the digital output used by the new system. We retained the trigger summation boards from the present system that sum the analog outputs on the detector into tower sums for analog transmission to the Level-1 calorimeter trigger. The demonstrator preprocessor interprets commands received from the TTC system, and translates them into a form that is usable by the Mini-Drawers. It also performs the inverse process on the output data, converting the data into a form acceptable to the current RODs. A special processor is required to interpret the DCS CAN-bus data. The full compatibility of the demonstrator with the present system has been verified using MobiDICK [11], the standard drawer verification test bench. We aim to install a hybrid demonstrator comprising four mini-drawers in one of the ATLAS TileCal modules over the 2016 Christmas shutdown. This will allow us to test the electronics in real beam conditions, and hopefully find and eliminate problems that might otherwise be overlooked. Four full demonstrator units with version 3 LDBs are currently operating stably in test bench and test beam studies. We also have LDB+MB combinations in Chicago, Stockholm, Valencia and Clermont-Ferrand for firmware development. These setups are to be upgraded with v4 LDBs, depending on GBTx availability.
Effects of radiation
A moderate radiation level is expected at the radius of the TileCal electronics drawers. By the end of the current run, less than 0.2 Grays per year and $10^{10}$ protons per square cm per year are anticipated at the most exposed position (near the crack) (Fig. 5). Given the small feature sizes in state-of-the-art FPGAs, permanent errors are scarce at these radiation levels. On the other hand, single-event upsets (SEUs) are rather common [10]. In each FPGA, roughly 3 single-bit SEU errors per day and one double-bit error per three days are expected. A preliminary radiation test at MGH in Boston gave an estimate of 10 SEUs per day. New radiation studies are planned with LDB v4. To mitigate the effect of SEUs, we will use Triple Mode Redundancy (TMR) wherever possible. This corrects single errors immediately, while persistent errors cause the TMR functionality to be disabled until the error has been corrected. This time must be minimized. The duplicated links are protected by CRC (uplink) and the GBT protocol (downlink), making it easy to identify which link has failed. The probability that both links randomly fail at the same time is small. Most data errors are of short duration (less than one s). Single-bit configuration errors are fixed by "scrubbing" in about 30 ms. Double-bit errors take somewhat longer. Complex errors will require full or partial reconfiguration, which may take several seconds.
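The way TMR masks a single upset can be illustrated with a bitwise majority voter. This is a generic sketch of the technique, not the voter used in the LDB firmware; the function names are hypothetical and the three integers stand in for the three redundant copies of a register.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three redundant copies."""
    return (a & b) | (a & c) | (b & c)


def vote_and_flag(a: int, b: int, c: int):
    """Return the voted value and a flag per copy that disagrees with it.

    A set flag marks a copy that needs repair (for example by scrubbing)
    before a second upset can hit the same bit in another copy.
    """
    voted = majority_vote(a, b, c)
    return voted, (a != voted, b != voted, c != voted)


original = 0b1010
upset_copy = original ^ 0b0100  # single-bit upset in one copy

# One corrupted copy is outvoted by the two good ones.
voted, flags = vote_and_flag(original, original, upset_copy)
print(bin(voted), flags)
```

The voter fails only when the same bit is wrong in two copies at once, which is why the time an uncorrected error persists must be kept short.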
It is easy to estimate the number of TMR failures due to SEUs in a 2000-FPGA system: approximately $14\,\varepsilon / n_{\mathrm{mod}}$ per year, where $\varepsilon$, the fraction of TMR-protected circuits times the fraction of errors in this area that affect function, is much less than 1. $n_{\mathrm{mod}}$, the number of TMR modules, should be large, as we wish to use fine-grained TMR. Thus the probability of this type of error should be very small. A small part of the fabric cannot be TMR or CRC protected. The probability of an error in these regions is very low since the area is very small. However, since we don't have the beneficial effect of TMR (squaring the rate and multiplying by a small time window), it can easily be comparable to or exceed the error rates in the TMR regions.
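The scaling behind this argument (rate squared, times a short time window, divided by the number of independently voted modules) can be explored with a back-of-the-envelope model. The sketch below is an assumption-laden illustration and not the derivation behind the figure quoted above: the formula, the function name and the parameters `epsilon` and `n_mod` are hypothetical, and the example inputs (2000 FPGAs, 3 SEUs per day, 30 ms repair window) are simply borrowed from values mentioned in the text.

```python
SECONDS_PER_DAY = 86400.0
DAYS_PER_YEAR = 365.0


def tmr_failures_per_year(n_fpga, seu_per_day, window_s, epsilon, n_mod):
    """Rough yearly count of TMR failures in a system of `n_fpga` FPGAs.

    Assumed model: a failure needs a second upset inside the repair window
    of the first one, in the same one of `n_mod` voted modules; `epsilon`
    is the fraction of such coincidences that actually affect function.
    """
    rate_per_s = seu_per_day / SECONDS_PER_DAY
    # Probability that a second upset arrives before the first is repaired.
    p_second_in_window = rate_per_s * window_s
    per_fpga_per_day = seu_per_day * p_second_in_window * epsilon / n_mod
    return n_fpga * per_fpga_per_day * DAYS_PER_YEAR


# Example with illustrative values for epsilon and n_mod.
example = tmr_failures_per_year(
    n_fpga=2000, seu_per_day=3.0, window_s=0.030, epsilon=0.1, n_mod=100
)
print(f"estimated TMR failures per year: {example:.2e}")
```

In this model, doubling the number of modules halves the failure rate, while doubling the SEU rate quadruples it, which is the qualitative behaviour described in the text. An unprotected region, by contrast, scales only linearly with the SEU rate.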
Conclusion
One of the largest concerns has been to achieve sufficient reliability, which we have reached through partitioning, redundancy, and link duplication. This was found to be a reasonable strategy also considering the cost. The estimated rate of single- and double-bit errors in TMR or CRC protected parts is expected to be small. However, the small part that cannot be covered by TMR or CRC is a concern and must be minimized. Due to the board redundancy, the calorimeter cell will still be read out even if errors block an FPGA path.
Transient errors in both FPGAs are very unlikely, unless there are correlated events such as a hit in both devices by the same particle shower. Four generations of link daughter boards have been designed to study evolving design criteria. Even though the latest revision addresses all concerns and performance requirements raised so far, we anticipate a final version based on Kintex UltraScale+ FPGAs, which promise greatly improved SEU resistance.
