Abstract-The current ATLAS Tile Calorimeter read-out system is scheduled for replacement around 2023 due to old age and higher performance needs. The new proposed system is designed to be radiation tolerant, modular, redundant and reconfigurable. To achieve full detector read-out, Kintex-7 FPGAs from Xilinx will be used, in addition to multiple 10 Gb/s optical read-out links. During 2015/2016, a hybrid demonstrator system including the new read-out system will be installed in one slice of the ATLAS Tile Calorimeter to evaluate the new design. This paper describes different firmware strategies along with their integration in the demonstrator in the context of high reliability protection against hardware malfunction and radiation induced errors.
I. INTRODUCTION
THE ATLAS [1] Tile Calorimeter (TileCal) [2] is a hadronic calorimeter based on steel plates and scintillating tiles read out by photomultiplier tubes (PMTs). It is partitioned into four cylindrical sections, each composed of 64 wedge shaped modules. The electronics are located in extractable super-drawers at the base of the wedges. This places them at the outside of the calorimeter, shielded by the calorimeter iron tiles. The super-drawers contain either 45 or 32 PMTs depending on section, with each calorimeter cell read out by two PMTs. In total there are 9852 PMTs, reading out 4926 calorimeter cells. In the current system, the PMT signals are amplified and shaped by analog front-end boards (3-in-1) with two different gains (with gain ratio 64). This makes it possible to cover a dynamic range of 16 bits using two 10 bit ADCs. The signals are digitized and stored in pipelined memories to await a corresponding Level-1 trigger accept before being read out off the detector. The trigger decision is formed from analog signals from the front-end boards, summed to represent tower geometries and transferred to the Level-1 calorimeter trigger in the counting room outside the detector. For calibration purposes, the 3-in-1 front-end boards can also produce realistic charge injection pulses as well as read out the integrated response from a circulating cesium source.
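As a quick check of the dynamic range quoted above: a gain ratio of 64 corresponds to 6 bits, so the two 10 bit ADC ranges together span

```latex
\[
2^{10} \times 64 = 2^{10} \times 2^{6} = 2^{16},
\]
```

i.e., a 16 bit dynamic range.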
By 2023, the LHC accelerator will be upgraded [3] to an instantaneous luminosity up to 5 times the nominal value and an average luminosity of up to 10 times the nominal average. This implies more complex events overlaid with minimum-bias backgrounds, as well as elevated radiation exposure to the electronics. Significant technology advances since the initial TileCal electronics were built now allow much higher readout bandwidths and radiation-tolerant designs using commercial components.
In the upgraded Tile Calorimeter electronics planned for 2023, PMT data for all LHC bunch crossings will be read out to the off-detector electronics [4], allowing the Level-1 trigger selection to make use of cell-by-cell digital processing, in contrast to the current pre-summed analog towers. This will help reduce the effect of minimum bias pile-up on triggering, which is expected to be a considerable problem in high luminosity operations.
To improve reliability, the proposed electronics system [5] will be made more modular with smaller independent modules and improved redundancy. This modular structure means that tasks are distributed and de-centralized resulting in fewer single points of failure. Both hardware and firmware are made with redundancy in mind, e.g., having the circuit boards functionally split lengthwise and triplicating important parts of the firmware as explained in Section II.
In contrast to the current system where each module covers 45 PMTs, the new system will be segmented into modules covering 6 PMTs each, with each calorimeter cell read out by two independent modules. If one module fails the cell will still be read out, although with somewhat worse precision. If both modules fail, only six cells will be lost, corresponding to 0.12% of the detector.
Mechanically, each super-drawer will be replaced by four mini-drawers, with each mini-drawer containing 12 PMTs read out by two modules. Their smaller size will improve access for service or replacement during maintenance periods. Repair and replacement of failed component parts may be performed by extracting the failing mini-drawer (for later repair off-detector) and directly inserting a well proven one, thus limiting radiation exposure to personnel. Two redundant 10 volt power bricks deliver the low voltage power to each module, each with the power capacity to drive both modules.
0018-9499 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
To gain experience with this design, a hybrid demonstrator project (Fig. 1) is being assembled, with the goal of installing a prototype module in a Tile calorimeter slice in 2014 or alternatively during the next possible shutdown period. The demonstrator must be able to operate seamlessly with the present system so that it will not disrupt data taking, so analog tower sums must still be provided to the Level-1 Calorimeter trigger. This consideration has also driven the choice of front-end solution for the demonstrator.
The demonstrator on-detector electronics is housed in the mini-drawers, each containing 12 modified 3-in-1 front-end boards, a main board for digitization, a daughter board for communication and control, and a high voltage power supply. The three latter are designed as two independent identical parts, each servicing six PMTs. There is also an assembly of legacy analog summation boards to provide compatibility with the current Level-1 trigger.
Two modulator-based modules are used for communication to provide radiation tolerance [6] and redundancy, each conveying four 4.8 Gb/s down-links for clock synchronization and configuration, and four 10 Gb/s up-links for data and monitoring. The logic is implemented in Kintex-7 FPGAs [7] with elaborate error mitigation techniques to be able to recover from radiation-induced errors.
II. THE DAUGHTER BOARD
The Daughter Board [8] (Fig. 2), which is responsible for communication with the off-detector electronics, is being developed at Stockholm University. It is physically designed to reside on the main board designed at the University of Chicago, which controls and digitizes data from the front-end 3-in-1 boards. To improve reliability, the system is designed to be inherently redundant, duplicating or triplicating all functions to minimize single point failure modes.
The two halves of the board are designed symmetrically (Fig. 3), with each side serving 6 PMTs as mentioned above.
There is a set of four 4.8 Gb/s down-links, one of which goes to the CERN-developed radiation-hard protocol chip, the GigaBit Transceiver (GBTx) [9]. The GBTx, in turn, connects to the JTAG chain to program the FPGAs. The other three down-links enter GTX transceivers in the FPGAs. Connection with the main board is via a 400 pin high-speed connector.
Two identical Kintex-7 FPGAs are used for the logic functionality in the daughter board. A series of increasingly complex prototypes has been designed and a possible final version is currently being tested. The demonstrator must be thoroughly tested to verify functionality and radiation tolerance before being installed in order to ensure the data integrity. Radiation testing of separate components is currently underway.
III. RADIATION INDUCED ERRORS IN FPGAS
To limit the effect of radiation induced damage as well as component failures, the overall firmware design must be carefully considered. The number of gates needed for a given design is much higher in FPGAs than in an ASIC. An FPGA is therefore more susceptible to radiation damage. Radiation affects electronics cumulatively as well as on a single event basis.
Radiation will affect the components in different ways depending on energy and type of radiation. Passive components are generally unaffected, but dense logic like that of an FPGA is more susceptible. Neutrons can cause displacement damage in semiconductors; however, CMOS devices are comparatively tolerant against such effects.
Ionizing radiation can also displace lattice atoms in the long term, but the main issue is single event upsets (SEUs). Ionized tracks from high energy particles interacting with the semiconductor can flip memory bits or alter the state of state machines.
With the small feature sizes of current IC technologies, the risk of permanent faults is reduced, but the smaller feature sizes also mean that smaller charges are needed to upset the logic states of flip-flops. SEUs are thus more frequent when using smaller feature sizes. However, SEUs can be detected and corrected using firmware activated mechanisms that have been incorporated in the FPGA design.
IV. MITIGATION OF RADIATION INDUCED ERRORS
To a large extent, SEUs can be mitigated in the FPGA firmware, but at the expense of increasing the gate count. SEUs are treated differently depending on whether the errors appear in the configuration memory or in the logic fabric. Errors in the configuration memory can be dealt with using scrubbing or partial reconfiguration.
The scrubber repeatedly reads the entire configuration memory, compares with a checksum and tries to find and correct errors. One bit errors are repaired within 1 ms. Some two-bit errors can also be repaired, specifically errors affecting two adjacent bits. This however takes 30 ms. Uncorrectable multi-bit errors are found within 15 ms but are flagged and halt the scrubber. These errors must be mitigated by full or partial reconfiguration.
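The read, check and repair cycle described above can be illustrated with a small software model. The sketch below is not the scrubber logic used in the Kintex-7; it assumes a generic extended Hamming (SECDED) code per frame, and the frame length and the names `encode`, `scrub_frame` and `scrub` are hypothetical. Unlike the real scrubber, this simple code only detects two-bit errors and cannot repair adjacent pairs.

```python
from functools import reduce
from typing import Dict, List, Optional, Tuple


def syndrome(frame: List[int]) -> int:
    """XOR of the indices of all set bits; 0 for a valid frame."""
    return reduce(lambda acc, i: acc ^ i,
                  (i for i, bit in enumerate(frame) if bit), 0)


def encode(data_bits: List[int], n: int = 16) -> List[int]:
    """Build an n-bit frame (n a power of two). Index 0 holds the overall
    parity, power-of-two indices hold Hamming check bits, the rest data."""
    frame = [0] * n
    it = iter(data_bits)
    for i in range(n):
        if i & (i - 1):              # true only for data positions
            frame[i] = next(it, 0)
    s = syndrome(frame)
    p = 1
    while p < n:                     # set check bits so the syndrome becomes 0
        if s & p:
            frame[p] = 1
        p <<= 1
    frame[0] = sum(frame) & 1        # make the total number of ones even
    return frame


def scrub_frame(frame: List[int]) -> str:
    """Check one frame in place and repair a single-bit error if present."""
    s = syndrome(frame)
    parity_ok = sum(frame) % 2 == 0
    if s == 0 and parity_ok:
        return "clean"
    if not parity_ok:                # odd parity: single-bit error at index s
        frame[s] ^= 1
        return "corrected"
    return "uncorrectable"           # even parity, non-zero syndrome: two bits


def scrub(memory: List[List[int]]) -> Tuple[Dict[str, int], Optional[int]]:
    """One pass over all frames. Halts at the first uncorrectable frame and
    returns its index, mirroring the flag-and-halt behaviour in the text."""
    stats = {"clean": 0, "corrected": 0}
    for index, frame in enumerate(memory):
        status = scrub_frame(frame)
        if status == "uncorrectable":
            return stats, index      # needs full or partial reconfiguration
        stats[status] += 1
    return stats, None


if __name__ == "__main__":
    memory = [encode([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0]) for _ in range(4)]
    memory[1][5] ^= 1                # emulate an SEU in frame 1
    print(scrub(memory))
```

In this model a halted pass corresponds to the case where the real scrubber flags an uncorrectable multi-bit error and reconfiguration is required.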
Single and multiple upsets in the logic fabric, as well as long term damage, can be mitigated by redundant data paths and voting (triple modular redundancy, TMR). This means that either the entire firmware or specific modules are triplicated. They share the same input and the resulting outputs are compared. An SEU can render one such module unreliable, but as the three results are compared, single errors can be found and ignored.
If three such modules are well separated geographically, the likelihood that an MEU will affect more than one module is small, which means that the error will not cause any data failure. During a failure (until overwritten by new data or fixed) the TMR function is inactive, and a new failure in the remaining two modules will not be corrected. This interval should be kept short to minimize the probability of uncorrected errors.
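The dependence on the length of this interval can be made explicit with a simple estimate. The symbols are assumptions introduced here: upsets are taken to arrive independently at a constant rate $\lambda$ per module, and $T$ is the time during which one module is already faulty. The probability that one of the two remaining modules is also upset before the first fault is cleared is then

```latex
\[
P_{\mathrm{uncorrected}} = 1 - e^{-2\lambda T} \approx 2\lambda T
\qquad (\lambda T \ll 1).
\]
```

Under these assumptions the probability scales linearly with $T$, so halving the time a fault is allowed to persist halves the chance of an uncorrected error.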
TMR can be implemented in many different ways, either by using three separate FPGAs, by triplicating the entire firmware, or only specific functions. The chosen method in the Daughter Board is to triplicate important functions within the FPGA. This provides protection against most SEUs but does not require any additional hardware. Only triplicating the most important readout related functions reduces the firmware area needed and simplifies timing optimization.
Although automated TMR tools are available, they are either not cost effective or not applicable to this particular system. The selective TMR approach is best suited to be implemented manually. Fig. 4 illustrates a TMR module where incoming data is split and fed into three identical modules. The three output streams are compared bit-by-bit. Any mismatch is flagged and the remaining two bits are assumed to be correct.
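The bit-by-bit comparison can be modelled in a few lines of software. The following is a minimal Python sketch of a majority voter, not the firmware itself; the function name `tmr_vote` and the 16 bit word width are assumptions made for illustration.

```python
from typing import Tuple


def tmr_vote(a: int, b: int, c: int, width: int = 16) -> Tuple[int, int]:
    """Bitwise majority vote over three redundant words.

    Returns (voted, mismatch), where mismatch has a 1 in every bit position
    in which the three copies did not all agree.
    """
    mask = (1 << width) - 1
    a, b, c = a & mask, b & mask, c & mask
    voted = (a & b) | (a & c) | (b & c)   # each bit takes the majority value
    mismatch = (a ^ b) | (a ^ c)          # any disagreement is flagged
    return voted, mismatch


if __name__ == "__main__":
    good = 0xBEEF
    upset = good ^ (1 << 3)               # emulate an SEU flipping bit 3
    voted, mismatch = tmr_vote(good, upset, good)
    print(hex(voted), bin(mismatch))
```

The voted word stays correct as long as at most one copy is wrong in any given bit position, while the mismatch word indicates which bits were outvoted.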
All important functions of the design will be protected using TMR. Data integrity is ensured with checksums in the transmission links. Local memories are avoided when possible, but will have to be protected by error correction mechanisms. Fast detection, good coverage and effective repairs are crucial to reach the reliability level that is necessary for this project.
V. RADIATION TEST SETUP
Radiation testing has so far focused on SEUs. Accelerated life testing of neutron damage was performed on an older version of the Daughter Board during late summer 2014, along with other components of the demonstrator. The board has, however, not been fully investigated yet.
The SEU tests of the daughter board have been performed at Massachusetts General Hospital (MGH) Francis H. Burr Proton Therapy Center during 2014. Their cyclotron delivers a very versatile proton beam with a wide range in energy and flux.
Normally, the beam is guided to patient rooms but the facility also has a general purpose experiment hall where the beam pipe simply protrudes from a wall (see Fig. 5) and extends into the room.
A cooperating group from Argonne National Laboratory began by testing a set of separate boards, and a Daughter Board was eventually put in front (see Fig. 6). The proton beam was large enough to completely and evenly cover one FPGA.
The FPGA scrubber was running during irradiation and any error or status message was sent via a serial interface and saved to disk on a PC in a nearby room (separated with a concrete maze for neutron protection).
In addition to testing the FPGA scrubber, the internal Gigabit transceivers (GTX) were tested by using a loop-back test. Dummy data was sent to the GTX where it was formatted and folded back in the far-end of the transceiver. The returned data was compared and any mismatch reported. The rate of transmission errors was measured internally on the FPGA as the optical modulator has been tested separately [12].
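The principle of the loop-back comparison can be sketched in software. The snippet below is a Python model only, not the test firmware: the function name `loopback_test`, the 16 bit word width and the `channel` callback standing in for the transceiver path are all assumptions.

```python
import random
from typing import Callable


def loopback_test(n_words: int,
                  channel: Callable[[int], int],
                  width: int = 16,
                  seed: int = 1) -> int:
    """Send pseudo-random dummy words through `channel`, compare each
    returned word with what was sent, and count the mismatching bits."""
    rng = random.Random(seed)
    bit_errors = 0
    for _ in range(n_words):
        sent = rng.getrandbits(width)
        received = channel(sent)
        bit_errors += bin(sent ^ received).count("1")
    return bit_errors


if __name__ == "__main__":
    # An ideal channel returns the data unchanged, so no errors are counted.
    print(loopback_test(1000, lambda word: word))
    # A channel with a stuck-at fault on bit 0 is reported as bit errors.
    print(loopback_test(1000, lambda word: word | 1))
```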
Running at a flux of protons/cm²/s (216 MeV) for approximately one hour resulted in the equivalent of a 100 day run at a nominal luminosity of cm⁻² s⁻¹. The data gathered from the device under test was used to estimate the expected rate of SEUs, the scrubber success rate, and to give a lower limit to the number of transmission errors in the FPGA transceivers.
VI. RADIATION TEST RESULTS
By comparing the proton rate used at MGH with the expected post upgrade rate at LHC [11], the rates of different kinds of errors (scrubber 1 bit, 2 bit, GTX errors) can easily be estimated. The results achieved are compatible with results from tests performed by the ATLAS Liquid Argon collaboration [10].
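The scaling behind this estimate can be written compactly. The symbols are introduced here for illustration: $N_{\mathrm{obs}}$ is the number of errors of a given kind observed during a beam time $t_{\mathrm{test}}$, and $A$ is the acceleration factor, i.e., the ratio of the test flux to the flux expected in the detector. Assuming the error rate scales linearly with flux,

```latex
\[
R_{\mathrm{LHC}} = \frac{N_{\mathrm{obs}}}{A \, t_{\mathrm{test}}},
\qquad
A = \frac{\phi_{\mathrm{test}}}{\phi_{\mathrm{LHC}}}
\approx \frac{100 \times 24\ \mathrm{h}}{1\ \mathrm{h}} = 2400,
\]
```

where the value of $A$ follows from the statement that roughly one hour of beam corresponds to a 100 day run.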
For a given Kintex FPGA in a daughterboard under post upgrade luminosity, we expect 30 one-bit errors per week and one repairable two-bit error per week. Unrecoverable errors will be less frequent, but will nonetheless happen 1-2 times every month. This will in most cases require reconfiguration, amounting to a maximum of one minute of downtime and partial loss of statistics in one mini-drawer. However, with partial reconfiguration and TMR, the consequences of unrecoverable errors should be significantly reduced.
Transmission errors in the high-speed GTX transceivers, caused by upset flip-flops in the design, are expected to happen 5 times per month. However, with redundant links transmitting CRC encoded data, this will most likely not cause any loss of data.
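How a CRC combined with redundant links avoids data loss can be shown with a short Python sketch. The CRC polynomial and frame layout used in the actual links are not given here, so the example assumes the standard CRC-32 from Python's `zlib` module and a 4 byte trailer; the function names are hypothetical.

```python
import zlib
from typing import Optional, Sequence


def frame_with_crc(payload: bytes) -> bytes:
    """Append a 4 byte CRC-32 (an assumed choice) to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes) -> Optional[bytes]:
    """Return the payload if the CRC matches, otherwise None."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received else None


def select_good_copy(copies: Sequence[bytes]) -> Optional[bytes]:
    """Take the first redundant copy whose CRC check passes."""
    for frame in copies:
        payload = check_frame(frame)
        if payload is not None:
            return payload
    return None          # all copies corrupted: the data is lost


if __name__ == "__main__":
    frame = frame_with_crc(b"PMT samples")
    corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]   # one flipped bit
    print(select_good_copy([corrupted, frame]))
```

Data is lost in this model only if every redundant copy of the same frame fails its check.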
A summary of the test results can be seen in Table I. The upset rates were measured by monitoring the scrubber's status interface. Each error gets classified and, if possible, repaired.
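To put the per-FPGA rates into a system-wide perspective, they can be scaled to the roughly 2000 FPGAs mentioned in the summary. The Python sketch below is an illustration only. It assumes that all quoted rates are per FPGA, that all 2000 FPGAs see the same rates, and that a month has 30 days; the names are hypothetical.

```python
HOURS_PER_WEEK = 7 * 24
WEEKS_PER_MONTH = 30 / 7      # assumption: 30 day month
N_FPGAS = 2000                # number of FPGAs quoted in the summary

# Per-FPGA rates quoted in the text, converted to events per week.
per_fpga_per_week = {
    "one-bit (repaired)": 30.0,
    "two-bit (repaired)": 1.0,
    "unrecoverable (low)": 1 / WEEKS_PER_MONTH,
    "unrecoverable (high)": 2 / WEEKS_PER_MONTH,
    "GTX transmission": 5 / WEEKS_PER_MONTH,
}


def system_rate_per_hour(rate_per_fpga_per_week: float,
                         n_fpgas: int = N_FPGAS) -> float:
    """Scale a per-FPGA weekly rate to a system-wide hourly rate."""
    return rate_per_fpga_per_week * n_fpgas / HOURS_PER_WEEK


if __name__ == "__main__":
    for name, rate in per_fpga_per_week.items():
        print(f"{name:22s} {system_rate_per_hour(rate):8.2f} per hour")
```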
VII. SUMMARY
The use of FPGAs in the Tile calorimeter gives us an unprecedented level of flexibility, but it also means that the design is more susceptible to radiation induced SEUs.
With the relatively low local flux after the luminosity increase and with the right mitigation strategies, we expect the reliability to reach an acceptable level even considering the fact that we plan to use 2000 FPGAs.
Preliminary radiation tests show promising results. When using thousands of unmonitored FPGAs, some will cease to function due to SEUs. So while errors do occur, their effects can be mitigated using scrubbing, TMR and checksums. With the appropriate mitigation techniques as outlined above, upset FPGAs can be recovered without any data loss. However, more investigations are needed to reach this point.
