Abstract-TileCal is the central hadronic calorimeter of the ATLAS experiment at the Large Hadron Collider (LHC) at CERN. A major upgrade of the LHC (also called Phase-II) is planned for 2022 in order to increase the instantaneous luminosity. For TileCal, the upgrade involves the redesign of the complete read-out architecture, affecting both the front-end and the back-end electronics. In the new read-out architecture, the front-end electronics will transmit digitized information of the full detector to the back-end system at every bunch crossing. Thus, the back-end system must provide calibrated digital information to the first level of trigger. Having all detector data per bunch crossing in the back-end will increase the precision and granularity of the trigger information, thereby improving the trigger efficiencies. A small part of the detector, 1/256 of the total, will be equipped with the new electronics during 2016 to evaluate the proposed architecture in real conditions in the so-called "demonstrator project". The upgraded version of the Read-out Driver (sROD) will be the core element of the back-end electronics in Phase-II. This module includes two Xilinx Series 7 Field Programmable Gate Arrays (FPGAs) for data receiving and processing and will be installed and operated in an ATCA framework. This paper gives a complete description of the sROD functionality in terms of firmware. In addition, it presents the status of the firmware development, a summary of the main milestones achieved and the future plans.
I. THE ATLAS TILE CALORIMETER

A. Introduction
THE Large Hadron Collider (LHC) [1] is the most powerful particle accelerator ever built. It is located inside a 27 km long circular tunnel in the surroundings of Geneva (Switzerland) at CERN. It is designed to study proton-proton collisions with an energy of up to 14 TeV in the centre of mass. Two proton beams circulate in opposite directions and cross at four interaction points, where the main experiments of the LHC are located: ALICE, CMS, LHCb and ATLAS [2]. ATLAS is a general purpose experiment: a huge cylinder composed of many particle detectors, arranged in layers around the interaction point. The Inner Detector reconstructs the trajectories of the particles produced in the collision and their momentum. The Calorimeters measure the energy of the particles and the position in which this energy is deposited. The Muon Spectrometer reconstructs the position and momentum of the muons escaping from the other subdetectors. The Magnet System bends the trajectories of charged particles as they traverse the magnetic field, so that their momentum can be measured.
B. The Tile Calorimeter Subsystem
TileCal [3], [4] is the central hadronic calorimeter of ATLAS. It is a sampling calorimeter composed of steel as absorber and scintillating plastic tiles as active material. Located between the electromagnetic calorimeter and the muon chambers, it is a hollow cylinder divided into four partitions (EBA, LBA, LBC and EBC), each of them divided azimuthally into 64 wedges (TileCal modules). The active material is arranged in a cell structure that provides the desired granularity in every module. A hadronic particle that crosses a cell produces light in the scintillating plastic tiles, in an amount proportional to the energy deposited by the particle along the detector. The light collected in a cell is conducted through wavelength-shifting fibres to a photomultiplier tube (PMT) that translates it into an electrical signal. TileCal measures the energy and position of hadrons, jets and taus and, jointly with the other calorimeters, provides information on the missing transverse energy (ET).
C. Front-end Electronics
The PMT is the first element in the TileCal read-out chain. The electrical signal delivered by this device is shaped and amplified in the 3-in-1 cards [5] using high and low gains. The output signals of these cards are received in the Digitizer Board [6], equipped with ADCs for the digitization of the high and low gain signals, a TTCrx chip [7], [8] that provides synchronization with the LHC clock, and pipeline memories for the temporary storage of sampled data. An output of the 3-in-1 card is also sent to an Adder Board [5] that performs the analog sum of a tower of cells and transmits this information to the L1Calo Trigger system for the trigger decision. Finally, the Interface Board [9] receives the selected events from the Digitizer Boards, packs them into a specific format, serializes these data using HDMP TX chips [10] (based on the G-link protocol) and sends them to the back-end electronics over optical links at 640 Mbps.
D. Back-end Electronics
The main element of the TileCal back-end electronics is the Read-out Driver board (ROD) [11], a VME 9U standard board operated in a VME crate. Each ROD receives T[...] information of eight consecutive TileCal modules, having an input bandwidth of 5.120 Gbps. An HDMP RX chip [10] de-serializes the information of each input link. The resulting parallel data is routed through staging FPGAs to two mezzanine cards called Processing Units (PU) [12]. These cards host two Texas Instruments TMS320C6414 Digital Signal Processors (DSPs) in charge of computing real-time algorithms (Optimal Filtering Algorithms [13]) for the reconstruction of the energy and time and the computation of a quality factor of these reconstructed magnitudes. Other tasks performed by the DSPs are the synchronization of the data with the trigger information, data compression, ET computation, error detection by Cyclic Redundancy Checking (CRC), event monitoring and busy signal generation. The Output Controller FPGA packs all the information processed by the DSP into the TDAQ Event Data Format [14], which is sent to the Read-out Buffers (ROBs) utilizing S-link cards on the transition modules of the VME crate.
E. The ATLAS TDAQ
The Trigger and Data Acquisition System (TDAQ) [15] is the mechanism used for the event selection [...] storage of the read-out data from ATLAS. The bunch crossings can produce up to thousand[...] particle events per second in the interaction point. [...] decrease the processing throughput and [...] bandwidth, the TDAQ selects those events that [...] from a physics point of view. The event data [...] based on three levels of trigger that define d[...] for the read-out electronics in terms of event [...] methods to do these selections. The three trigger levels are depicted in Fig. [...] Trigger Processor (CTP) gathers Level-1 [...] information from the calorimeters and the [...] every bunch crossing (at a rate of 40 MHz). [...] information the CTP selects only the interesting [...] were temporarily stored in pipeline memories [...] sending a Level-1 Accept (L1A) signal to the [...]
[...] Luminosity LHC (with a [...] TeV) and peak luminosity of [...] upgrades are foreseen in [...] the replacement of the front-end [...] electronics will take place to [...] luminosity conditions. [...] read-out architecture will [...]th high precision digital [...] All detector data will be [...] bunch crossing and stored in [...] trigger event selection will be [...] architecture will deploy [...] optical links, as well as at the [...]anced radiation tolerance. [...] the data links, the change of [...] results in a considerable [...]t bandwidth, as shown by [...]
[...]UT BANDWIDTHS
Upgrade 40 Tbps 4096 10 Gbps
[...] is divided in three phases [...]ed LHC shutdowns. The [...] planned to be executed in [...]ps are foreseen in Phase-0 [...] in the TileCal upgrade [...] demonstrator of the new [...] modules during Phase-0. [...] and digital trigger (hybrid [...] the old and the new trigger [...] consecutive hybrid modules [...]te the new read-out system [...] Finally, in Phase-II, all the [...] with the new architecture, [...] digital.
[...]or Phase-0
[...] developed in order to test the [...] displayed in Fig. 2. The front-end electronics of the demonstrator module is divided into four mini-drawers that can be operated independently in terms of power, configuration and data read-out, providing higher modularity. Each mini-drawer hosts a Main Board that digitizes the analog signals from the 3-in-1 cards, a High Voltage Card that distributes the high voltage needed by the PMTs and a Daughter Board, which collects data from 12 channels and transmits them using serial optical links. In addition, Adder Boards are also present in the mini-drawer to provide the analog trigger signals to the L1Calo. The upgraded version of the ROD in the back-end is called super Read-Out Driver (sROD). The sROD prototype can read out data from four mini-drawers (48 PMTs) and store it in a memory buffer, perform the reconstruction of the sampled signals for each channel and prepare trigger primitives (towers) to be used in the L1Calo. It will also route the configuration and DCS commands and monitoring to the front-end electronics.
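The channel counts and bandwidth figures quoted in the preceding sections can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check, not part of the sROD software. All variable names are illustrative. Two readings are assumptions: that each of the eight modules served by a ROD arrives on one 640 Mbps link, and that the "4096" entry in the upgrade row of the bandwidth table is a number of 10 Gbps links.

```python
# Cross-check of figures quoted in the text. All names are illustrative.

# Present ROD: eight TileCal modules, one 640 Mbps optical link each (assumed).
ROD_INPUT_LINKS = 8
LINK_RATE_MBPS = 640
rod_input_gbps = ROD_INPUT_LINKS * LINK_RATE_MBPS / 1000

# Detector segmentation: four partitions of 64 modules each.
total_modules = 4 * 64
demonstrator_fraction = 1 / total_modules

# Demonstrator module: four mini-drawers, 12 channels per Daughter Board.
pmts_per_module = 4 * 12

# Upgrade table row, read as 4096 links at 10 Gbps each (assumption).
upgrade_total_tbps = 4096 * 10 / 1000

print(f"ROD input bandwidth: {rod_input_gbps} Gbps")
print(f"TileCal modules: {total_modules}")
print(f"PMTs per demonstrator module: {pmts_per_module}")
print(f"Upgrade total bandwidth: {upgrade_total_tbps} Tbps")
```

Under these assumptions the numbers agree with the text: 5.12 Gbps per ROD, 256 modules (so one demonstrator module is 1/256 of the detector), 48 PMTs per module, and 40.96 Tbps, which the table rounds to 40 Tbps.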
III. THE SROD DEMONSTRATOR
The sROD demonstrator board works on an Advanced Telecommunications Computing Architecture (ATCA) framework, the industry standard for modular electronics that has been selected to replace the VME crates present in the previous DAQ systems. The board is an Advanced Mezzanine Card (AMC), with a double mid-size form factor, designed to be plugged into an ATCA carrier blade that provides mechanical support as well as power supply distribution and communication to the backplane or the Rear Transition Module (RTM). As introduced in the previous section, the sROD demonstrator has to perform several functions in the TileCal upgrade demonstrator. Fig. 3 shows a block diagram of the functionality of the board. First, it has to receive data from a complete super-drawer (48 PMTs). The connectivity is achieved using state-of-the-art optical modules (QSFP, MiniPOD and SFP+), which together provide an input and output bandwidth of 290 Gbps. The received data is collected in two Xilinx Series 7 high performance FPGAs [17], equipped with integrated Multi-Gigabit Transceivers (MGTs). One is a Virtex-7 XC7VX485T with 485760 logic cells, 37080 Kb of Block RAM (BRAM) and 8175 Kb of distributed RAM, 48 GTX MGTs for high speed serial communication, 2800 DSP slices for signal processing applications and 350 user IO pins. It is in charge of managing the incoming data from the front-end. The second one is a Kintex-7 XC7K420T with 416960 logic cells, 30060 Kb of BRAM and 5938 Kb of distributed RAM, 28 GTX MGTs, 1680 DSP slices and 380 available user IO pins. This FPGA receives data from the first one in order to perform trigger pre-processing tasks and transmit the resulting information to the L1Calo. Each FPGA is connected to a parallel flash module that stores configuration data, coefficients for the digital signal processing or even a kernel image for a possible embedded system based on the MicroBlaze soft processor.
In addition, both FPGAs are connected to 512 MB DDR RAM memories that allow the storage of large fragments of data for monitoring applications or test purposes. The AMC connector provides communication to the AMC carrier or to a micro-TCA chassis backplane. The sROD prototype also includes a USB to UART converter for serial communication as well as a dedicated Gigabit Ethernet connector. In order to expand the functionality of the prototype, the sROD includes a 400-pin High Pin Count FPGA Mezzanine Card (FMC) connector.
IV. DATA FLOW THROUGH THE DEMONSTRATOR
The Daughter Board is the data concentrator in the front-end. Its main elements are two Kintex-7 FPGAs (one per side) used for a variety of tasks: receiving digitized data from 6 channels of the Main Board, distributing the TTC clock to the Main Board ADCs, transmitting TTC commands to the 3-in-1 cards and the Integrator ADCs, receiving Integrator data from the corresponding FPGAs in the Main Board, sending DCS commands to the HV card, and reading back values of PMT voltages and temperatures. The Daughter Board merges all these kinds of data and establishes high speed optical communication with the back-end through 8 uplinks at 10.24 Gbps (four with data and four for redundancy) and 8 downlinks at 4.8 Gbps.
V. THE SROD DEMONSTRATOR FIRMWARE
This section explains the functionality of all the firmware that has been developed so far, which is deployed on the Virtex-7 XC7VX485T FPGA. This FPGA manages the optical communication with the front-end, extracts the event data from the merged global uplink, controls the reception and distribution of TTC and DCS information, and performs some infrastructure tasks such as communication with the user interface (ATCA framework or standalone computer). The top level of the firmware hierarchy hosts a set of [...] are in charge of specific sets of tasks.
A. The System Module
This block groups all the firmware that is d[...] infrastructure like the sROD connectivity w[...] framework. This firmware is stable and should [...] by sROD users or developers. It is separated [...] modules implementing specific functionalities [...]
1) Clocks
All the clock frequencies needed in the other [...] System Module are generated here. It comprises [...] set of clock buffers to distribute the clock signals [...] controls the reset process of the other System [...] according to the lock status of the PLL.
2) Ethernet
The Ethernet connection is handled by a c[...] the Ethernet PHY on one side and the MAC o[...]
3) IPbus
The IPbus [18] is a UDP/IP-based protocol [...] an easy way to send commands and read back [...] hardware devices using a Wishbone bus [...] protocol is implemented using a set of software [...] FPGA core.
4) G-Link
The Demonstrator has to be compatible [...] architecture. The sROD demonstrator will tra[...] ROD board in the VME crate. This core is [...] emulation of the HDMP 1024 chip [10] present [...] Interface Card that serializes data and tran[...] Mbps.
B. The GBT wrapper
The GBT [19] is a radiation-tolerant high[...] developed at CERN for the transmission of p[...] information and experiments slow control in [...] single 120-bit long frame. The GBTx is a ra[...] performs the functionality of the GBT protocol [...]end electronics. However, in the back-end [...] protocol can be implemented in commercial [...] don't have to be radiation hard. The GBT F[...] released several versions of the code needed [...] GBT protocol targeting the most common [...] market [20].
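The relation between the 120-bit GBT frame, the bunch-crossing rate and the link speeds quoted in this paper can be verified with a short calculation. The sketch below assumes a nominal bunch-crossing rate of exactly 40 MHz (a rounded value) and uses the 40-bit transceiver word width described for the GBT wrapper. The constant names are illustrative.

```python
# Line-rate arithmetic for the GBT links. Constant names are illustrative.

BUNCH_CROSSING_HZ = 40_000_000   # nominal 40 MHz, rounded (assumption)
GBT_FRAME_BITS = 120             # one GBT frame per bunch crossing
GTX_WORD_BITS = 40               # transceiver data interface width
UPLINK_BPS = 10_240_000_000      # 10.24 Gbps uplink quoted in the text

# Downlink: one 120-bit frame every bunch crossing.
downlink_bps = GBT_FRAME_BITS * BUNCH_CROSSING_HZ

# Each frame is split into 40-bit words for the transceiver.
words_per_frame = GBT_FRAME_BITS // GTX_WORD_BITS
downlink_word_rate_hz = downlink_bps // GTX_WORD_BITS

# Uplink: bits available per bunch crossing and 40-bit word rate.
uplink_bits_per_crossing = UPLINK_BPS // BUNCH_CROSSING_HZ
uplink_word_rate_hz = UPLINK_BPS // GTX_WORD_BITS

print(downlink_bps, words_per_frame, downlink_word_rate_hz)
print(uplink_bits_per_crossing, uplink_word_rate_hz)
```

With these inputs the downlink comes out at 4.8 Gbps, carried as three 40-bit words per frame at 120 MHz. The 10.24 Gbps uplink offers 256 bits per bunch crossing, delivered as 40-bit words at 256 MHz.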
The GBT wrapper is the block of the firmware [...] implementation for the Virtex 7 FPGA of se[...] with characteristics adapted to the TileCal [...] bidirectional asymmetric links, [...] Forward Error Correction (FEC) in the downlink direction, and 10.24 Gbps without error correction in the uplink direction. Each link is composed of a GBT Tx block, a GTX transceiver, and a GBT Rx block. The Tx block scrambles the incoming data, encodes it using Reed-Solomon (RS) coding (if the data correction capability is needed), and divides the 120-bit GBT frame into 40-bit streams that fit the data Tx interface of the GTX transceiver. The GTX transceiver serializes, on one side, these 40-bit streams and sends the serial stream at 4.8 Gbps using differential pairs to the QSFP optical module. On the other side, it receives the 10.24 Gbps serial stream from the QSFP, de-serializes it into 40-bit data words and outputs them on the transceiver Rx data interface. Finally, the GBT Rx block aligns the data into word boundaries using a GBT header field, converts the 40-bit words into 120-bit GBT frames, decodes the RS codes (if needed) and descrambles the resulting words.
[...] Tx and Rx blocks
Due to the high line rates of the serial links, the reference clocks used to operate the transceivers and the related logic have strong quality requirements in terms of noise and jitter. Special care has been taken when designing the clock distribution network inside the FPGA, using the proper combination of clock resources (clock buffers, PLLs) and constraining the area of the GBT blocks to the same clock region of the FPGA to avoid excessive delays between critical clock paths.
Since the sROD processes the data of a complete module (four mini-drawers), the GBT wrapper is instantiated four times in the top level module of the firmware hierarchy.
C. sROD Data Processor
This module hosts all the logic that controls the data flow inside the sROD. Again, the functionalities of the individual blocks can be grouped following either the uplink or the downlink. Fig. 6 displays the main components of the sROD Data Processor related to the uplink dataflow.
[...] Data Processor
1) GBT decoder
The GBT frame received on each optical uplink carries event data of the two gains for six consecutive PMTs, an identifier for the proton bunch crossing inside an orbit called Bunch Cross ID (BCID), data from the integrator ADCs, TTC front-end configuration values and DCS voltage and temperature monitoring values. Since the RS encoding is not used in the uplink, a 16-bit CRC field is also included in the GBT frame to introduce error detection capability. All this information is organized in the GBT frame according to a proposed data format that establishes the length and position of a bit-field for every different kind of information to be transmitted within the GBT frame. The GBT decoder retrieves, splits, organizes and outputs every kind of information to its corresponding port by means of several sequential processes which are deployed in this module. Every GBT frame carries data samples of 6 PMTs in one of the two gains. One bit identifies whether the received word contains high or low gain information. The data from the other 6 PMTs of the same mini-drawer are transmitted using a different serial link. In the reception of the read-back of a TTC or DCS value, the information to be read consists of a 16-bit address followed by a 32-bit parameter. In order to read back one value, 3 GBT frames are needed. The module also implements a CRC checker and an error counter for each link.
2) IPbus Slaves
The system has a certain amount of memory that interfaces the user logic blocks. This memory can be accessed through IPbus and it is divided into sets of slave registers, each of them with a dedicated functionality. As an example, there are specific registers for controlling the configuration of the front-end electronics elements, for the configuration of the sROD readout, for sending TTC commands, for the storage of sample data, for the storage of monitoring values and for many other things that need to be accessed from the user side.
3) TTC and DCS Receiver
This module receives the TTC and DCS information bit-field of three consecutive frames from the GBT decoder containing a TTC or DCS command read-back. The first piece of frame indicates the daughterboard address. The second and the third hold the 16 lower and higher bits of the command read-back parameter respectively. The module identifies and splits the information depending on which kind of command is received. The read-back of TTC commands delivers front-end configuration data. In the case of DCS commands it delivers monitoring values of DCS elements. Each piece of the retrieved information is copied into its specific memory region of the IPbus slave registers.
4) Integrator Receiver
This block receives the contents of the bit-field corresponding to the integrator ADC data from the GBT frame. It implements a state machine that collects 5 consecutive words containing the PMT number and a data sample of the corresponding integrator ADC, which is 16 bits long. These values are then copied into the dedicated memory region to be read using IPbus.
5) Readout Module
It receives the event data from the GBT decoder. In a first stage, it extracts the data samples from the GBT frame and separates them according to the PMT number and high or low gain following the scheme of the data format [21]. This operation is performed in the Raw Data Decoder block. Samples of each PMT and gain are stored temporarily in pipeline memories waiting for the Level 1 event selection. There is one pipeline per PMT and gain, with capacity for hosting up to a maximum of 100 events. In a further step, a set of de-randomizer buffers allows the selected events to be read at a constant data rate even if they arrived from the detector in random bursts. The Event Packer receives the samples from the de-randomizer buffers of each PMT and gain and packs them into the G-Link format to be sent to the G-Link module in the system module for the fast read-out using the present system RODs. Apart from the fast readout method, it is possible to access a limited number of consecutive samples by reading a dedicated set of the IPbus slave registers. This kind of readout is slower than the G-Link-based one but is quite versatile for test purposes. Finally, the Readout Module also includes a block with logic for creating data histograms that can be read in a dedicated region of the IPbus memory. Having the possibility of creating histograms is very useful in the development and test phase for both hardware and firmware. An example of this feature has been the study of the channel noise levels or the stability of the linearity. Fig. 7 represents the different sub-modules inside the Readout Module.
Fig. 7. The Readout Module
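The buffering scheme of the Readout Module (a fixed-depth pipeline per PMT and gain, followed by a de-randomizer that smooths out bursts of accepted events) can be illustrated with a small behavioural model. This is a sketch in Python, not the firmware itself. The class and method names are hypothetical, and matching stored events to a Level 1 accept by BCID is an assumption made only to keep the example short, since the text does not state how the match is done.

```python
from collections import deque


class ReadoutChannel:
    """Toy model of the buffering for one PMT and gain (hypothetical names)."""

    def __init__(self, pipeline_depth=100):
        # Fixed-depth pipeline: once full, the oldest event is overwritten.
        self.pipeline = deque(maxlen=pipeline_depth)
        # De-randomizer FIFO holding events selected by the trigger.
        self.derandomizer = deque()

    def store(self, bcid, samples):
        """Store the samples of one event while waiting for the trigger."""
        self.pipeline.append((bcid, samples))

    def level1_accept(self, bcid):
        """Copy the matching event to the de-randomizer, if still stored."""
        for stored_bcid, samples in self.pipeline:
            if stored_bcid == bcid:
                self.derandomizer.append((stored_bcid, samples))
                return True
        return False  # the event already left the pipeline

    def read_one(self):
        """Read one selected event; called at a constant rate downstream."""
        return self.derandomizer.popleft() if self.derandomizer else None


channel = ReadoutChannel(pipeline_depth=100)
for bcid in range(150):
    channel.store(bcid, [bcid] * 4)   # arbitrary sample payload

too_late = channel.level1_accept(10)                         # overwritten
burst = [channel.level1_accept(b) for b in (120, 121, 122)]  # burst of accepts
first_out = channel.read_one()
```

In this run the accept for BCID 10 fails because only the last 100 events (BCIDs 50 to 149) remain in the pipeline, while the burst of three accepts is queued and then read back one event at a time, starting with BCID 120.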
6) GBT encoder
This module is the main block among those with functionalities related to the downlink direction dataflow. The GBT frame in the downlink direction includ[...] field, so the user space within the frame is [...] used for the transmission of TTC comm[...] commands. Both kinds of commands can [...] (configurable to be sent in a particular BCID) [...] (to be sent immediately), as well as L1A and [...] (BCR) [...]
7) TTC Decoder
The TTC decoder is a piece of firmware [...] recovered clock and serial data stream from t[...] It decodes the TTC information and outputs [...] and signals needed for the data acquisition a[...] of the front-end electronics. The BCID is a 12[...] bunch-cross identification within an orbit. It i[...] a counter triggered by every tick of the LHC [...] by a bunch-cross reset signal (BCR). The [...] long word used for the Level 1 Accept (L1[...] within a particular lumi-block. Similarly to [...] L1ID is a counter triggered by the L1A signal [...] event counter reset signal (ECR). The TTC [...] the LHC clock, the L1A, the BCR and the E[...] the TTC network, and computes the BCID an[...] module implements also a way to generate [...] signals and counters when the system is not [...] TTC network. This capability is used for [...] work in standalone mode in the lab. Another [...] performed by the TTC Decoder is to do the f[...]
[...] FIRMWARE DEVELOPMENT
[...]rface a complete TileCal [...] has been developed. The [...] to the old system (ROD [...]y for one mini-drawer, and [...] readout of the four mini-[...] module (48 PMTs). The [...] commands to the front-end [...] and has been tested with [...] examples of operational [...]1 DAC for charge injection calibration tests, set the 3-in-1 switches fo[...] charge injection, set the offsets for the ADC[...] Main board, set the high voltage (HV) value[...] the HV Opto board, read-back the HV and te[...] on the PMTs, etc.
Figs. 9 and 10 show the result of a charge [...] the system for high gain and low gain respec[...] amount of charge is injected in the 3-in[...] emulating a received light pulse on the PM[...] pulse is digitized, collected and formatted [...] board and transmitted to the sROD demonst[...] sROD firmware extracts the event samples [...] channel and gain correspondent to the desir[...] the data samples and sends them to the RO[...] VME crate for further read-out through th[...] system.
Fig. 9. Readout of charge injection [...] a pulse applied [...] amplified using high gain, digitized and readout by the sR[...]
Fig. 10. Readout of charge injection [...] a pulse applied [...] amplified using low gain, digitized and readout by the sR[...]
VIII. FUTURE PLANS
In the near future, the firmware will also deploy signal reconstruction algorithms implemented in FIR filters in a massively parallel fashion. Energy and time reconstruction per cell will be computed for the selected events in the TileCal read-out chain. The [...] implement the [...]econstruction by [...]es, allowing the processing of the information of each channel in parallel, as opposed to the classical DSP sequential processing. Some preliminary tests have already been performed in order to prove the availability of DSP resources for reconstructing 48 channels within the Virtex 7 FPGA. A finite impulse response (FIR) filter has been described using the most common structures (direct and transposed), with configurable parameters such as the filter order (number of coefficients), data word length or coefficient word length. DSP slices are inferred by the synthesis tools when a multiplication is found in the VHDL code. The filter coefficients will not have fixed values. They are channel dependent and can differ after a calibration of the detector. The filter has a mechanism to load and store these coefficients in a memory at the beginning of the run.
The first tests have shown encouraging results; however, they are still very preliminary. Finding the most suitable digital filter to reconstruct the energy and the time in every channel per bunch crossing needs deep study and further work in terms of development and validation. The goal will be the design of the optimal architecture in terms of resources, speed and precision of the reconstructed magnitudes.
The signal processing in the sROD data acquisition is not limited to the energy and time reconstruction. More complex real time algorithms that rely on these two magnitudes also need to be computed for their use in physics analysis. These algorithms include, for example, Muon Tagging (search for particles that crossed the detector within a specific energy range) or Missing ET computation (search for lost transverse energy). In the case of the detector information pre-processing for the L1Calo, some specific algorithms have to be defined, developed and validated as well. This will be the case for the tower energy computation (calculation of the energy within a series of consecutive cells delimited by different detector pseudo-rapidity angles).
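The two FIR structures mentioned above, direct and transposed, compute the same output and differ only in where the registers sit relative to the adders. The sketch below models both in plain Python with integer arithmetic, as a stand-in for fixed-point hardware. It is a behavioural illustration, not the VHDL design. The function names are hypothetical, and the coefficient and sample values are placeholders rather than real TileCal calibration constants.

```python
def fir_direct(samples, coeffs):
    """Direct form: y[n] = sum over k of coeffs[k] * x[n - k]."""
    delay_line = [0] * len(coeffs)   # holds x[n], x[n-1], ...
    out = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]
        out.append(sum(c * t for c, t in zip(coeffs, delay_line)))
    return out


def fir_transposed(samples, coeffs):
    """Transposed form: each input feeds all multipliers at once and the
    partial sums move through the registers towards the output."""
    state = [0] * (len(coeffs) - 1)  # partial-sum registers
    out = []
    for x in samples:
        out.append(coeffs[0] * x + (state[0] if state else 0))
        for k in range(len(state)):
            carry = state[k + 1] if k + 1 < len(state) else 0
            state[k] = coeffs[k + 1] * x + carry
    return out


# Placeholder per-channel coefficients: channel dependent, loaded at run start.
coeffs_by_channel = {ch: [1, 2 + ch % 3, 3, 1] for ch in range(48)}
pulse = [0, 2, 9, 14, 8, 3, 1, 0]    # placeholder digitized samples

direct = {ch: fir_direct(pulse, c) for ch, c in coeffs_by_channel.items()}
transposed = {ch: fir_transposed(pulse, c) for ch, c in coeffs_by_channel.items()}
impulse_response = fir_transposed([1, 0, 0, 0], [1, 2, 3])
```

Both functions return identical sequences for every channel, and feeding a unit impulse returns the coefficients themselves followed by zeros. In hardware the transposed form is often preferred because its adder chain is pipelined by construction. Whether that holds for this design is not stated in the text.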
