Present-day FPGAs contain dedicated hardware DSP blocks, giving the designers the possibility to implement these algorithms in a very efficient way while exploiting the flexibility that these devices provide.
A prototype of an FPGA-based Processing Unit mezzanine card has been developed to study the implementation of the Optimal Filtering algorithm in a low-cost FPGA. This prototype has been designed to be fully compatible with the present system, but it is also suitable for studying the possibility of providing extended functionalities.
I. INTRODUCTION TO THE ATLAS TILE CALORIMETER
THE Large Hadron Collider (LHC), built at the CERN facilities in Geneva (Switzerland), is the most powerful particle accelerator in the world and is designed to handle proton beam collisions at 14 TeV in the center of mass. It comprises four experiments: ALICE, CMS, LHCb and ATLAS [1], the last one being a general-purpose particle detector composed of many subsystems. The ATLAS hadronic Tile Calorimeter (TileCal) [2] is one of these sub-detectors and is designed to measure the energy carried by the particles produced in the collisions.
A. ATLAS TDAQ
The Trigger and Data Acquisition System (TDAQ) is the software that manages the hardware resources for the read-out and trigger of the detector. It operates on an event-by-event basis and is defined in terms of three domains in the dataflow, called levels of trigger.
978-1-4673-0120-6/11/$26.00 ©2011 IEEE
B. Front-End Electronics
TileCal is based on scintillating plastic tiles arranged in cells, which produce light when particles interact with them. The light produced in a cell is converted into an electrical pulse by a photomultiplier tube (PMT); this pulse is first digitized and then read out in subsequent steps, forming a TileCal read-out channel. The detector geometry is a hollow cylinder divided into four barrel partitions (EBA, LBA, LBC and EBC), each containing 64 modules arranged azimuthally around the interaction point, and each module can hold up to 48 PMTs. The read-out of the whole calorimeter consists of 9856 channels that are serialized and transmitted at 680 Mbps over optical fibers to the Back-End electronics, which is not radiation tolerant and is placed in a separate cavern.
The Read-Out Driver (ROD) [3] is a key element of the TileCal data acquisition chain. Here, data coming from eight Front-End links are collected, de-serialized and routed to two Processing Unit (PU) mezzanine cards, where the first processing algorithms of the TDAQ dataflow are applied in commercial DSPs. The maximum rate for these algorithms is defined by the Level 1 trigger rate (100 kHz). Processed data are sent to the Output Controller FPGAs, which package them into the ROD data fragment. These data fragments are then serialized and transmitted to the Read-Out System (ROS), located at the Level 2 trigger.
One of the algorithms that run on the ROD PU DSPs is the Optimal Filtering (OF). It is a reconstruction method that extracts a set of parameters describing the digitized front-end pulses: the amplitude A, the phase T, the pedestal p, and the quality factor QF. All of them are obtained by applying linear combinations to the samples of the related pulse.
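As an illustration of the linear combinations used by the OF, the following Python sketch reconstructs the four parameters from a toy five-sample pulse. The pulse shape G, its derivative DG, the three weight sets, the sign convention of the phase term, and the definition of the quality factor as a sum of absolute residuals are all assumptions made for this example; they are not the weights or the number of samples used in the ROD DSPs.

```python
# Toy Optimal Filtering sketch. All numeric values below are hypothetical.
# Assumed pulse model: S[i] ~ A * G[i] - A * T * DG[i] + p

G = [0.0, 0.5, 1.0, 0.5, 0.0]       # assumed normalized pulse shape
DG = [0.5, 0.5, 0.0, -0.5, -0.5]    # assumed derivative of the pulse shape

# Weight sets chosen so that each one responds to a single term of the model.
A_W = [-4 / 7, 1 / 7, 6 / 7, 1 / 7, -4 / 7]   # amplitude weights
B_W = [-0.5, -0.5, 0.0, 0.5, 0.5]             # amplitude * phase weights
C_W = [3 / 7, 1 / 7, -1 / 7, 1 / 7, 3 / 7]    # pedestal weights


def dot(weights, samples):
    """Linear combination of the samples with the given weights."""
    return sum(w * s for w, s in zip(weights, samples))


def optimal_filter(samples):
    """Return (amplitude, phase, pedestal, quality factor) for one pulse."""
    amplitude = dot(A_W, samples)
    amp_phase = dot(B_W, samples)
    pedestal = dot(C_W, samples)
    # The phase is recovered by dividing out the amplitude (guard against 0).
    phase = amp_phase / amplitude if amplitude != 0 else 0.0
    # Assumed quality factor: sum of absolute residuals to the pulse model.
    qf = sum(abs(s - (amplitude * g - amp_phase * dg + pedestal))
             for s, g, dg in zip(samples, G, DG))
    return amplitude, phase, pedestal, qf


if __name__ == "__main__":
    # Pulse built from the model with A = 100, T = 0.1 and p = 50.
    print(optimal_filter([45.0, 95.0, 150.0, 105.0, 55.0]))
```

Each parameter costs one multiply-accumulate per sample, which is the operation that dedicated DSP hardware is designed to perform.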
C. Target of the Work
Present-day FPGAs make it possible to implement signal processing algorithms in dedicated DSP hardware blocks, while keeping the flexibility these devices offer to upgrade the system with new functionalities by rewriting the firmware.
The main motivation of this work is to study the possible benefits of using low-cost FPGAs instead of commercial DSPs to execute the OF algorithm. These benefits could be observed in the following areas:
1. Higher Speed: The FPGA contains multiple DSP blocks, which could permit massively parallel processing, in contrast to the sequential execution of a classic DSP.
2. Higher System Integration: Most elements of the ROD DSP PU can be implemented in a single FPGA, reducing the number of components on the board. Besides, the implementation in programmable logic provides flexibility and versatility to the PU.
3. Present System Enhancement: Many FPGA resources beyond programmable logic, such as embedded soft or hard processors, on-chip memory and clock management units, could be exploited to extend the performance of the present system. All these resources help to support System on Programmable Chip (SOPC) architectures.
In order to perform these studies, an FPGA-based Processing Unit mezzanine card prototype has been developed: the FPGA PU v1. It has been designed to be fully compatible with the ROD, but it is also suitable for providing extended functionalities to the PU board.
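The speed argument in the first item above can be made concrete with a rough cycle count. The sketch below compares the time needed for the multiply-accumulate (MAC) operations of one event against the time budget set by the 100 kHz Level 1 rate. The channel count, number of samples, number of weight sets, clock frequency and number of parallel MAC units are hypothetical values chosen only for illustration, and the model of one MAC per unit per clock cycle is a simplification; none of them describes the actual ROD hardware.

```python
import math

L1_RATE_HZ = 100_000  # maximum Level 1 trigger rate quoted in the text


def event_budget_s(rate_hz=L1_RATE_HZ):
    """Time available to process one event at the given trigger rate."""
    return 1.0 / rate_hz


def mac_cycles(channels, samples, outputs, mac_units):
    """Clock cycles for all MACs, assuming one MAC per unit per cycle."""
    total_macs = channels * samples * outputs
    return math.ceil(total_macs / mac_units)


def fits_budget(cycles, clock_hz, rate_hz=L1_RATE_HZ):
    """True if the cycle count fits in the per-event time budget."""
    return cycles / clock_hz <= event_budget_s(rate_hz)


if __name__ == "__main__":
    # Hypothetical workload: 96 channels, 7 samples, 3 weight sets, 100 MHz.
    for units in (1, 8):
        cycles = mac_cycles(96, 7, 3, units)
        print(units, cycles, fits_budget(cycles, 100e6))
```

Under these assumed numbers, a single sequential MAC unit would exceed the 10 µs budget, while eight units working in parallel would fit comfortably.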
III. HARDWARE DEVELOPMENT

A. Functional Description
Data for the OF algorithm enter through the mezzanine connectors A and B. The algorithm processing takes place in the FPGA, a 484-pin Altera Cyclone III. Finally, the output of this processing task is sent through the mezzanine connector C back to the ROD. The FPGA supports different configuration schemes: JTAG configures the Cyclone III directly, while Active Serial (AS) uses the data stored in a Flash memory (EPCS) to configure the device at every system power-up. The prototype hosts a set of user interfaces (switches, pushbuttons, LEDs, a 10-pin header, etc.) for debugging purposes. In order to study possible extended functionalities of the system, a 128 Mbit SDRAM has also been placed on the PCB. This feature, combined with the embedded soft processor IP cores available for the Cyclone III, will extend the SOPC capabilities supported by the FPGA.
B. PCB Details
The PCB has been manufactured to a class 6 standard, which defines a minimum etch width and minimum trace spacing of 125 µm, a minimum via diameter of 200 µm, and a via-to-etch spacing of 250 µm. The system interconnection has been distributed in a 10-layer stack-up that contains TOP and BOTTOM, two power planes, two ground planes and six internal signal layers. The spacing between etch traces has been chosen so that normal lines are separated by a distance of 2w center to center, w being the trace width. For the most critical lines in the design (clock signals), extra spacing to adjacent traces (3w center to center) has been added in order to reduce the crosstalk on these signals.
A Soft-Start (SS) circuit is also needed in order to soften the fast rise times intrinsic to some DC-DC converters, like the linear regulators. The high slew rate of these converters may introduce some non-monotonicity in the voltage supplies, due to overshoot and undershoot spikes, which may cause problems in the FPGA start-up process. The VCCINT and VCCD_PLL planes must be isolated by means of a ferrite bead and a decoupling capacitor, to prevent switching noise from the digital components from entering the PLL analog circuitry. Besides, the power planes have been split geographically into islands that bring the different voltage references to where they are required on the board.
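The trace spacing rules described in this subsection (2w and 3w center to center) can be checked against the class 6 minimum spacing with a few lines of arithmetic. The sketch below assumes that adjacent traces have the same width w, so that the copper-to-copper gap equals the center-to-center pitch minus w; the function names are hypothetical.

```python
MIN_WIDTH_UM = 125.0    # class 6 minimum etch width, from the text
MIN_SPACING_UM = 125.0  # class 6 minimum trace spacing, from the text


def pitch_um(width_um, factor):
    """Center-to-center distance for a rule of `factor` times the width."""
    return factor * width_um


def edge_gap_um(width_um, factor):
    """Copper-to-copper gap between two equal-width traces at that pitch."""
    return pitch_um(width_um, factor) - width_um


def meets_class6(width_um, factor):
    """True if both the trace width and the resulting gap respect class 6."""
    return (width_um >= MIN_WIDTH_UM
            and edge_gap_um(width_um, factor) >= MIN_SPACING_UM)


if __name__ == "__main__":
    for name, factor in (("normal (2w)", 2), ("clock (3w)", 3)):
        print(name, pitch_um(125.0, factor), edge_gap_um(125.0, factor),
              meets_class6(125.0, factor))
```

For a trace of minimum width, the 2w rule leaves a gap of exactly 125 µm, which is the class 6 limit, while the 3w rule applied to the clock lines doubles that gap to 250 µm.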
D. PCB Routing
The Cyclone III comes in a 484-pin BGA package, which has introduced a certain amount of complexity into the routing process.
A total of 280 nets have been spread over six layers using 630 etch traces and roughly 1030 vias, in order to reach the BGA pin distribution of the Cyclone III comfortably. Fig. 12 shows an overview of the layout and routing of the PCB, with the components placed on the top layer shown in green and those located on the bottom layer shown in red.
IV. FIRMWARE DEVELOPMENT
A. Phase 1: Design Validation Tests
The first versions of the firmware have been developed to check the proper standalone operation of all the basic elements placed on the board. These tests have been performed successfully, showing that the FPGA starts up and is configured correctly. Besides, the configuration buttons and all the user interfaces work as expected. The SDRAM memory chip has not been tested yet, since its use is planned for Phase 3 of the firmware development, in the near future.
Phase 2 is the ongoing phase of the firmware development work. Here, the firmware is being developed for the operation of the FPGA PU v1 as a ROD daughtercard. The target is to be fully compatible with the present system. The functionalities of the different parts that compose the DSP PU have to be implemented as HDL blocks in the Cyclone III in order to achieve the same performance on the prototype as on the DSP PU. The first firmware versions developed so far in this phase permit successful VME access to the FPGA PU v1 from the backplane of the ROD crate. The next milestone in this phase is to work in the so-called copy mode, where raw data are organized in the ROD data format and passed to the ROS without applying any reconstruction algorithm to them. The last step in this phase will be to implement the OF algorithm on the dedicated hardware resources that the FPGA provides for DSP system design. The manufacturer provides a tool that interfaces MathWorks Simulink, an environment for simulation and model-based design of dynamic and embedded systems, with Quartus II, the synthesis tool from the FPGA manufacturer. This tool eases the development of the HDL code necessary for implementing DSP algorithms in the FPGA.
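Hardware multipliers operate on integer operands, so the OF weights would have to be quantized before being mapped onto the FPGA DSP resources. The following sketch emulates such a fixed-point multiply-accumulate in Python. The number of fractional bits, the weights and the sample values are assumptions chosen for illustration and do not describe the firmware under development.

```python
FRAC_BITS = 15  # assumed number of fractional bits for the weights

# Hypothetical amplitude weights for a toy five-sample pulse.
WEIGHTS = [-4 / 7, 1 / 7, 6 / 7, 1 / 7, -4 / 7]


def quantize(weights, frac_bits=FRAC_BITS):
    """Convert real weights to signed integers scaled by 2**frac_bits."""
    return [round(w * (1 << frac_bits)) for w in weights]


def fixed_point_mac(int_weights, int_samples, frac_bits=FRAC_BITS):
    """Integer multiply-accumulate, then rescale with rounding to nearest."""
    acc = 0
    for w, s in zip(int_weights, int_samples):
        acc += w * s  # full-precision integer product, as in a hardware MAC
    # Add half an LSB before shifting so the result is rounded, not truncated.
    return (acc + (1 << (frac_bits - 1))) >> frac_bits


if __name__ == "__main__":
    samples = [45, 95, 150, 105, 55]  # hypothetical integer ADC samples
    print(fixed_point_mac(quantize(WEIGHTS), samples))
```

Comparing the integer result with the floating-point one for representative pulses is one way to choose a weight word length that keeps the quantization error below the ADC resolution.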
C. Phase 3: Extended Functionality Evaluation
The high performance of state-of-the-art FPGAs permits functionalities far beyond programmable logic. In the previous section the Cyclone III support for DSP processing was introduced, but these devices offer many other advantages. SOPCs are already a reality, even in not so recent FPGAs. The possibility of implementing embedded processors, the growing on-chip memory, and the high-speed I/Os available make FPGAs excellent candidates for higher system integration. A Nios II-based SOPC will be designed in the Cyclone III to perform system management and monitoring tasks, taking advantage of the possibility of using high-level programming languages like C or C++ instead of HDL code, thus simplifying the implementation of such tasks.
V. CONCLUSION
The implementation of DSP algorithms such as the OF in FPGAs has been the motivation for designing a PU prototype. The prototype has been manufactured and some validation tests have been performed successfully. The firmware needed for the implementation of the Optimal Filtering in the FPGA is already under development, and some milestones, like VME communication with the prototype, have been achieved. In the near future, studies of the parallel implementation of the algorithm, higher system integration and possible functionality extensions for the present system will be started.
