Study and Optimization of Particle Track Detection via Hough Transform Hardware Implementation for the ATLAS Phase-II Trigger Upgrade by Alfonsi, Fabrizio <1990>
Alma Mater Studiorum · Università di Bologna




02/A1 - EXPERIMENTAL PHYSICS OF
FUNDAMENTAL INTERACTIONS
Scientific Disciplinary Sector:
FIS/01 - EXPERIMENTAL PHYSICS
Study and Optimization of Particle Track
Detection via Hough Transform
Hardware Implementation for the
ATLAS Phase-II Trigger Upgrade









Final examination year 2021

“We are living through
the prehistory of a new humanity.”

Abstract
Technological improvement has always been an important driver of High
Energy Physics (HEP) research. At CERN, in Geneva, the Large Hadron Collider
(LHC) will undergo several major upgrades in the coming years. The instantaneous
and integrated luminosity will be increased up to $(5-7) \cdot 10^{34}$ cm$^{-2}$ s$^{-1}$
and 3000 fb$^{-1}$, respectively. Alongside the collider, the experiments exploiting
LHC will undergo upgrades that are crucial to fulfill the HEP goals. In particular,
the ATLAS experiment will continue its path towards physics beyond the Standard
Model by overcoming technological obsolescence and upgrading its sensors and
readout capabilities. The ATLAS upgrades are divided into phases, namely Phase-I
and Phase-II. The first is going to finish in the next months and the second will
start in 2024. Part of the ATLAS upgrade concerns the Trigger and Data Acquisition
systems. In particular, a major technological update of the ATLAS trigger is planned
for Phase-II: to withstand the increase in luminosity and in pileup, the latter up
to 200, new hardware architectures are under development.
My contribution to the Phase-I and Phase-II plans has focused on the electronics
upgrade of the Trigger and Data Acquisition (TDAQ) system. For the Phase-I upgrade
I worked on the commissioning of the new FELIX readout cards, the FLX-712, which
will be installed in part of the TDAQ system. The features of these boards will be
exploited for the new ATLAS sub-detector, the New Small Wheel, which is part of the
Muon Spectrometer, and for the upgrade of the Liquid Argon Calorimeter readout.
These cards are FPGA-based, with a bandwidth of up to 480 Gb/s, and exploit PCI
Express Generation 3 technology. My work focused on the preparation and follow-up
of part of the quality-control tests of the cards.
The ATLAS Phase-II trigger aims to increase its output data stream to the
Tier 0 by one order of magnitude. To achieve this, new methodologies and a
heterogeneous system are under development. For the ATLAS Phase-II upgrade I
developed an implementation of a tracking algorithm to fulfill the new trigger
requirements. This algorithm, known as the Hough Transform, is used to reconstruct
particle trajectories and has already been demonstrated to be suited to the ATLAS
specifications. In this thesis I present the study, the simulations and the hardware
implementation of a preliminary version of the Hough Transform algorithm on a
Xilinx UltraScale+ FPGA device. This research is now part of the official ATLAS
upgrade plan and will be further developed and reviewed in 2021. In fact in 2021
ATLAS will finalize






1.1 LHC Structure
1.2 LHC Parameters
1.3 LHC Physics Achievements and Future Plans
1.3.1 CMS
1.3.2 ALICE
1.3.3 LHCb
2 ATLAS
2.1 ATLAS Overview
2.2 Inner Detector
2.3 Calorimeters
2.3.1 Electromagnetic Calorimeter
2.3.2 Hadronic Calorimeter
2.4 Muon Spectrometer
2.4.1 Magnetic System
2.4.2 Forward Detectors
2.5 ATLAS Trigger
I ATLAS Phase-I Upgrade
3 New Detector Features for Run 3
3.1 New Small Wheel
3.2 Liquid Argon Calorimeter
3.3 TDAQ
3.4 FELIX Read-Out Upgrade
3.4.1 FELIX Environment and Architecture
3.4.2 From on-detector to off-detector
4 FLX-712 Commissioning
4.1 Tests Plans
4.2 Commissioning Status
II ATLAS Phase-II upgrade
5 Next ATLAS detector
5.1 New Goals of the Experiment
5.2 Inner Tracker
5.2.1 Pixel Detector
5.2.2 Strip Detector
5.3 High Granularity Timing Detector
5.4 Calorimeter
5.5 Muon Spectrometer
6 TDAQ
6.1 TDAQ Baseline
6.2 HTT Evolved (L1-Track)
6.3 HTT Variant
7 HTT Alternative Solutions
7.1 Hough Transform applied to ATLAS Phase-II
7.2 FPGA Technology
7.3 FPGA Implementation of HT Algorithm
7.3.1 Logic Structure of the HT implementation on FPGA
7.3.2 Implementation Techniques
7.3.3 Current Status
Conclusion
A Aurora 64b/66b
B FPGA
C Boundary Scan and JTAG
D Vivado Eye Diagram
After the PhD period
Introduction
Exploring the laws of particle physics is a scientific domain with a well-established
procedure and general instrumentation that have been known for decades: a particle
source (or accelerator), natural or man-made, a set of particle detectors, and
a system to acquire and analyze the information generated by the detectors. Across
the broad range of particle research applications, this procedure is carried out
following the steps required to extract the essential data. In particular, experiments
in High Energy Physics create and accelerate subatomic particles, restrict them to a
chosen range of energy and momentum, "shoot" them into specific locations, such
as inside a detector or against a target, make them follow a predetermined path,
for example through natural rock that absorbs the unwanted fragments,
extract energy, momentum and position, and in the end convert the information into
a form that is more readable for our eyes and minds. All of these activities have to
keep up with technological advancement to achieve the physics goals, in terms of
new particle sensors and new detector technologies. Moreover, part of
the technological advance includes the most recent digital devices, such as FPGAs
and CPUs, higher-performance data transmission cables, up-to-date electronics and
the latest design methodologies and strategies.
The research area of this thesis is High Energy Physics (HEP), in particular
the upgrade of the experiment known as ATLAS (A Toroidal LHC ApparatuS),
located at the CERN international laboratory, Switzerland, which exploits the
features and performance of the Large Hadron Collider (LHC). Hundreds of research,
development, commissioning and control activities are necessary to keep the
experiment at the forefront of physics research and technological advancement
worldwide.
On the sensor side, many features are driving research and development:
radiation hardness, output data rate, and time and spatial resolution.
The future hardware requirements call for a tolerance to total ionizing dose of
50 to even 100 Mrad in the ATLAS regions nearest to the beam pipe. Undesired
conditions, such as a single event effect causing a flip-flop upset, or a CMOS
current increasing steeply because of the total ionizing dose, are
well known. The detector output bandwidth is another crucial parameter to keep
up with the increasing pileup and luminosity of the accelerators. This is obtained
by pushing the front-end readout chips beyond the Gb/s range. In this case too, the
new nm-scale technologies are of extreme importance. In particular, the research
applies the so-called heterogeneity by exploiting many types of hardware devices,
such as CPUs, custom ASICs, GPUs and FPGAs. Commercial hardware has now reached
the 14 nm transistor technology node and is taking its first steps in the 7 nm range.
The new technology allows data streams of the order of Tb/s for readout cards, or
over 10 Gb/s per data lane.
The field of application of this thesis includes technological development, since
it uses FPGA-based data acquisition systems. In addition, it includes quality
control for the commissioning of a 20 nm FPGA-based card capable of forefront
transmission technologies, such as optical links over 100 Gb/s and the PCI Express
Gen 3 bus. Moreover, this thesis presents an FPGA-based algorithm to process
information generated by several ATLAS sub-detectors, implementing trigger
functionalities and increasing the discrimination capabilities. The reward of this
work is the new detection and readout capability of the detector, which becomes
able to increase its resolution, its discriminating power and the information
related to an event. All these aims are accompanied by data triggering strategies
that can manage more data.
This thesis describes part of my three-year experience as a PhD student of the
University of Bologna (Italy) related to this experiment. It is divided in two
parts, for the ATLAS Phase-I and Phase-II upgrades respectively. The first two
chapters describe the current working conditions of LHC and ATLAS,
with their achievements and plans for the future. The first part then describes
the work done on the Trigger and Data Acquisition (TDAQ) system upgrade
during the Long Shutdown 2. The third chapter describes the upgrades made in this
period to the ATLAS sub-detectors. Chapter 4 describes the Front-End Link
eXchange (FELIX) system, its role in the TDAQ upgrade and my work on it.
Part 2 starts with Chapter 5, which describes the upgrades that the ATLAS
sub-detectors are planned to undergo in the Long Shutdown 3, after Run 3.
Chapter 6 describes the planned strategies for the Phase-II TDAQ system and
finally Chapter 7 discusses an alternative solution for the Phase-II ATLAS
TDAQ tracking trigger, with the work done by me for a proposal




The Large Hadron Collider (LHC, [1]) is a 26.7 km two-ring superconducting
proton/ion accelerator. It is located at CERN (Conseil Européen pour la Recherche
Nucléaire), across the Switzerland/France border (Figure 1.1). Thousands
of researchers, engineers, technicians, students and others work to keep it the most
powerful particle accelerator in the world. Over the last twelve years this apparatus
has made our knowledge of the laws of physics deeper and stronger, reinforcing its
importance in the history of science. In the next sections the structure, the
parameters, the achievements and the road map of this particle accelerator are
described.
Figure 1.1: LHC position between France and Switzerland border.
1.1 LHC Structure
LHC was officially inaugurated on 21 October 2008, a month and a half after
the first protons were successfully circulated along the entire ring. It was installed
in the pre-existing tunnel built for the Large Electron-Positron collider (LEP), 100 m
underground. Because the particles used in the experiments are charged, a
high-frequency electromagnetic field is used to accelerate protons or lead ions
($^{208}\mathrm{Pb}^{82+}$). These particles are pre-accelerated in other machines
before reaching the LHC. For the protons these are: the linear accelerator LINAC2
(LINear ACcelerator 2), the Proton Synchrotron Booster (PSB), the Proton
Synchrotron (PS), which besides accelerating the particles also groups them into
packets, and finally the Super Proton Synchrotron (SPS). In the pre-LHC chain, the
protons are extracted from a small H2 tank, sent to LINAC2 and accelerated to
50 MeV. The beam is then injected into the PSB at 1 Hz and accelerated up to
1.4 GeV/proton. After this the beam enters the PS, where it is accelerated from 1.4
to 25 GeV/proton, and it is then ready to go to the SPS, where the proton beam
energy is increased to 450 GeV/proton. For the lead ions [33], the chain before the
LHC starts instead with a linear accelerator called Linac3, which brings the heavy
ions to an energy of 4.5 MeV/nucleon, and a Low Energy Ion Ring (LEIR), which
accelerates the Pb ions to 72 MeV/nucleon. They then enter the PS and follow the
same path as the protons before entering LHC, first reaching an energy of
5.9 GeV/nucleon and then 177 GeV/nucleon. In the last stage, LHC accelerates the
Pb ions to 1.38 TeV per nucleon. Figure 1.2 shows the LHC chain.
The last step of this chain, before the collision, is the LHC itself.
Two 26.7 km rings form the main structure, in which the proton packets run in
opposite directions. The charged particles are accelerated by a set of
Radio Frequency (RF) cavities, which also compensate for the synchrotron energy
loss. This phenomenon is to be avoided as much as possible because it causes a
significant loss of energy; in fact, it was one of the major motivations for a
proton-proton collider. This emission happens when a charged particle is
accelerated in a circular







it can be seen that the loss is proportional to the fourth power of the particle
energy, which makes it very important at high energies. Fortunately, because the
loss is inversely proportional to the fourth power of the particle mass, the use of
protons instead of electrons and positrons ($m_p \sim 2000 \cdot m_e$)
makes it possible to better control and reduce the loss. Besides being accelerated,
the packets need to be held on a circular trajectory and to be focused in all
directions. To keep the packets on the circular track, a set of 1232
Niobium-Titanium superconducting dipole magnets is used, each one able to produce
a magnetic field of 8.3 T. Superconductivity is essential to reach the magnetic
field needed and consequently to achieve the required center-of-mass collision
energy of 14 TeV. To focus the protons perpendicularly to the beam pipe, a set of
858 quadrupole magnets, placed one after another, is used. This
setup compresses the packets alternately in the two directions perpendicular to the
motion, because the quadrupole magnets are always arranged in pairs rotated by 90°
with respect to each other. Furthermore, other multi-pole magnets are used all over
LHC. The RF cavities focus the packets along the beam pipe. The particles travel in
vacuum chambers inserted in a long tunnel, which was part of LEP before LHC. This
structure was not well suited to the LHC design, especially for the magnets,
because of the one-ring architecture of LEP. This was solved by using twin-bore
magnets, consisting of two sets of coils. The required low temperature of 2 K is
provided by liquid Helium. Figure 1.3 shows the cross-section of LHC.
Figure 1.2: LHC acceleration chain and experiments.
Figure 1.3: Section of the LHC beam pipe.
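The scaling of the synchrotron loss with the fourth power of energy and the inverse fourth power of mass, discussed in this section, can be illustrated numerically. The following is a minimal sketch, assuming the standard ultra-relativistic expression for the energy lost per turn, $U_0 = e^2 \gamma^4 / (3 \varepsilon_0 \rho)$, and a dipole bending radius of 2804 m; neither the formula in this form nor the radius is given in the text, and the function and constant names are illustrative only.

```python
# Physical constants (SI units unless stated otherwise).
E_CHARGE = 1.602176634e-19       # elementary charge [C]
EPSILON_0 = 8.8541878128e-12     # vacuum permittivity [F/m]
M_PROTON_GEV = 0.93827208816     # proton rest energy [GeV]
M_ELECTRON_GEV = 0.51099895e-3   # electron rest energy [GeV]

# Assumption: bending radius of the LHC dipoles, not quoted in the text.
RHO_LHC_M = 2804.0


def loss_per_turn_ev(energy_gev, rest_energy_gev, bending_radius_m):
    """Synchrotron energy loss per turn, in eV, for beta ~ 1.

    U0 = e^2 * gamma^4 / (3 * eps0 * rho) is in joules; dividing by e
    converts it to electronvolts, which leaves a single factor of e.
    """
    gamma = energy_gev / rest_energy_gev
    return E_CHARGE * gamma**4 / (3.0 * EPSILON_0 * bending_radius_m)


# A 7 TeV proton versus a hypothetical 7 TeV electron on the same orbit.
proton_loss = loss_per_turn_ev(7000.0, M_PROTON_GEV, RHO_LHC_M)
electron_loss = loss_per_turn_ev(7000.0, M_ELECTRON_GEV, RHO_LHC_M)
mass_suppression = (M_PROTON_GEV / M_ELECTRON_GEV) ** 4

print(f"proton loss per turn at 7 TeV: {proton_loss:.0f} eV")
print(f"suppression factor (m_p / m_e)^4: {mass_suppression:.3e}")
```

Under these assumptions the proton loses a few keV per turn, while an electron of the same energy on the same orbit would lose about thirteen orders of magnitude more, which is the motivation for a proton machine given in the text.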
1.2 LHC Parameters
LHC has been designed to deliver an instantaneous luminosity of $10^{34}$ cm$^{-2}$ s$^{-1}$
and a center-of-mass energy of 14 TeV. The protons, after the whole pre-LHC chain,
reach an energy of 450 GeV/proton and are grouped in packets that are not initially
focused at the required level. Each packet contains about $10^{11}$ protons. This
feature is very important to reach a number of proton-proton collisions per bunch
crossing (pile-up) as high as possible ($\sim 15 - 50$ during Run 1 and 2,
$\sim 150 - 200$ targeted for High Luminosity LHC). The focusing inside LHC, besides
increasing the number of collisions, establishes an important characteristic to
which all the detectors using LHC need to conform: the 40 MHz collision frequency.
To summarize, two packets of $\sim 10^{11}$ protons, with the same energy and speed
and opposite directions, collide every 25 ns. Each packet is squeezed to 16 µm
radially and is 30 cm long. The energy reached in the collision center-of-mass
frame has to be computed relativistically, because of the speed gained by the
protons in the acceleration chain, 0.999999991 times the speed of light. The
Lorentz factor is 7460. Today the energy reached is 13 TeV for p-p collisions. The
instantaneous luminosity $L$, which expresses the collider performance, is defined as
$$L = f \, \frac{n_1 \cdot n_2}{4\pi \cdot \sigma_x \cdot \sigma_y} \qquad (1.2)$$
where
• $n_1$ and $n_2$ are the numbers of particles in the two colliding packets;
• $f$ is the revolution frequency of the bunches;
• $\sigma_x$, $\sigma_y$ are related to the transverse dimensions of the beam.






$$L = \frac{N_b^2 \cdot n_b \cdot f_{rev} \cdot \gamma_r}{4\pi \cdot \varepsilon_n \cdot \beta^*} \, F \qquad (1.3)$$
where
• $n_b$ is the number of bunches inside the ring;
• $N_b$ is the number of particles per bunch;
• $f_{rev}$ is the revolution frequency of the bunches in the accelerator;
• $\gamma_r$ is the relativistic Lorentz factor of the particles;
• $\varepsilon_n$ is the normalized transverse beam emittance;
• $\beta^*$ is the beta function at the collision point;
• $F$ is the geometric luminosity reduction factor, due to the crossing angle of
the two beams at the interaction point.
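The quantities listed above can be combined numerically as a plausibility check of the design luminosity of $10^{34}$ cm$^{-2}$ s$^{-1}$. The sketch below takes $n_b$ and $N_b$ from Table 1.1 and the 7 TeV beam energy from the text; the revolution frequency (11245 Hz), normalized emittance (3.75 µm), $\beta^*$ (0.55 m) and reduction factor $F$ (0.836) are assumed nominal values that do not appear in the text. It also recomputes the Lorentz factor and the proton speed quoted earlier in this section.

```python
import math

PROTON_REST_ENERGY_GEV = 0.93827208816


def lorentz_factor(beam_energy_gev):
    """Relativistic gamma of a proton with the given total energy."""
    return beam_energy_gev / PROTON_REST_ENERGY_GEV


def beta_from_gamma(gamma):
    """Speed in units of c for a given Lorentz factor."""
    return math.sqrt(1.0 - 1.0 / gamma**2)


def luminosity_cm2_s(n_b, N_b, f_rev_hz, gamma_r, eps_n_m, beta_star_m, F):
    """Instantaneous luminosity in cm^-2 s^-1 from the beam parameters."""
    lumi_m2 = (N_b**2 * n_b * f_rev_hz * gamma_r
               / (4.0 * math.pi * eps_n_m * beta_star_m)) * F
    return lumi_m2 * 1e-4  # convert m^-2 s^-1 to cm^-2 s^-1


gamma_r = lorentz_factor(7000.0)   # 7 TeV per beam
beta = beta_from_gamma(gamma_r)

lumi = luminosity_cm2_s(
    n_b=2808,            # Table 1.1
    N_b=1.15e11,         # Table 1.1
    f_rev_hz=11245.0,    # assumption
    gamma_r=gamma_r,
    eps_n_m=3.75e-6,     # assumption
    beta_star_m=0.55,    # assumption
    F=0.836,             # assumption
)

print(f"gamma = {gamma_r:.0f}, beta = {beta:.9f}")
print(f"L = {lumi:.2e} cm^-2 s^-1")
```

With these inputs the result lands close to the design value, and the Lorentz factor and speed agree with the figures quoted in the text; the agreement depends on the assumed machine parameters.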
Conceptually, $L$ represents the capability of the apparatus to generate physics
events, based on the energy and density of the particles. This parameter, along
with the run time and the cross section $\sigma$ of the physics process searched for,
determines the number of such events produced in a Run. With these ingredients we
can define the Integrated Luminosity
$$L_{int} = \int L \cdot dt \qquad (1.4)$$
which coupled with the cross section gives the total number of events in a Run
$$N_e = L_{int} \cdot \sigma_e.$$
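The relation between instantaneous luminosity, integrated luminosity and event count can be made concrete with a short calculation. This is a minimal sketch, assuming a constant luminosity of $10^{34}$ cm$^{-2}$ s$^{-1}$ held for $10^7$ s and a purely hypothetical cross section of 50 pb; both the running time and the cross section are illustrative values, not taken from the text.

```python
# 1 fb = 1e-39 cm^2, so 1 fb^-1 corresponds to 1e39 cm^-2.
CM2_PER_INVERSE_FB = 1e39


def integrated_luminosity_fb(samples):
    """Trapezoidal integral of (time [s], L [cm^-2 s^-1]) samples, in fb^-1."""
    total_cm2 = 0.0
    for (t0, l0), (t1, l1) in zip(samples, samples[1:]):
        total_cm2 += 0.5 * (l0 + l1) * (t1 - t0)
    return total_cm2 / CM2_PER_INVERSE_FB


def expected_events(l_int_fb, sigma_pb):
    """Number of events N = L_int * sigma, with sigma given in pb."""
    sigma_fb = sigma_pb * 1e3  # 1 pb = 1000 fb
    return l_int_fb * sigma_fb


# Hypothetical run: constant luminosity for 1e7 seconds.
run = [(0.0, 1e34), (5e6, 1e34), (1e7, 1e34)]
l_int = integrated_luminosity_fb(run)

# Hypothetical process with a 50 pb cross section.
n_events = expected_events(l_int, sigma_pb=50.0)

print(f"integrated luminosity: {l_int:.1f} fb^-1")
print(f"expected events: {n_events:.3g}")
```

The unit conversion is the step most easily mistaken in practice: luminosity is quoted in cm$^{-2}$ s$^{-1}$, while integrated luminosity and cross sections are quoted in inverse barns and barns.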
1.3 LHC Physics Achievements and Future Plans
[1][7][12][34] After having described the structure and primary parameters of LHC,
we now give an overview of the road map of this apparatus, including a description
of the main experiments and their detectors. The focus will be on the topic of
this thesis, the ATLAS experiment. The accelerator timeline is divided into
so-called Runs and Long Shutdowns: the former are the periods in which the collider
operates and the experiments take data, while the latter are the stop periods needed
for the upgrades of the accelerator and detectors. In the first period of operation
(Run 1) the LHC instantaneous luminosity reached $7.7 \cdot 10^{33}$ cm$^{-2}$ s$^{-1}$
and the proton collision energy ranged from 900 GeV up to 8 TeV. The bunch crossing
time was 50 ns, double the design specification. Energy and $L$ were over half of
the target values, which was very promising at that time. With these
parameters the Higgs boson was observed (2012). Run 1 concluded at the beginning
of 2013 and was followed by the Long Shutdown 1. This period was characterized
by a considerable amount of work on the collider, for example the consolidation of
the magnet interconnections. This was worthwhile, because in Run 2 (2015-2019)
almost all the design parameters were achieved, with $L$ of $1.58 \cdot 10^{34}$
cm$^{-2}$ s$^{-1}$, thanks to a bunch crossing period of 25 ns, and a center-of-mass
collision energy of 13 TeV. The current phase is the Long Shutdown 2, which will go
on until 2021. In this period Linac2 will be replaced by Linac4, with the aim of
reaching the design energy of 14 TeV and $L$ above the design specification,
$\sim (2 - 3) \cdot 10^{34}$ cm$^{-2}$ s$^{-1}$. All these features should bring
300 fb$^{-1}$ of integrated luminosity during Run 3 (2021-2023). After
that, the Long Shutdown 3 (2023-2026) will inaugurate the High Luminosity LHC
(HL-LHC) project, with a target $L$ of $(5 - 7) \cdot 10^{34}$ cm$^{-2}$ s$^{-1}$.
This could be achieved by enhancing the bending magnetic field from 8.33 to 13 T.
The 14 TeV threshold will not be surpassed, and the HL-LHC project should allow an
integrated luminosity of 3000 fb$^{-1}$ at the end of Run 4. All of the above is
summarized in Figure 1.4 and Table 1.1.
Figure 1.4: LHC road map.
Parameters                                      Nominal LHC    Nominal HL-LHC
$n_b$                                           2808           2748
Number of protons per bunch [$\times 10^{11}$]  1.15           2.2
Events per bunch crossing                       27             140
Beam energy at collision [TeV]                  7              7
Table 1.1: Current LHC parameters and future targets.
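The "events per bunch crossing" row of Table 1.1 can be related to the luminosity through the mean pile-up $\mu = L \cdot \sigma_{inel} / (n_b \cdot f_{rev})$. The sketch below assumes an inelastic proton-proton cross section of 80 mb and a revolution frequency of 11245 Hz, neither of which is given in the text, so it should only be expected to land near the tabulated values rather than reproduce them exactly.

```python
MB_TO_CM2 = 1e-27            # 1 millibarn in cm^2
F_REV_HZ = 11245.0           # assumption: LHC revolution frequency
SIGMA_INEL_MB = 80.0         # assumption: inelastic pp cross section


def mean_pileup(lumi_cm2_s, n_bunches,
                sigma_inel_mb=SIGMA_INEL_MB, f_rev_hz=F_REV_HZ):
    """Average number of inelastic collisions per bunch crossing."""
    interaction_rate_hz = lumi_cm2_s * sigma_inel_mb * MB_TO_CM2
    crossing_rate_hz = n_bunches * f_rev_hz
    return interaction_rate_hz / crossing_rate_hz


# Nominal LHC: design luminosity and n_b from Table 1.1.
mu_nominal = mean_pileup(1e34, 2808)

# HL-LHC: lower edge of the (5-7)e34 range quoted in the text.
mu_hl = mean_pileup(5e34, 2748)

print(f"nominal LHC pile-up: {mu_nominal:.1f}")
print(f"HL-LHC pile-up:      {mu_hl:.1f}")
```

With these assumed inputs the nominal case gives a value in the mid-twenties and the HL-LHC case a value well above one hundred, the same order as the 27 and 140 listed in the table.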
On the detector side, an overview of three main experiments will be shown in
the following lines, while ATLAS will be described in the next chapter.
1.3.1 CMS
Compact Muon Solenoid (CMS) [35] is a general-purpose detector designed mainly to
study new physics frontiers, such as physics beyond the Standard Model, and to
perform precision measurements, such as those of the top quark and Higgs boson
properties. The detector and sub-detector structure is the same as that of ATLAS,
but different technologies are used. The magnet system that bends the charged
particles to measure their momentum is a 4 T solenoid enclosed by a steel yoke. In
the last decade CMS pursued its purpose, achieving the discovery of the new excited
neutral $\Xi_b^{*0}$ baryon in 2011 [36], of the Higgs boson in 2012 and of its
decay into a bottom quark pair in 2018 [37].
1.3.2 ALICE
A Large Ion Collider Experiment (ALICE) [38] is a dedicated detector developed to
study the quark-gluon plasma state of matter. It uses heavy-ion collisions besides
proton collisions. Its structure is not symmetric but oriented in one direction.
Besides measuring the highest temperature ever produced by humanity, ALICE
helped to advance the knowledge of Quantum Chromodynamics (QCD), of the
Quark Gluon Plasma (QGP) and of the rarest states of matter.
1.3.3 LHCb
Large Hadron Collider beauty (LHCb) [39] is a completely asymmetric detector
using LHC to study one of the most important phenomena in physics, Charge-Parity
(CP) violation. Unlike the other major experiments at CERN, it is based on target
collisions. In fact, its structure is not onion-shaped like the others, but is made
of a series of detectors placed one after another, entirely in the forward
direction. In the last decades LHCb, also in collaboration with the other
experiments, contributed to many important achievements, such as the discovery of a
pentaquark [43], the observation of new Ξ baryons [40] and studies of the
supersymmetry parameter space.
Chapter 2
ATLAS
ATLAS [3] is a multi-purpose experiment whose detector is located at a site known
as Point 1 at CERN. It has a symmetric, onion-like layered structure, tuned
for proton-proton collisions. The upgrade under way in the current Long Shutdown 2
is called Phase-I. Its achievements will be used as a case study during the Long
Shutdown 3, in the Phase-II upgrade. These will be the second and third upgrades
that ATLAS goes through, and new sub-detectors will be added to the structure.
The detector architecture is described below. The next chapter will be dedicated to
the ATLAS sub-detectors and trigger planned for Run 3.
2.1 ATLAS Overview
As mentioned before, ATLAS is a symmetric cylindrical detector built around the
LHC collision Point 1, where each plane perpendicular to the beam pipe shows the
same detection technologies, configured according to the position. A picture of
ATLAS is shown in Figure 2.1. The goal of the detection chain is to determine the
time stamp and the spatial position of an event vertex and of the generated
particles, and to extract their momentum and energy. According to these parameters
it is then decided, within a time window, whether a relevant physics event occurred.
The coordinate system used in the experiment has its origin (0, 0, 0), in the X, Y
and Z axes, at the nominal Interaction Point (IP), where the proton bunches collide.
The beam line defines the Z axis. The plane perpendicular to the beam pipe, the X-Y
plane, is called the "transverse plane" and is fundamental in the kinematic study
because of the conservation laws for physics observables such as the transverse
momentum $p_T$ and energy $E_T$. This makes it possible to recover from the data
analysis important information such as the missing transverse energy $E_T^{miss}$.
The coordinates used in this case are polar, with $r$ the distance from the IP, the
azimuthal angle φ measured from the X axis and the polar angle θ measured from the
Z axis. Figure 2.2 shows the coordinate system. This system gives us the
possibility to define Lorentz-invariant
Figure 2.1: Scheme of the ATLAS experiment detector.
Figure 2.2: Description of the ATLAS coordinate system.
variables such as the rapidity $y$ (invariant under transformations along the Z axis)
$$y = \frac{1}{2} \ln\left(\frac{E + p_l}{E - p_l}\right) \qquad (2.1)$$
where $p_l$ is the particle longitudinal momentum. Other important quantities are
the angular separation between two particles expressed in terms of rapidity, given by
$$\Delta R = \sqrt{(\Delta y)^2 + (\Delta\phi)^2} \qquad (2.2)$$
and the pseudorapidity η, defined as
$$\eta = -\ln\left(\tan\frac{\theta}{2}\right) \qquad (2.3)$$
This representation makes it possible to identify the particle position using η, φ
and Z in a new Lorentz-invariant coordinate system, which also allows the
distance between two particles to be calculated as
$$\Delta R = \sqrt{(\Delta y)^2 + (\Delta\phi)^2} \simeq \sqrt{(\Delta\eta)^2 + (\Delta\phi)^2} \qquad (2.4)$$
for particles in the speed-of-light approximation. The steps applied by ATLAS
to general-purpose HEP research are schematized in Figure 2.3. Each ATLAS
sub-detector will be described in the next sections. The first one encountered by
the newly produced particles is the Inner Detector system, whose architecture
consists of three sub-detectors developed to track charged particles: the Pixel
Detector (PD), the SemiConductor Tracker (SCT) and the Transition Radiation Tracker
(TRT). After that, the surviving particles reach the calorimeter system, which
measures the total energy of the particles and also tracks the neutral ones. This
happens in the Electromagnetic Calorimeter for electromagnetically interacting
particles and in the Hadronic Calorimeter for hadronically interacting particles.
The last sub-detector is the Muon Spectrometer (MS), which is composed of the
Monitored Drift Tubes (MDT), the Cathode Strip Chambers (CSC), the Thin Gap
Chambers (TGC) and the Resistive Plate Chambers (RPC). These have been developed to
measure part of the track and the total energy of muons, which have a low $dE/dx$.
Finally, a system of magnets, shown in Figure 2.7, deflects the charged particles
to measure their momentum and identify them. It is composed of two sets, one in the
MS and one in the rest of the detector.
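The coordinate definitions of this section (rapidity, pseudorapidity and the separation $\Delta R$) translate directly into a few helper functions. The following is a minimal sketch; the function names are illustrative and not taken from any ATLAS software, and the wrapping of $\Delta\phi$ into $[-\pi, \pi)$ is a standard practical detail that the text does not spell out.

```python
import math


def rapidity(energy, p_l):
    """Rapidity y = 0.5 * ln((E + p_l) / (E - p_l))."""
    return 0.5 * math.log((energy + p_l) / (energy - p_l))


def pseudorapidity(theta):
    """Pseudorapidity eta = -ln(tan(theta / 2)), theta measured from the Z axis."""
    return -math.log(math.tan(theta / 2.0))


def theta_from_eta(eta):
    """Inverse of pseudorapidity(): polar angle for a given eta."""
    return 2.0 * math.atan(math.exp(-eta))


def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi


def delta_r(eta1, phi1, eta2, phi2):
    """Separation Delta R = sqrt(d_eta^2 + d_phi^2)."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))


# For a massless particle, rapidity and pseudorapidity coincide.
theta = 0.3
p = 10.0
y_massless = rapidity(p, p * math.cos(theta))
eta = pseudorapidity(theta)

# Polar angle at the edge of the tracking acceptance |eta| < 2.5.
theta_edge_deg = math.degrees(theta_from_eta(2.5))

print(f"y = {y_massless:.6f}, eta = {eta:.6f}")
print(f"theta at eta = 2.5: {theta_edge_deg:.1f} degrees")
```

The equality of $y$ and η in the massless case is the "speed-of-light approximation" under which the two forms of equation (2.4) agree.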
2.2 Inner Detector
Figure 2.3: Scheme of the ATLAS detecting chain.
Figure 2.4: Section of the Pixel Detector with the distances of the sub-detectors and layers
from the LHC beam pipe.
The task of the ATLAS Inner Detector (ID) is to identify the tracks of the charged
particles, especially those with a short lifetime, for which the information would
otherwise be lost before they reach the calorimeters. The ID information also makes
it possible to retrieve the positions of the primary and secondary interaction
vertices. The ID structure is shown in Figure 2.4, which describes a system
architecture based on a cylindrical arrangement of several redundant layers of three
sub-detectors, each one pointing to the LHC beam pipe.
• Pixel Detector (PD): the innermost detector covers |η| < 2.5 and is one
of the most precise trackers of ATLAS. It is composed of four layers based
on semiconductor technology of different types: Insertable B-Layer (IBL),
B-Layer, Layer-1 and Layer-2. The first was added in 2014 during the Phase-0
upgrade, is instrumented with FE-I4 silicon pixel sensor chips and is located
33.25 mm from the beam pipe. The other three are the original ones, are located
at 50.5, 88.5 and 122.5 mm respectively, and track particles using the FE-I3
pixel sensor.
• Semiconductor Tracker (SCT): together with the PD, the SCT is one of the
most precise trackers of ATLAS. It is based on microstrip technology, which
allows a better resolution along the privileged coordinate. It is composed of four
cylinders in the barrel region, covering |η| < 1.1 − 1.4, and of two end-caps,
each consisting of nine disks, covering 1.1 − 1.4 < |η| < 2.5. The SCT achieves
a resolution of 17 µm along the R-φ direction and 580 µm along the Z axis. The
barrel region is structured with eight layers of silicon microstrips. The sensors
are placed on modules, which have a surface of 6.36 x 6.40 cm², include 768
readout strips and are mounted on carbon-fiber cylinders at different radii.
The end-cap region uses tapered strips with one set aligned radially.
• Transition Radiation Tracker (TRT): this tracker is made of 1.43 m long
barrel layers and two end-caps. The technology used is based on 420000
carbon-polyimide straw detectors filled with a gas mixture: Xe (70%),
CO2 (27%), O2 (3%). Because their speed is close to c, the particles emit
transition radiation when passing from one material to another, with an
intensity that depends on the Lorentz factor γ. Thanks to the high number of
straws, this detector achieves a high tracking capability.
Table 2.1 shows the spatial resolutions of these three sub-detectors.
Detector                 Hits/track   Element size                  Hit resolution [µm]
Pixel, |η| < 2.5                      50 x 400 µm² (B0, L1, L2);
                                      50 x 250 µm² (IBL)
  4 barrel layers        3                                          10 (R-φ), 115 (Z)
  2 x 3 end-cap disks                                               10 (R-φ), 115 (R)
SCT, |η| < 2.5
  4 barrel layers        8            50 µm                         17 (R-φ), 580 (Z)
  2 x 9 end-cap disks                                               17 (R-φ), 580 (R)
TRT, |η| < 2.0
  73 barrel tubes        ∼30          d = 4 mm, l = 144 cm          130/straw
  160 end-cap tubes                   d = 4 mm, l = 37 cm
Table 2.1: Hit resolutions of the Inner Detector region.
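The pixel entries of Table 2.1 can be cross-checked with a common rule of thumb that is not stated in the text: an element of width $p$ read out in binary mode (hit or no hit) has a position resolution of $p / \sqrt{12}$, the RMS of a uniform distribution over the element. The sketch below applies this assumed rule to the 50 x 400 µm² pixel size.

```python
import math


def binary_resolution_um(pitch_um):
    """Position resolution of a binary-readout element: pitch / sqrt(12)."""
    return pitch_um / math.sqrt(12.0)


# Pixel dimensions from Table 2.1 (B0, L1, L2 layers).
PIXEL_SHORT_SIDE_UM = 50.0    # R-phi direction
PIXEL_LONG_SIDE_UM = 400.0    # Z direction

res_long = binary_resolution_um(PIXEL_LONG_SIDE_UM)
res_short = binary_resolution_um(PIXEL_SHORT_SIDE_UM)

print(f"400 um side: {res_long:.1f} um (table: 115 um along Z)")
print(f" 50 um side: {res_short:.1f} um (table: 10 um along R-phi)")
```

The long side reproduces the tabulated 115 µm closely. The short side gives a value somewhat larger than the tabulated 10 µm; one plausible reading, offered here only as an assumption, is that charge sharing between neighboring pixels improves on the purely binary estimate in that direction.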
2.3 Calorimeters
All the surviving particles that pass through the Inner Detector, except neutrinos,
interact with the calorimeter system and, excluding muons, are completely absorbed
by it. This gives us information on the tracks of the neutral particles and, most
importantly, the calorimeters measure the total energy of the interacting products.
A picture of the system is shown in Figure 2.5.
2.3.1 Electromagnetic Calorimeter
Figure 2.5: Picture of the ATLAS Calorimeter system.
The high-granularity Liquid-Argon (LAr) electromagnetic sampling calorimeter
(EMCAL) was developed to measure particles that produce electromagnetic showers,
such as electrons, photons and π0. The EM calorimeter is divided into a barrel and
an end-cap region. The first covers |η| < 1.475 and the second is composed of two
coaxial wheels, where the outer one covers 1.375 < |η| < 2.5 and the inner one
covers 2.5 < |η| < 3.2. The EM calorimeter is 22 radiation lengths (X0) deep in the
barrel region and > 24 X0 in the end-caps. This is because > 99% of the shower
energy is deposited within at most 20 X0. The detector is based on a lead-LAr
structure with accordion-shaped kapton electrodes, whose geometry gives complete φ
symmetry without azimuthal cracks; lead absorber plates are also placed all over
the detector. The good energy resolution and the intrinsic radiation hardness are
the reasons for the choice of liquid Argon. The region |η| < 2.5 is segmented in
three parts, where the first layer is finely segmented in η to achieve a high
photon-neutral pion separation.
In the barrel region it is possible to discriminate photons and electrons between ∼5






where 9.4 % is the stochastic term and 0.1 % is the constant one. The energy
response is linear within 0.1 %. There is a ”grey zone” in 1.37 < |η| < 1.52 not used
for precision measurements because of the presence of the barrel-endcap transition
zone, where the material reaches 7 X0.
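The stochastic and constant terms quoted above are conventionally combined in quadrature. As a sketch, assuming the standard parametrisation of the calorimeter energy resolution (with E expressed in GeV and no separate noise term, both of which are assumptions here), the two numbers enter as:

```latex
\frac{\sigma_E}{E} \;=\; \frac{9.4\,\%}{\sqrt{E\,[\mathrm{GeV}]}} \,\oplus\, 0.1\,\%
\;=\; \sqrt{\left(\frac{9.4\,\%}{\sqrt{E\,[\mathrm{GeV}]}}\right)^{2} + \left(0.1\,\%\right)^{2}}
```

Under these assumptions, at E = 100 GeV the stochastic term contributes 0.94 % and the total relative resolution is about 0.95 %.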
2.3.2 Hadronic Calorimeter
The purpose of the Hadronic Calorimeter (HCAL) is the same as that of the EMCAL, but
it targets a different fundamental force: the strong force, and in particular particles
that produce hadronic jets. It makes possible, for example, the study of the missing
transverse momentum. The HCAL is divided into three sub-detectors, two of which
(the end-cap and the forward calorimeters) use LAr technology:
• Hadronic Tile Calorimeter: its pseudorapidity range is |η| < 1.7. This
scintillator-tile calorimeter is composed of one central barrel and two smaller
extended barrels, one at each side of the central cylinder. The covered interaction
length is respectively 4.0, 1.4, and 1.8. This detector uses steel as absorber and
scintillating tiles as active material. The energy response to isolated charged pions
(this type of particle was used during the test beam) for combined LAr and






• Hadronic End-Caps Calorimeter: hadronic and electromagnetic particles are
detected in this region by applying LAr sensing technology. Copper is used







• Forward calorimeter: this block provides hadronic and EM measurements.
The absorbers are made of copper for the first layer and tungsten for the other







2.4 Muon Spectrometer
As mentioned before, some particles, such as neutrinos, are not detected, but can be
studied via the missing energy in the detector regions. Muons can be detected directly,
but their very low dE/dx makes a dedicated apparatus necessary, also
because the calorimeter system does not absorb all of the muon energy. In ATLAS
the Muon Spectrometer (MS) measures the energy and momentum of muons
and, alongside the calorimeters, provides the trigger signal for the entire
experiment. The spectrometer, shown in Figure 2.6, is composed of four
sub-detectors, making it the largest apparatus of ATLAS.
Figure 2.6: Overview of the ATLAS Muon Spectrometer.
The sub-detectors are further divided according to their purpose:
Precision Chamber
– Monitored Drift Tubes: drift chambers made of two layers of drift tubes, with 30
mm diameter aluminum walls, filled with Ar at 93 % and CO2 at 7 %.
The detector's specialty is the precise measurement of the z coordinate in the
barrel region, where it covers |η| < 2. A resolution of 80 µm is achieved.
The position is estimated from the drift time measured in a single tube.
– Cathode Strip Chambers: multi-wire chambers with strip cathodes for the
measurement of the muon momentum. The parallel anode wires are
perpendicular to 1 mm wide cathode strips of opposite polarity.
The anode-cathode distance equals the spacing between the anode wires,
typically 2.5 mm. The time resolution is 7 ns and the
spatial resolution is 60 µm in the φ direction and O(10^−2) m in η.
The pseudorapidity covered is 1.0 < |η| < 2.7.
Trigger Chambers
– Thin Gap Chambers: very thin multi-wire chambers placed in the end-cap
region. The anode-cathode spacing is smaller than the anode-anode
one, leading to a very short drift time (< 20 ns). A time resolution of 4
ns is required; to achieve it the TGCs work in saturation regime.
The chambers are filled with a highly quenching gas mixture of 55 % CO2 and
45 % n-pentane (C5H12). The spatial resolution is 4 mm in the radial direction
and 5 mm in the φ coordinate; furthermore,
these chambers improve on the φ-coordinate measurement provided by the
precision chambers.
– Resistive Plate Chambers: gaseous detectors made of parallel
electrode plates, which reach a spatial resolution of 1 cm in the two coordinates
and a time resolution of 1 ns. Pick-up strips read out the rectangular
layer system, with two orthogonal strips for each rectangle: η is measured
by strips parallel to the MDT wires and φ by the orthogonal ones. An electric
field of 4.9 kV/mm allows the formation of the avalanche generated by
the muon. The signal is read out on both sides of the chamber through
capacitively coupled strips. RPC and TGC cover |η| < 2.4.
2.4.1 Magnetic System
Figure 2.7 represents a scheme of the complete magnet curving system of ATLAS.
Figure 2.7: Graphical view of the ATLAS magnet system.
The magnet system bends the charged particles throughout the detector so that their
radius of curvature, and hence their momentum, can be measured. Its structure is
divided into three sections: the central solenoid, the barrel toroid and two end-cap
toroids. The solenoid sits between the ID and the calorimeters; it is 5.3 m long, with
a diameter of 2.4 m, and produces 2 T at 4.5 K. The barrel toroid is composed of eight
flat superconducting race-track coils, 25.3 m long; it produces 4 T operating at 4.5 K.
The end-cap toroids are positioned inside the barrel toroid, as shown in Figure
2.7, and produce 4 T at 4.7 K. They are rotated by 22.5° with respect to the barrel
toroid coils to optimize the bending power through the radial overlap of the two coil
systems.
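The link between curvature radius and momentum used above can be made concrete with a short sketch. It assumes a uniform field and a track perpendicular to it, which is an idealisation of the real ATLAS field map; the function names are illustrative only.

```python
# Minimal sketch: transverse momentum from the curvature radius in a uniform field.
# Assumption: ideal uniform field B perpendicular to the track; real field maps differ.

# p_T [GeV/c] = 0.2998 * |q| * B [T] * R [m]  (q in units of the elementary charge)
GEV_PER_TESLA_METER = 0.299792458


def transverse_momentum_gev(radius_m, field_t, charge=1):
    """Transverse momentum (GeV/c) of a particle bending with radius_m in field_t."""
    return GEV_PER_TESLA_METER * abs(charge) * field_t * radius_m


def curvature_radius_m(pt_gev, field_t, charge=1):
    """Inverse relation: curvature radius (m) for a given transverse momentum."""
    return pt_gev / (GEV_PER_TESLA_METER * abs(charge) * field_t)


if __name__ == "__main__":
    solenoid_field_t = 2.0  # central solenoid field quoted in the text
    for pt in (1.0, 10.0, 100.0):
        radius = curvature_radius_m(pt, solenoid_field_t)
        print(f"pT = {pt:6.1f} GeV/c -> R = {radius:8.2f} m")
```

In the 2 T solenoid field a 1 GeV/c track bends with a radius of roughly 1.7 m, while high-momentum tracks are almost straight, which is why the precision of the position measurements matters most at high pt.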
2.4.2 Forward Detectors
Figure 2.8 shows the complex infrastructure of the ATLAS Forward Detectors.
Figure 2.8: Position and scheme of the three ATLAS forward detectors.
• LUminosity measurement using Cherenkov Integrating Detector (LUCID) is
a Cherenkov counter divided into two detectors symmetrically positioned in
the two forward regions. To monitor the luminosity, the Cherenkov light
emitted by quartz crystals is detected by 16 photomultipliers in each region.
• Zero-Degree Calorimeter (ZDC) is a neutron detector for the forward region.
Placed on both sides of ATLAS, it is composed of an electromagnetic module,
three hadronic modules (made of tungsten as absorber and quartz rods for
the energy detection) and photomultipliers at the end. It covers |η| > 8.3.
• ATLAS Forward Proton (AFP) is placed in the two forward sides of the detec-
tor, exploiting a 3D silicon tracker and a time-of-flight apparatus to measure
the momentum and the energy of forward protons.
• Absolute Luminosity for ATLAS (ALFA) is the furthest sub-detector of AT-
LAS, 237 m from the IP. It exploits staggered layers of scintillating fibers
shaped as squares and photomultiplier tubes to measure pp scattering at small
angles.
2.5 ATLAS Trigger
The Research and Development activity that I personally carried out in this thesis
work is related to the future ATLAS Trigger and Data Acquisition (TDAQ) system.
This section gives an overview of the present TDAQ. The
future versions for Run 3 and Run 4 are presented in the next sections.
The ATLAS Run 2 Trigger is a two-level structure with the fundamental task of
reducing the data to be saved at the end of the Data AcQuisition (DAQ) system
of the experiment. This is done by extracting only the information related to
the "relevant" events under study at that particular moment. A scheme of the trigger
architecture is shown in Figure 2.9.
Figure 2.9: ATLAS TDAQ system.
As already said, during the run many events
are processed by the DAQ of the whole detector. The entire set of events cannot
be saved, because it would require a storage capacity incompatible with the available
technologies, such as hard disks and tapes, whose production and maintenance costs
(on the order of hundreds of PetaBytes (PB) of data produced per year) would be too high.
Thus only a few events can go through the whole DAQ chain and be stored for further
analysis. The trigger is started by signals from the Muon Spectrometer and the
Calorimeter system together. In the first level (L1), which is hardware based, the first
subsystems working together are the L1 calorimeter (L1Calo) and the L1 muon spectrometer
(L1Muon) triggers. The data are then processed and the Central Trigger Processor (CTP)
sends a Level-1 Accept (L1A) signal to the front-end readout electronics. The
Minimum Bias Trigger Scintillators (MBTS), the LUCID Cherenkov counter and the
ZDC are involved in this decision too. This set of sub-detectors provides the trigger
with signatures such as high-pt muons, electrons/photons, jets, τ leptons decaying into
hadrons and missing transverse energy. The data passing the hardware selection go
through the ReadOut Driver (ROD) structure, which applies fragment building and
associated error detection, data checking, transformation and monitoring. The data
are then received by a readout device called Read-Out System (ROS), which sends the
information to the High-Level Trigger (HLT), a processor farm exploiting 28k CPUs
to rapidly investigate the Regions of Interest (RoIs) identified by the L1. The RoI
analysis is based on the data received and on the reconstruction of regional tracks,
the latter using all the information coming from all the detectors in the
selected RoI. After these on-the-fly stages, the data from the accepted events are
sent to the local storage at the experimental site and lastly exported to the Tier-0
facility at CERN's computing center for offline reconstruction. As mentioned, the
trigger task is a first level of selection of the data to be saved at the
Tier-0. The huge amount of storage required and the low writing speed of the last
stages of data recording are overcome by reducing the rate of the data coming
from the detectors. The L1 reduces the rate from 40 MHz to 100 kHz, with the data held
in buffers and time stamped in the front-end circuits. The HLT achieves a further
reduction to 0.4-1 kHz.
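The rejection factors implied by the rates quoted above can be worked out directly. The short sketch below uses only the numbers given in the text (40 MHz, 100 kHz, 0.4-1 kHz); the variable and function names are illustrative.

```python
# Minimal sketch: trigger rejection factors from the rates quoted in the text.

BUNCH_CROSSING_RATE_HZ = 40_000_000   # 40 MHz input to L1
L1_OUTPUT_RATE_HZ = 100_000           # 100 kHz after L1
HLT_OUTPUT_RATES_HZ = (400, 1_000)    # 0.4-1 kHz after the HLT


def rejection_factor(rate_in_hz, rate_out_hz):
    """How many input events there are for each event that is kept."""
    return rate_in_hz / rate_out_hz


if __name__ == "__main__":
    l1 = rejection_factor(BUNCH_CROSSING_RATE_HZ, L1_OUTPUT_RATE_HZ)
    print(f"L1 keeps 1 event in {l1:.0f}")
    for hlt_rate in HLT_OUTPUT_RATES_HZ:
        hlt = rejection_factor(L1_OUTPUT_RATE_HZ, hlt_rate)
        total = rejection_factor(BUNCH_CROSSING_RATE_HZ, hlt_rate)
        print(f"HLT output {hlt_rate} Hz: HLT keeps 1 in {hlt:.0f}, "
              f"overall 1 in {total:.0f}")
```

With these numbers the L1 keeps one bunch crossing in 400, and the full chain keeps between one in 40 000 and one in 100 000.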






Chapter 3
New Detector Features for Run 3
At the time of writing, the ATLAS experiment is in the middle of the
Long Shutdown 2 (LS2), during which some sub-structures of the detector are being
upgraded and new sub-detectors are being added. These changes are required for
several reasons, including:
• to test the DAQ system concept that will manage ATLAS during Run 4, the
FELIX system;
• to extract more precise measurements of the Higgs boson;
• to upgrade sub-detectors in anticipation of ATLAS Phase-II;
• to continue the research and study of new physics.
These targets will be pursued mainly by increasing the luminosity to
∼ 2−3·10^34 cm^−2 s^−1 and by focusing the upgrade on the Level-1 trigger. The planned
features and their scope are detailed in the next sections.
3.1 New Small Wheel
The Muon Spectrometer will be modified mostly in the end-cap region, in the range
1.0 < |η| < 2.7. This change aims to solve two issues of the current sub-detector that
could occur at high luminosity: the reduced acceptance for good muon tracking and
a too high rate of fake high-pt L1 muon triggers coming from the forward direction.
The upgrade foresees the replacement of the "Small Wheel" sub-detector in the
inner end-cap region with the "New Small Wheel" (NSW) [5][42], characterized by
improved spatial and time resolution and capable of sustaining the increased particle
rates. Table 3.1 shows the expected L1 rates with and without the upgrade. Without
any modification, the present end-cap technology could suffer a reduced L1 trigger
efficiency for low-pt lepton signals and a resulting loss of high-quality tracking in
this region (especially at high momentum, above 100 GeV). The NSW is expected to
work at rates up to 15 kHz/cm2 at |η| ∼ 2.7, and the targeted spatial resolution is
100 µm, needed to maintain the pt resolution at 1 TeV.

L1MU threshold (GeV)              L1 rate (kHz)
pt > 20                           60 + 11
pt > 40                           29 + 5
pt > 20 barrel only               7 + 1
pt > 20 with NSW                  22 + 3
pt > 20 with NSW and EIL4         12 + 2
Table 3.1: ATLAS L1 muon trigger rates for different thresholds, with and without the
upgrade.

Figure 3.1 describes the technology of the sub-detector, based on a micro-mesh gaseous
structure.
Figure 3.1: Scheme of the New Small Wheel detector technology.
When a muon crosses the thin wireless gaseous detector, it ionizes the gas in the drift
region defined by a planar drift electrode, producing pairs of positive and negative
charges; the negative charges are then attracted by the electrostatic field into the
amplification region defined by the metallic mesh, and the amplification products are
directed to the readout electronics. The NSW detector is composed of two wheels, each
with eight sectors on both the front and the back side, for a total of 32 sectors. One
sector is composed of two MicroMegas wedges packed between two small-strip TGCs
(sTGCs); the chambers are filled with 93 % Ar and 7 % CO2, reaching a resistivity of
10 MΩ/cm and an amplification factor of 10^4.
3.2 Liquid Argon Calorimeter
The improvement of the L1 trigger will also be pursued by upgrading the read-out of
the Liquid Argon Calorimeter (LAr) [6]. The goals are better granularity and
resolution, and more longitudinal shower information during the trigger processing.
The proposed 10-fold increase in granularity, shown in Figure 3.2, allows a better
energy study for electrons. The proposed granularity is called "Super
Cell" granularity, and Table 3.2 shows the targeted values in terms of pseudorapidity η
and azimuthal angle φ. This change enables the use of shower-shape variables for a
more effective identification of electrons, photons and τ leptons, and sharpens the
electromagnetic (EM), jet and E_T^miss efficiency turn-on curves.
Figure 3.2: Scheme of a 70 GeV transverse momentum electron interacting with the
existing L1 calorimeter trigger electronics (a) and with the proposed strategy (b).

                 Elementary Cell    Trigger Tower           Super Cell
Layer            ∆η × ∆φ            nη × nφ   ∆η × ∆φ       nη × nφ   ∆η × ∆φ
0 - Presampler   0.025 x 0.1        4 x 1     0.1 x 0.1     4 x 1     0.1 x 0.1
1 - Front        0.003125 x 0.1     32 x 1    0.1 x 0.1     8 x 1     0.025 x 0.1
2 - Middle       0.025 x 0.025      4 x 4     0.1 x 0.1     1 x 4     0.025 x 0.1
3 - Back         0.05 x 0.025       2 x 4     0.1 x 0.1     2 x 4     0.1 x 0.1
Table 3.2: Phase-I LAr upgrade cell coverage for the various regions of the sub-detector.

A new LAr Trigger Digitizer Board (LTDB) read-out electronics has been developed and
produced, which also includes the infrastructure for the new Layer Sum Board and new
Base Plane. The LTDB, shown in Figure 3.3, has been developed to adapt and
digitize the Super Cell signals.
Figure 3.3: LAr Trigger Digitizer Board for the Phase-I LAr Calorimeter upgrade.
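The 10-fold granularity increase quoted in this section can be cross-checked from the nη × nφ columns of Table 3.2. The sketch below counts, layer by layer, how many Super Cells fit into one Trigger Tower; the dictionary layout and names are illustrative, and only the cell counts are taken from the table.

```python
# Minimal sketch: number of Super Cells per Trigger Tower, from Table 3.2.
# Each entry gives (n_eta, n_phi) elementary cells per trigger tower / per super cell.

LAYERS = {
    "presampler": {"tower": (4, 1), "super_cell": (4, 1)},
    "front": {"tower": (32, 1), "super_cell": (8, 1)},
    "middle": {"tower": (4, 4), "super_cell": (1, 4)},
    "back": {"tower": (2, 4), "super_cell": (2, 4)},
}


def super_cells_per_tower(layer):
    """Super Cells needed to tile one Trigger Tower in a given layer."""
    (tower_eta, tower_phi) = layer["tower"]
    (sc_eta, sc_phi) = layer["super_cell"]
    return (tower_eta // sc_eta) * (tower_phi // sc_phi)


def total_super_cells_per_tower(layers=LAYERS):
    """Sum over the four longitudinal layers."""
    return sum(super_cells_per_tower(layer) for layer in layers.values())


if __name__ == "__main__":
    for name, layer in LAYERS.items():
        print(f"{name:10s}: {super_cells_per_tower(layer)} Super Cell(s) per tower")
    print("total:", total_super_cells_per_tower())
```

The count is 1 + 4 + 4 + 1 = 10 Super Cells where the existing system has a single Trigger Tower, consistent with the 10-fold figure.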
3.3 TDAQ
The Trigger and Data Acquisition [7] system of ATLAS will undergo an
upgrade that adds targeted technologies and strategies to fulfill the Run 3 requirements
and to anticipate some upgrade operations for Run 4. These changes, along with the
reasons mentioned before, are also driven by the pile-up growth, with an expected
number of interactions per bunch crossing of up to 80 during Run 3. The current system,
shown in Figure 2.9, will evolve into the structure described by the scheme in Figure 3.4.
Figure 3.4: Scheme of the ATLAS Phase-I TDAQ chain.
The main parts to be upgraded are:
• the L1 Calorimeter Trigger electronics, which will include the new electron
Feature EXtractor (eFEX), jet Feature EXtractor (jFEX) and global
Feature EXtractor (gFEX) processor cards, respectively the hardware for
electron-based, jet-based and large-R-based trigger selection;
• the RODs and the optical plant;
• the L1 Muon Trigger sector logic for the end-cap;
• the Central Trigger Processor (CTP) and the muon-to-CTP interface (MUCTPI);
• a new L1 Topological Processor, which will perform real-time event selection
by studying the relations between trigger objects (muons, jets, Emiss, electrons
and taus) with variables such as HT, Meff and Minv;
• a new Tile Calorimeter muon trigger;
• an NSW trigger processor.
The new trigger functions have to be performed within 2.7 µs. The upgrade will
also involve the Read-Out System (ROS), with a hardware update. The Level-2
event builder and the High Level Trigger (HLT) will be merged in a single HLT PC farm.
One of the most relevant TDAQ upgrades is a new electronic readout card that will
be used in the new detectors and is the readout electronics targeted for Phase-II.
This is the FLX-712 card, developed by the Front-End Link EXchange (FELIX)
[19] collaboration.
3.4 FELIX Read-Out Upgrade
The first Part of this thesis describes the general state of the experiment
driving the LS2 operations, and it mainly concerns my personal contribution to the
ATLAS Phase-I upgrade. In more detail, I developed from scratch the setup and
part of the tests for the commissioning and production of the FELIX FLX-712 cards.
The three sectors that will exploit the new FELIX technology in the "off-detector
region" are the NSW, part of the LAr and the TDAQ. The scheme of the NSW
FELIX structure is shown in Figure 3.5.
Figure 3.5: Overview of the NSW FELIX system.
Starting from the bottom, GBT E-links, organized per function and per FE chip,
connect the on- and off-detector areas through optical transmission. The data from
the 48 high-speed links of the FLX-712 then undergo protocol processing and
organization, and the front-end readout stream is sent to the Commercial
Off-The-Shelf (COTS) network switch. This manages
the data transmission to the ROD, the FE configuration and the Detector Control
System (DCS) reception of the FELIX cards, using a 40 Gb/s per lane ethernet
network. A Busy network and a Timing, Trigger and Control (TTC) network are,
respectively, the bottleneck handler and the synchronizer of the architecture.
In the LAr case, the FELIX system
enters the LAr Digital Processing System (LDPS), the main interface between
the FE and the TDAQ. Figure 3.6 shows the LDPS structure and its position in the
LAr architecture.
Figure 3.6: Scheme of the LAr Calorimeter readout for ATLAS Phase-I.
The FE, on the left side of the picture, transmits data at 40 MHz via the LTDB
to the off-detector electronics. The receiver is the LDPB, which
communicates with a PC farm for data monitoring via an ethernet network at up to 40
Gb/s. It also talks with the FELIX system through GBT links, and it is managed by
a Partition Master (PM) PC, which monitors, reprograms and configures it, and
by the TTC partition. The TTC configuration and monitoring of the LTDB is managed
by another FELIX system, which interfaces with the detector via GBT links. The FELIX
blocks connect to the ATLAS DCS, and the one managing the front-end
data is provided with a Busy signal for bottleneck control. For the TDAQ
infrastructure, Figure 3.7 explains the role of the FELIX system in the data
acquisition engine. It contains the two NSW and LAr structures shown before. The next
pages describe the FELIX environment and the operations implemented for the
production.
Figure 3.7: Overview of the ATLAS TDAQ upgrades for Phase-I focused on the FELIX
readout.
3.4.1 FELIX Environment and Architecture
The Front-End LInk eXchange (FELIX) system was born with the intent of delivering
a generic, versatile DAQ architecture that is easy to set up and use, by implementing
common and configurable devices. These guidelines resulted in the development of
a DAQ system based on an "evaluation-like" FPGA board with only two main I/O
technologies: high-speed optical fibers (up to 768 Gb/s) and PCI Express
Generation 3 (up to 120 Gb/s). The FPGA-based readout card is connected to a
PCI Express server. This development choice will lead, starting with Phase-I
and completing with the Phase-II upgrade of ATLAS, to a unification of all the
detector front-end DAQs. This project is carried forward by a collaboration of
many universities and laboratories, such as Brookhaven National Laboratory (BNL),
Amsterdam National Institute of Sub-atomic Physics (Nikhef), CERN and others.
The current status of the project is the completion of the commissioning acceptance
tests of the Phase-I hardware and software that will run on some of the new Run 3
ATLAS sub-detectors, as described before.
3.4.2 From on-detector to off-detector
The area of activity of the FELIX system starts from the inputs and outputs of
the front-end modules. These are handled by devices that collect the I/Os of the
front-ends, using optical fiber cables to connect the radiation-tolerant on-detector
electronics to the off-detector zone. The data communication is bidirectional:
the path to the detector (transmission) carries the front-end chip control and
configuration, and is implemented with the GigaBit Transceiver
(GBT) FPGA protocol. The reverse path (reception) acquires the data coming from
the detector calibration and detection procedures, using a specific protocol called
FULL Mode. In another mode, as shown above, both transmission and reception
are handled by the GBT FPGA protocol (for example in the LTDB case). These two data
protocols are implemented by the high-speed I/Os of the FLX-712 read-out card, shown
in Figure 3.8. As said before, this board is a custom FPGA-based card developed
to put the front-end chips in communication with the first DAQ servers. The
high-speed I/Os to the front-end and the PCI Express (PCIe) Gen 3 interface to the
DAQ PC are controlled by a Xilinx Kintex UltraScale FPGA. Focusing on the FLX-712
hardware features, the card exploits:
• a Multi-fiber Termination Push-on (MTP) connector providing 48 optical channels
in both directions, connected to the FPGA by 96 optical fibers merged in
eight bundles. The optical-to-electrical conversion and vice versa is done by
eight 12-channel devices named MiniPOD;
• a Kintex UltraScale XCKU115-FLVF1924-2E FPGA;
• a 16-lane PCIe Gen 3 interface, with a bandwidth up to 120 Gb/s;
• a PEX8732 PCIe switch, to connect the two separate 8-lane PCIe firmware
structures to the 16-lane connector;
• a 2 Gb flash memory to store the various firmware images used to completely
reconfigure the FPGA functionality, and an ATMEGA324A micro-controller to control
the FPGA and flash memory re-configuration remotely via the DAQ PC SMBus;
• a Timing, Trigger and Control (TTC) mezzanine, mechanically mounted on the board,
to receive the LHC 40 MHz clock using an ADN2814 clock and data recovery chip as
clock stabilizer and cleaner;
• a 16-layer stack-up PCB;
• a stiffener metal plate mechanically mounted below the card to keep it
planar and absorb bowing and twisting tension, in order to avoid damage to the BGA
and the on-board chips.
Figure 3.8: Picture of the FLX-712. 1) FPGA Kintex Ultrascale, 2) MiniPOD, 3) PCIe
Gen 3 connector, 4) MTP connector, 5) TTC mezzanine, 6) PCIe Switch, 7) Power
manager.
Two versions of the FLX-712 have been developed, which differ in the number of
high-speed I/Os: 24 or 48 channels. The FPGA firmware is structured as shown in
Figure 3.9. Its purpose is to convert the data coming from any direction into
one of three protocols, depending on the scope:
• GBT FPGA, a 4.8 Gb/s CERN-made protocol derived from the GBTx
on-detector protocol chip;
• FULL Mode, a 9.6 Gb/s custom protocol developed inside the
collaboration to achieve a bandwidth close to 10 Gb/s;
• PCI Express, a proprietary and very versatile protocol.
Figure 3.9: FLX-712 firmware scheme.
In more detail, the PCIe interface is built partially with a hardware core in the FPGA
fabric. The data coding is the standard 128b/130b, a very efficient code with a
payload fraction (percentage of bits used for the actual data, excluding the bits
related to the link management) of about 98.5 % (given by 128 bits over 130 bits).
Currently the FPGA fabric implements two 8-lane PCIe cores, which explains the need
for the PCIe switch to control the two cores concurrently and consequently exploit
the full PCIe capability.
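The 128b/130b efficiency and the resulting link bandwidth can be checked with a few lines. This is a sketch assuming the nominal PCIe Gen 3 line rate of 8 GT/s per lane; it accounts only for the line coding and ignores packet headers and other protocol overheads, so the achievable throughput is lower than the number it prints.

```python
# Minimal sketch: PCIe Gen 3 bandwidth after 128b/130b line coding.
# Assumption: nominal Gen 3 line rate of 8 GT/s per lane; protocol overheads ignored.

GEN3_LINE_RATE_GT_PER_S = 8.0


def encoding_efficiency(payload_bits=128, block_bits=130):
    """Fraction of transmitted bits that carry data."""
    return payload_bits / block_bits


def coded_bandwidth_gbps(lanes, line_rate=GEN3_LINE_RATE_GT_PER_S):
    """Data bandwidth in Gb/s for a link with the given number of lanes."""
    return lanes * line_rate * encoding_efficiency()


if __name__ == "__main__":
    print(f"128b/130b efficiency: {100 * encoding_efficiency():.2f} %")
    for lanes in (8, 16):
        print(f"x{lanes} link: {coded_bandwidth_gbps(lanes):.1f} Gb/s")
```

For the 16-lane connector this gives about 126 Gb/s before protocol overheads, which is compatible with the "up to 120 Gb/s" figure quoted for the card; each of the two 8-lane cores accounts for half of it.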
The FLX-712 Phase-I firmware has a twofold architecture. Starting from the left
side of Figure 3.9, there is the front-end protocol wrapper, which implements the
data format and configures the high-performance transceivers. This block has
two-sided bidirectional I/O ports designed to connect the data/configuration
parameters from/to the front-end with the inputs/outputs of the central block of the
system, called Central Router. The Central Router applies the synchronization and
buffering procedures needed to put in communication two different data bandwidths
(FULL Mode or GBT FPGA with PCIe) by instantiating a First-In First-Out (FIFO)
structure. Besides this, the Central Router directs the data from and to all the
channels on both sides (high-speed I/O and PCIe I/O). The two twin Central Routers
communicate with each other and are the only logic blocks driven by the TTC mezzanine.
Concluding the description, the right block in Figure 3.9 represents the logic
communicating directly with the PCIe fabric inside the FPGA. The custom-developed
core, called Wupper, is a Direct Memory Access (DMA) based PCIe structure
which uses user-logic FIFOs as buffer memory. Wupper consists of a DMA
Control block, which monitors and directs the DMA descriptors, and a DMA Read and
Write core, which acquires and transmits the data stream in both directions. The
FELIX software suite is shown in Figure 3.10.
Figure 3.10: Scheme of the FELIX environment from firmware (on the left) to the software.
Its structure is composed of four cores:
• FELIX Low Level software: this is the architecture connecting the FELIX
cards inserted in the PCIe connectors to the OS. Using custom PCIe designed
drivers developed by the FELIX group it is possible to test and configure the
FPGA firmware;
• FELIX Star: this is the central application of the FELIX system, managing
the data transfer between one or more cards and the NetIO clients. It
implements a set of software functions: packet forwarding from/to
the DAQ system to/from the front-ends, a tool to configure FELIX cards and
GBT E-Links, a tool to gather statistics and performance figures, and a tool to report
the operational status of the detector links and implement error recovery;
• FELIX Bus: this core supports the communication between instances of FELIX
Star and their clients, for example the Software ROD (SwROD). FELIX
Bus can have any number of clients, and it is used to publish information,
keeping all the subscribers updated about new communications and
un-subscriptions. It also distributes tables associating each E-Link with its
host/port, as well as tables for monitoring purposes;
• NetIO: this is a message-based networking library functioning as the FELIX
network stack, used for typical use cases in DAQ systems.
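The publish/subscribe role described for FELIX Bus can be illustrated with a toy model. The sketch below is not the FELIX Bus API: the class, method and topic names are all hypothetical, and it only shows the general pattern of clients subscribing to a topic and receiving an updated E-Link to host/port table whenever an entry changes.

```python
# Toy publish/subscribe model (hypothetical names, NOT the real FELIX Bus API).
from collections import defaultdict


class ToyBus:
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks
        self.elink_table = {}                  # E-Link id -> (host, port)

    def subscribe(self, topic, callback):
        """Register a client callback for a topic."""
        self._subscribers[topic].append(callback)

    def unsubscribe(self, topic, callback):
        """Remove a previously registered callback."""
        self._subscribers[topic].remove(callback)

    def publish(self, topic, message):
        """Deliver a message to every current subscriber of the topic."""
        for callback in list(self._subscribers[topic]):
            callback(message)

    def register_elink(self, elink_id, host, port):
        """Update the E-Link table and notify the subscribers with a copy of it."""
        self.elink_table[elink_id] = (host, port)
        self.publish("elink_table", dict(self.elink_table))


if __name__ == "__main__":
    bus = ToyBus()
    bus.subscribe("elink_table", lambda table: print("client sees:", table))
    bus.register_elink(0x10, "felix-host-a", 12345)  # illustrative values only
```

The design point this mirrors is that the clients never query the cards directly for routing information: they learn where each E-Link is served from the tables distributed over the bus.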
Chapter 4
FLX-712 Commissioning
[21][41] The number of FLX-712 cards produced so far to fulfill the Phase-I and
Phase-II plans is 254. The cards were produced by a private
company, so as to delegate to a contractor the large number and variety of
high-performance instruments, operations and quality control tests required. The
production campaign of the new readout cards began in summer 2019, but the preparation
of the board acceptance and qualification tests started over a year and a half earlier.
The following pages describe the reasons, the plans, the setup and the conclusions of
the verification task. My personal contribution was to prepare the software tests as
part of the whole suite and to check the hardware performance of the brand new
FLX-712s. The quality checks covered all the hardware components of the FLX-712,
from the PCB to the individual components, the soldering and the mechanical joints. I
also tested the functionality and performance of the whole card, including long
emulated acquisition runs before and after putting the cards under extreme
environmental conditions.
4.1 Tests Plans
The FLX-712 will have to run throughout Run 3 until the next upgrade. The
connectivity between the front-end and the back-end of ATLAS is also built taking into
account the possibility of problematic or unusable lanes. The need to replace one or
more boards would require a stop in the data acquisition from that specific portion
of the detector. The environmental parameters, such as the temperature and supply
voltage, will be monitored to avoid failures that could cause some FLX-712s to
break. Considering the many cases that could occur, it is fundamental to be
able to certify the quality of the hardware. The cards will also be used for testing the
FELIX environment with the new Phase-II detectors. For this reason the cards have
been tested to ensure their stability and functioning for at least 5 years, possibly
extended to 10. Particular care has been devoted to the performance of the
high-end components such as the transceivers. For these components I monitored the
data flow during the whole working time of the cards to ensure a very high performance.
I also measured the Bit Error Rate (BER), which is the number of
erroneous bits divided by the total number of bits transferred in a data stream. In
addition to this, the stability of the power supply, the accuracy of the clock sources
and the functionality of the slow control architecture are crucial. Reliability and
reproducibility of the FELIX performance were also monitored. The specifications
validated by the contractor for a first estimate of the production quality were
extended at CERN for product acceptance. These checks are described below.
• The PCB (Printed Circuit Board) quality must meet several
standards: NADCAP (National Aerospace and Defense Contractors Accreditation
Program) for the production, several IPC (Institute of Printed Circuits) standards
and the Cleanliness Designator C-22 for the assembly and cleanliness, and finally the
IPC and PCI-SIG PCI Express standards for the finished product, which
must pass and be certified with an IPC certification.
• The manufacturer must send the components and the PCB with the required
certification attesting the standard controls.
• The storage of PCB and components must conform to strict standards which
also specify the maximum time between the production and the soldering, in
order to ensure their correct functioning after assembly.
• During the whole assembly and testing campaign, all electronic devices must go
through operations under static and dynamic temperature conditions (baking
and thermal cycles). The contractor must ensure: the dehumidification of
the devices at static temperature; the soldering of the components to the
PCB using the dynamic temperature curves defined by the specification; and
the testing of the hardware stability at high temperature, raising it at
10-15 °C/min until 100-110 °C is reached. High-temperature thermal cycles must also
be applied between two or more functional tests, for example an eight-hour
data acquisition run, to check whether the high temperature changes the device
performance (this was done at CERN).
• X-ray inspection is performed by the contractor to study the quality of the
PCB before and after the mounting operations. In particular, this inspection is
used to detect soldering defects, especially for components such as the FPGA
that come in a BGA (Ball Grid Array) package, whose pins are underneath the component
body and whose solder joints cannot be checked by the Automated Optical
Inspection (AOI) tool.
• The contractor checks include the controls of the individual components and
of the card connections before and after the mounting, and before and after the
first power-on. For example, it is important to check whether specific pins of the
on-board level-shifter devices are grounded or not. Passive measurements of
current, voltage and resistance also had to be done on the bare PCB. There are
two methods to carry out these tests: manually by the company operators, or
using an instrument called Flying Probe, which acts on the bare PCB.
• Full Automated Optical Inspection (AOI) of the boards, an automated control
looking for catastrophic failures (missing components, insufficient solder joints,
etc.).
• Solder Paste Inspection must be carried out too.
• Bit Error Rate (BER) and eye-diagram tests, which measure the performance of the
high-speed FPGA lanes, must be run after the previous tests.
• General functionality tests of the FLX-712, using a FELiX server and the FELiX
Low Level software, must be performed as the last step.
The last two tests were prepared by the FELiX group, including myself, to verify
the basic functionality of the complete FLX-712 device immediately after
checking that the card powers up without issues. I prepared a manual for the
contractor covering these two tests, including the FLX-712 mechanical assembly
procedure and the passive component checks. The BER test has been
designed to check the highest-performance technology of the XILINX Ultrascale FPGA,
the transceivers. These are Intellectual Property (IP) hardware components composed
of many internal blocks, such as Phase-Locked Loops, serializers and de-serializers,
designed to reach data rates over 16 Gb/s in the case of the FLX-712
FPGA. A scheme of this architecture is shown in Figure 4.1, where PLL stands for
Phase-Locked Loop, CPLL for Channel PLL, related to a single transceiver
lane, and QPLL for Quad PLL, which manages the clocking of an entire quadruplet
of transceivers. These high-speed lanes have been tested with two methods,
as mentioned above: BER and eye diagram. The first consists of programming
the FPGA to implement a data transmission in loop-back mode (where the data
transmitted are received by the same FPGA) and counting the erroneous bits over
the total bits sent. The GUI of the tool used (Vivado) to measure the BER of the serial
links under test is shown in Figure 4.2. However, the BER test alone does not give
enough information on the performance of the serializers. For this reason, the eye
diagram of each lane (24-48 lanes per card, for a total of about 254
FLX-712s) has also been acquired separately. The eye diagram is a procedure to
test the quality of a serial lane by studying the shape of the logic 0 and
1 levels of the data flow, usually using an oscilloscope. It consists of overlapping
the transitions between logic 0 and 1 and measuring how wide the "eye" of the data
stream is open. Figure 4.3 shows examples of good and bad eye diagrams, focusing on
the width of the eye. In the case of XILINX technology, this test is hardwired
in the FPGA and can be run using the Vivado tool. The diagram in Figure
4.4 provides the user with two pieces of quality information: the shape of the eye and
how stable it is in the various regions (rising edge, falling edge, middle of the data,
etc.). The shape is expected to be vertically symmetric (up and down), it should be
Figure 4.1: Scheme of the transceivers architecture, showing the interconnections between
four lanes.
Figure 4.2: Screenshot from Vivado tool of the test preparation results, here showing the
interface to study the BER of the transceivers lanes. The first and second columns show
the names of the loop-back links, the third (green) the data throughput, the fourth the
total bits sent, the fifth the number of erroneous bits, the sixth the BER value. "No Link"
is a hardware-type error occurring when the cable connection is not working properly.
Figure 4.3: Examples of a good (top) and a bad (bottom) eye diagram from an oscilloscope.
Figure 4.4: One of the eye diagrams generated by Vivado during the test preparation.
The unit conversion is 2 mV per Voltage Code and one bit period per
Unit Interval (in this case 1 UI = 104 ps). The color scale gives the BER value at each
voltage/time point, from dark blue (10^-9) to dark red (3.6 x 10^-1).
wide open, with the right side more stretched because of the electronics required by
the receiver to build the data used to draw the eye diagram, shown in Figure 4.5 (and
explained in more depth in Appendix D). It is possible to choose the granularity of
the eye scan and to calculate, in terms of BER, the stability of each point of the
time-voltage plane (as an oscilloscope would do). The BER values targeted for the
eye quality are 10^-5 for the first start-up of the card, to get a quick first view
of the quality of the high-speed transceivers, and 10^-9 for a deeper investigation.
Figure 4.4 also shows the BER of the eye diagram. The eye is accepted if its
size (in units related to the tool) is at least 6300. Figure 4.6 shows
the Vivado GUI reporting it. Finally, the high-speed lane tests are also useful
to test the components connected to the FPGA transceiver pins, such as the MiniPODs
(through the on-board optical fibers to the MTP connectors). A suite of functional
tests to measure the FPGA performance and basic functionalities has been created,
using a FELiX server.
• FPGA and FLASH memory programming validation is performed by checking
the channels used to communicate with the FPGA. These channels are the
JTAG chain, which connects the FPGA internal programming system to the
Boundary Scan. The FPGA is connected to the outside via the JTAG connector,
and via dedicated channels it is linked to the FLASH memory, a
non-volatile memory able to program the FPGA automatically at power-on.
The firmware used in this test connects the card to the FELiX software,
which makes it possible to program the FLASH memory by software as well.
This validation was done before the BER test.
• micro-controller (mc) programming is done with the FLASH memory on-board
Figure 4.5: Overview of the hardware and firmware structure implementing the eye
diagram check in Vivado. CTLE = Continuous Time Linear Equalizer, DFE = Decision
Feedback Equalizer. These two blocks are used to equalize the received signal (reducing
its attenuation and distortion) and to reduce the channel loss (8 dB at the Nyquist frequency).
Figure 4.6: Screenshot of the Vivado tool showing the results of the test preparation.
The interface with the results of the eye diagram test is shown, with the names of the
links in the first column followed, going to the right, by the status and progress of the
test, the value of the open area and the configuration parameters.
controlled by an ATmega324A micro-controller, which manages the partitions
of the FLASH memory used to program the FPGA. This mc is programmed through an
on-board connector, and it is required by the FELiX firmware system. This
programming was done before the BER test.
• After performing the BER test, the FELiX software is used to communicate
with the card via PCIe and to test the high-level functionalities such as
monitoring, data transmission test and configuration.
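The eye-scan acceptance logic described earlier in this section can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the BER model `synthetic_ber` is a toy function invented for illustration, the grid size and function names are hypothetical, and the "open area" is assumed here to be the number of scan points whose BER stays at or below the target. It is not the Vivado implementation, and its units are not those of the 6300 threshold quoted in the text. Only the unit conversions (2 mV per voltage code, 1 UI = 104 ps) come from the caption of Figure 4.4.

```python
import math

MV_PER_CODE = 2.0   # from the Figure 4.4 caption: 1 voltage code = 2 mV
UI_PS = 104.0       # from the Figure 4.4 caption: 1 Unit Interval = 104 ps


def line_rate_gbps(ui_ps):
    """Serial line rate implied by the unit interval (one bit per UI)."""
    return 1000.0 / ui_ps


def synthetic_ber(t_ui, v_code, half_width_ui=0.35, half_height_codes=90.0):
    """Toy BER model (assumption): 1e-12 at the eye centre, 0.5 outside an ellipse."""
    r = math.hypot(t_ui / half_width_ui, v_code / half_height_codes)
    exponent = -12.0 * max(0.0, 1.0 - r)
    return min(0.5, 10.0 ** exponent)


def scan_eye(time_steps, voltage_steps, target_ber):
    """Scan a (time, voltage) grid and keep the points with BER <= target_ber."""
    open_points = []
    for i in range(time_steps):
        t_ui = -0.5 + (i + 0.5) / time_steps          # horizontal offset in UI
        for j in range(voltage_steps):
            v_code = -127.0 + 254.0 * (j + 0.5) / voltage_steps
            if synthetic_ber(t_ui, v_code) <= target_ber:
                open_points.append((t_ui, v_code))
    return open_points


def eye_opening(open_points):
    """Return (open area in scan points, eye width in ps, eye height in mV)."""
    if not open_points:
        return 0, 0.0, 0.0
    times = [t for t, _ in open_points]
    volts = [v for _, v in open_points]
    width_ps = (max(times) - min(times)) * UI_PS
    height_mv = (max(volts) - min(volts)) * MV_PER_CODE
    return len(open_points), width_ps, height_mv


# Quick scan (target 1e-5) and deeper scan (target 1e-9), as in the text.
quick = eye_opening(scan_eye(64, 64, 1e-5))
deep = eye_opening(scan_eye(64, 64, 1e-9))
print("line rate [Gb/s]:", round(line_rate_gbps(UI_PS), 2))
print("quick scan (area, width ps, height mV):", quick)
print("deep scan  (area, width ps, height mV):", deep)
```

A tighter BER target shrinks the region counted as open, which is why the quick 10^-5 scan is suited to a first look and the 10^-9 scan to a deeper investigation.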
The purpose of the functional tests is to demonstrate that all the basic commands
needed to control the cards through the FELiX Low Level software work. These
tests are listed below:
• card recognition, to be sure that the firmware runs well and the card can
communicate with the software suite;
• monitoring of the card status, including: current, temperature, voltage, channel
status, parameters and general information on the running firmware such as version,
release and specific features;
• configuration of the MiniPODs: the opto-electrical transducers can be config-
ured via I2C protocol from the FPGA, a necessary operation to achieve the
best performance of the optical fiber transmission;
• FLASH memory partitions programming: the FLX-712 card is used in different
conditions and with different detectors. For this reason the FLASH memory has been
split into four partitions, so that all the required firmware versions can be stored
on every card and one of the four can be selected remotely;
• FULLMode throughput test: to ensure the minimum required performance of
the PCIe communication, a FULLMode protocol transmission from the card
to the server is performed. The test is considered successful with a data rate of at
least 3500 MB/s for each of the two endpoints, for a total data stream of
7 GB/s;
• GBT performance: the data protocol used to configure the front-end chips and
also for detector data reception will be the CERN protocol GBT FPGA. This
test studies the performance of a loop-back GBT transmission and reception.
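The pass criterion of the FULLMode throughput test listed above can be written down as a minimal Python sketch. The function names are hypothetical and the convention 1 MB = 10^6 bytes is an assumption; the actual test is run by the FELiX Low Level software, whose interface is not reproduced here.

```python
MIN_ENDPOINT_MB_S = 3500.0   # per-endpoint threshold quoted in the text
N_ENDPOINTS = 2              # the FLX-712 exposes two PCIe endpoints


def throughput_mb_s(bytes_transferred, seconds):
    """Average throughput in MB/s, assuming 1 MB = 1e6 bytes."""
    if seconds <= 0:
        raise ValueError("the transfer time must be positive")
    return bytes_transferred / seconds / 1e6


def fullmode_test_passed(endpoint_rates_mb_s):
    """True if every endpoint reaches the minimum required throughput."""
    if len(endpoint_rates_mb_s) != N_ENDPOINTS:
        raise ValueError("expected one rate per endpoint")
    return all(rate >= MIN_ENDPOINT_MB_S for rate in endpoint_rates_mb_s)


# Hypothetical measurement: 36 GB moved in 10 s on each endpoint.
rates = [throughput_mb_s(36e9, 10.0), throughput_mb_s(36e9, 10.0)]
print("rates [MB/s]:", rates, "passed:", fullmode_test_passed(rates))
print("minimum total [GB/s]:", N_ENDPOINTS * MIN_ENDPOINT_MB_S / 1000.0)
```

The minimum total of 2 x 3500 MB/s reproduces the 7 GB/s figure quoted for the whole card.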
As the last validation test performed by the contractor, a BER check is run on all the
lanes to ensure that they can reach 10^-13. The BER, eye diagram and functional tests
have been prepared together with the FELiX developers. In particular, the latter have
been my original contribution to the qualification task for the ATLAS experiment.
Figures 4.7, 4.8, 4.9 and 4.10 show screenshots of the test outputs in the case
of a successful validation of an FLX-712 (the green "SUCCESS" or "OK"). When the
production and tests of the cards were completed at the contractor's site, the boards
were sent to CERN for the subsequent acceptance tests. These checks include:
Figure 4.7: Screenshot of the results of the tests on the FLX-712 pre-series production.
The recognition of the firmware is shown, obtained by searching for the cards on the PCI
bus of the FELIX server.
Figure 4.8: Screenshot of the results of the tests on the FLX-712 pre-series production.
The results of the throughput tests of the FULLMode protocol are shown, with all the
actions performed by the server during the test and with the outcome of the operation
shown in green (SUCCESS) or in red (FAILURE). Because the test result was driven
more by the firmware than by the hardware, one SUCCESS was considered enough for
the test to pass.
Figure 4.9: Screenshot of the results of the tests on the FLX-712 pre-series production.
The results of the loop-back GBT protocol stream are shown, used to test the LTDB mode
of the card. The all-green "OK" indicates the successful outcome of the test.
Figure 4.10: Picture taken during the FLX-712 pre-series tests showing the start of the
check chain, with the recognition of the firmware cards, the monitoring of voltage, current
and temperature of the FPGA and the start of the Flash memory programming, the latter
showing that the first partition was successfully programmed.
• A visual check to exclude any possible damage during the delivery. The FLX-712s
have also been visually checked to verify the correct mounting of the eight
optical fibers of the MiniPODs. We also checked the cleanliness of the cards,
the absence of broken components, the tilting of the card (a recurrent issue, mainly
due to the complex stack-up structure), the correct pad placement on the bare PCB
and, in general, the manufacturing quality.
• A jitter check of the on-board clock sources, which is fundamental considering the
structure of the ATLAS general clock distribution system, which requires the
sub-detectors to generate local low-jitter (< 10 ps) clocks from the 40 MHz
TTC clock.
• A check that the propagation time of the L1 Accept trigger signal from the
FLX-712 to the front-end is kept constant at around 500 ns.
• A validation of the data acquisition capability via the data stream stability:
a loop-back transmission/reception session of eight hours implemented using
FELiX servers of the same type that will operate in the Run 3 TDAQ.
• A check of the "busy" signal from the FLX-712, which signals buffer saturation.
• A last session of thermal cycles at 110 °C with the temperature rising at
10-15 °C/min.
All these tests were performed in different locations and using many types of
instrumentation, from baking ovens and AOI machines to a simple multimeter. The
specific setup implemented for the functional tests was a FELiX server with two
physical CPUs, customized to host and test four FLX-712 cards concurrently.
A picture of the functionality, BER and eye-diagram test setup is shown in Figure
4.11.
4.2 Commissioning Status
At the time of writing of this thesis, a total of 254 FLX-712s have been produced.
All of them underwent the acceptance tests. Only two FLX-712s are considered
faulty, with checks ongoing. Over the whole commissioning, apart from small
issues that were quickly solved, the most critical problem was the warping of the
first version of the PCB, detected in the first batch of 20 cards produced. This
occurred during the reflow operations (attaching the components to the board with
solder paste and soldering them in a temperature-controlled oven). The warping of
a PCB can be due to a poor design that does not take into account all the electrical
and mechanical components to be mounted on the board, the FPGA position (which is
based on the PCI-SIG and IPC requirements), and the balance and distribution of
the materials. The definitions of warping of a PCB are shown in Figure 4.12,
where the maximum allowable value of these deformation parameters is 0.75 %
Figure 4.11: Picture of the test setup developed for the FLX-712 production, used by the
production contractor. 1) JTAG cable (to communicate with the FPGA), 2) FLX-712, 3)
server motherboard, 4) optical cable bundling 48 channels using 48 optical fibers, 5) 4U
PC case.
Figure 4.12: Visual description of bowing (or warping) and twisting of a PCB, from IPC
methodology manuals.
(see [31] for the correct measurement methodology) in the case of PCBs designed for
Surface Mounting Devices (SMD). The design issues of the FLX-712 were probably
due to an imperfect balance of the PCB metallic materials, which created tension
in the PCB. Furthermore, the FPGA position caused the larger BGA to "absorb"
that tension and counterbalance it, leading to a PCB shape as shown in Figure 4.13.
Moreover, the FPGA solder balls, due to the increased tension, resulted
Figure 4.13: Visual description of the bowing shape of the FLX-712 before the solution
adopted to remove the warping and twisting.
in a badly shaped structure (spherical instead of elliptical, as shown in Figure 4.14).
In addition, the stiffener plate designed to stabilize the card and keep it
Figure 4.14: Visual representation of the solder balls between FPGA and PCB: the
upper part shows the shape before the warping solution, the lower part the shape
after the solution was adopted.
planar was affected by bowing too, creating additional tension. The PCB warping
issue was solved by using the stiffener plate during the reflow operation, which
prevents further bowing and twisting of the PCB and so keeps the planar shape of the
card within the specifications. In addition, the stiffener plate used in the final
version is of a different manufacture. Another issue that occurred during the first
batch tests was the spontaneous snapping-out of the most tensioned on-board optical
cables, which are connected to the MiniPODs by a plastic interlock. Due to the
mechanical design, some interlocks could snap out, either spontaneously or because of
a light pull. This was due to the excessive tension of the cable. The issue was solved
by inverting the position of the MTP connector to increase the freedom of movement of
the cable and reduce its tension.
Generally speaking, during the FELIX production, the communication between the
hardware developers from BNL, NIKHEF, CERN and the manufacturers of PCB
production and card assembly was based on periodic reports. A lesson learned was
that the exchange of technological knowledge in this type of production is very
important and can improve the performance and the durability of the hardware.
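The 0.75 % flatness limit discussed in this section can be made concrete with a small Python sketch. The two formulas are the commonly quoted IPC-style definitions (bow as deflection over edge length, twist as corner lift over twice the diagonal); they are stated here as an assumption and should be checked against the methodology in [31]. The board dimensions and deflections in the example are hypothetical, not FLX-712 measurements.

```python
import math

MAX_BOW_TWIST_PERCENT = 0.75   # limit quoted in the text for SMD boards


def bow_percent(max_deflection_mm, edge_length_mm):
    """Bow: maximum vertical deflection along an edge, relative to its length."""
    return 100.0 * max_deflection_mm / edge_length_mm


def twist_percent(corner_lift_mm, width_mm, length_mm):
    """Twist: lift of the free corner, relative to twice the board diagonal."""
    diagonal_mm = math.hypot(width_mm, length_mm)
    return 100.0 * corner_lift_mm / (2.0 * diagonal_mm)


def within_flatness_spec(bow, twist, limit=MAX_BOW_TWIST_PERCENT):
    """True if both deformation parameters are within the allowed percentage."""
    return bow <= limit and twist <= limit


# Hypothetical board: 110 mm x 270 mm, 1.2 mm bow on the long edge, 2.5 mm corner lift.
bow = bow_percent(1.2, 270.0)
twist = twist_percent(2.5, 110.0, 270.0)
print(f"bow = {bow:.2f} %, twist = {twist:.2f} %, "
      f"within spec: {within_flatness_spec(bow, twist)}")
```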







The next ten years will be crucial for the HEP studies at CERN. After the conclusion
of Run 3, the so-called Long Shutdown 3 (LS3) will aim at moving the
current technologies of the LHC accelerator and the related experiments a step forward,
empowering them to expand their research reach towards the physics beyond the Standard
Model, to refine the knowledge of the electro-weak symmetry breaking, etc. The
ATLAS Phase-II upgrade will involve all the parts of the detector, the TDAQ, the
construction of new instrumentation and the implementation of new strategies. In
the next chapter an overview of the main changes that will characterize ATLAS Run
4 in the second half of this decade is given. Then we will focus on the upgrade
of the TDAQ system and the new strategies planned to be adopted. Some of these
methodologies are still under test to prove their validity and their performance. The
aim of this second Part is to describe the work behind the proposal for an
FPGA-implemented tracking algorithm for the Phase-II ATLAS TDAQ system, specifically
for the hardware structure called Hardware Tracking for the Trigger (HTT). The
proposed tracking algorithm is a tuned version of the so-called "Hough Transform" [16],
a straight-line searching methodology.
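As a point of reference, the generic straight-line Hough Transform can be sketched in a few lines of Python. This is the textbook formulation in the (θ, ρ) parametrization, where every point votes for all the lines ρ = x cos θ + y sin θ passing through it; it is not the tuned version proposed in this thesis, and the binning, function names and test points are arbitrary choices made for illustration.

```python
import math
from collections import Counter


def hough_accumulator(points, n_theta=180, rho_step=1.0):
    """Fill the (theta bin, rho bin) accumulator with one vote per point and angle."""
    accumulator = Counter()
    for x, y in points:
        for k in range(n_theta):
            theta = math.pi * k / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            accumulator[(k, round(rho / rho_step))] += 1
    return accumulator


def best_line(points, n_theta=180, rho_step=1.0):
    """Return (theta in rad, rho, votes) of the most voted accumulator bin."""
    accumulator = hough_accumulator(points, n_theta, rho_step)
    (k, rho_bin), votes = accumulator.most_common(1)[0]
    return math.pi * k / n_theta, rho_bin * rho_step, votes


# Ten aligned points on the line x = 5 plus two unrelated "noise" points.
points = [(5.0, float(y)) for y in range(10)] + [(1.0, 7.0), (8.0, 2.0)]
theta, rho, votes = best_line(points)
print(f"theta = {math.degrees(theta):.1f} deg, rho = {rho:.1f}, votes = {votes}")
```

The aligned points all fall into the same accumulator bin, while the noise points scatter their votes, which is the property that makes the method attractive for finding track candidates among many unrelated hits.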
5.1 New Goals of the Experiment
The end of Run 3 is planned for 2024. In this Run the performance of the LHC will be
pushed to and beyond its structural limits in terms of peak luminosity, pile-up,
p-p collision energy, etc., as shown in Section 1.3. LS3 aims at a step forward in
sensors, hardware, firmware, software and strategies, to reach parameter values
up to one order of magnitude higher: for example, the total integrated luminosity
is planned to reach at least 3000 fb^-1, compared with the 300 fb^-1 of Run 3. The
ATLAS Phase-II physics research objectives will include:
• Precision measurements of the properties of the Higgs boson, for example its
couplings to fermions or its self-coupling;
• Precision Standard Model measurements, such as the top mass and cross-section;
60 CHAPTER 5. NEXT ATLAS DETECTOR
• Searches for physics Beyond the Standard Model, such as Supersymmetry or
long-lived particles;
• Flavour physics, such as rare B-meson decays;
• Heavy-Ion Physics.
The detection strategy will be the same as in today's experiment, with the same
type of sub-detectors placed at the same distance from the interaction point. The
|η| coverage will slightly change: the innermost detector (Inner Tracker) will
sample data up to |η| = 4, the Muon Spectrometer will receive new RPCs allowing
it to reach |η| < 1, and the new High Granularity Timing Detector is planned to cover
2.4 < |η| < 4.0.
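To give a feeling for the coverage values quoted above, the standard definition of pseudorapidity, η = -ln tan(θ/2), with θ the polar angle from the beam axis, can be inverted to obtain the corresponding angles. The Python sketch below is only illustrative and its function names are arbitrary.

```python
import math


def pseudorapidity(theta_rad):
    """Pseudorapidity for a polar angle theta measured from the beam axis."""
    return -math.log(math.tan(theta_rad / 2.0))


def polar_angle_deg(eta):
    """Polar angle (in degrees) corresponding to a given pseudorapidity."""
    return math.degrees(2.0 * math.atan(math.exp(-eta)))


# Coverage limits mentioned in the text.
for eta in (1.0, 2.4, 2.7, 4.0):
    print(f"|eta| = {eta}: theta = {polar_angle_deg(eta):.2f} deg from the beam axis")
```

A coverage up to |η| = 4 means detecting particles emitted at only about two degrees from the beam line, which explains the need for the inclined and end-cap pixel layers described in the next section.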
5.2 Inner Tracker
The core of the next ATLAS detector will be the Inner Tracker (ITk), a double
detector with the same purpose as the current Inner Detector. ITk has been designed
to give a high track reconstruction efficiency and a low rate of fake tracks, meaning:
an efficiency over 99 % for muons above 3 GeV, and an efficiency over 85 % for pions
and electrons above 1 GeV out to |η| = 2.7, while keeping fake rates below 1
%. The performance is planned to be robust against a 10 % loss of channels or
modules. The ATLAS core scheme is shown in Figure 5.1, while Figure 5.2 shows the
Figure 5.1: Scheme of the ITk detector built by software simulation.
schematic layout of the silicon-only detector. It will adopt an "inclined layout", which
makes it possible to cope with a pile-up of 200 and to cover up to |η| = 4 at the same
time, while maintaining a good track efficiency. Starting from the outside, the Strip
Detector will have four barrel
Figure 5.2: Scheme of the active areas of ITk in the ATLAS coordinate system, with the
radius from the beam pipe on the vertical axis and the Z coordinate, parallel to the
beam line, on the horizontal axis.
layers and six petal-designed end-cap disks covering |η| < 2.7. The Pixel Detector
will have five flat layers, five inclined layers and five end-cap layers, together allowing
|η| < 4.
5.2.1 Pixel Detector
The idea of the "Hybrid Pixel Detector", exploited with the Insertable B-Layer upgrade
of Phase-0, will be maintained in the next ATLAS Pixel Detector [13]. An instantaneous
luminosity of 7.5 · 10^34 cm^-2 s^-1, leading to 3000/4000 fb^-1 of integrated
luminosity with a pile-up of 200, sets the parameters that drive the new PD design.
Starting from these numbers, the innermost technology should
withstand a radiation dose of 9.9 MGy. The architecture strategy will be the same as
for the Insertable B-Layer, with the innermost mechanical structure not considered
necessary to support the beam pipe, so that the outermost section and the pixel
package should not rely on it. The three outer PD layers, the barrel and the end-caps
have been shown to be able to achieve a rate of 4 MHz of data events, while the two
innermost PD layers have been limited to 1 MHz because of interconnections. The
architecture of this sub-detector is designed following an "Inclined Duals" layout,
with 3 regions: a flat barrel, an inclined section at more or less the same radius as
the flat one, and an end-cap region. Tables 5.1, 5.2 and 5.3 list the main geometric
parameters. The "Inclined Duals" layout will allow 9 pixel hits for |η| > 2.7 and an
active size of the pixel read-out chip of 19.2 mm x 20 mm, and will require a total
pixel surface of 12.74 m^2. The sub-detector modules ("hybrid pixel modules")
consist of a passive high-resistivity silicon sensor, read out by a front-end chip
fabricated in CMOS technology. The sensor layer and the read-out layer together are
called the bare module, which is mounted on a flexible PCB named module flex. The
module flex is
Barrel Layer   Radius [mm]   Rows of sensors   Sensors per Row   Type     Hits
0              39            16                6                 duals    1
1              99            20                6                 quads    1
2              160           30                11                quads    1
3              220           40                12                quads    1
4              279           50                13                quads    1
Table 5.1: Configuration of the barrel layer of the Phase-II Pixel Detector.

Barrel Layer   Radius [mm]   Sensors per Row   Angle [deg]   Type      Hits
0              36            16                75            singles   2-3
1              80            13                75            quads     2-3
2              155           11                56            quads     1
3              215           13                56            quads     1
4              274           13                56            quads     1
Table 5.2: Configuration of the inclined barrel region of the Phase-II Pixel Detector.

End-cap Layer   Radius [mm]   Rings   Sensors per Ring   Type    Hits
0               50            4       16                 quads   3
1               78            11      22                 quads   3-4
2               152           10      32                 quads   2
3               211           8       44                 quads   1
4               271           9       52                 quads   1
Table 5.3: Configuration of the end-cap region of the Phase-II Pixel Detector.
the connection between the active elements (shown in red in Figure 5.2) and the
bare modules, where the flex is glued to the elements. The module flex is planned
to interface with the active elements via dedicated lines such as
clock, command inputs, data output, and low and high voltage power. The individual
pixel size will be 50 x 50 µm^2 or 25 x 100 µm^2 to improve the resolution. The
read-out chips should reach a radiation tolerance of 1.4 x 10^16 neq/cm^2 to withstand
the radiation damage. The high pixel rate in the innermost detector justifies the
output bandwidth of up to 5.12 Gb/s per front-end chip. A serial powering scheme will be
adopted to reduce the number of cables and the amount of inactive material required
in the tracker. There will be 3 types of module configurations: single, dual
and quad, with respectively one, two and four front-end chips bump-bonded to a
single sensor.
The two pixel sensor technologies in R&D for the Pixel Detector are 3D and
planar pixel sensors. The first, with a thickness of 230 µm, will be used in the
innermost region because of the high radiation level to withstand, 9 x 10^15 neq/cm^2.
An important goal reached thanks to its structure was a hit reconstruction efficiency
of 97 % with a power dissipation of 10 nW/m^2, after irradiation with 24 GeV protons
up to 1.4 x 10^16 neq/cm^2. All these and other improvements are expected thanks to the
smaller electrode distance and the consequent reduced trapping. The planar sensor
designed has an n-in-p configuration, with an achieved hit efficiency of 97.5 % at a
thickness of 100 µm and a radiation resistance up to 10^16 neq/cm^2. Important
strategies were adopted for this technology, such as the control of the leakage current,
the reduction of the sensor thickness and the extension of the depletion region up to
the edges by using implanted and
diffused vertical sides. Figure 5.3 shows a graphic representation of the 3D and
Figure 5.3: Visual description of the 3D (on the left) and planar (on the right) pixel sensor
technologies.
planar technologies, where the 3D one, on the left, differs from the planar one in
having the collecting columns and the electrodes placed perpendicular to the surface.
This permits a good charge collection even in high-radiation areas. Regarding the
front-end CMOS technology, a 65 nm process has been targeted, with a required radiation
tolerance of 5
MGy for 4000 fb^-1 of total absorbed dose, with a required limit of almost 10 MGy for
2000 fb^-1 for the inner replaceable layers. We assume that during the entire Phase-II
campaign there will be at least one replacement of the two inner layers. Moreover,
the front-end will have to be highly resistant to single event upsets. Table 5.4 lists
Layer/Ring   Data rate (1 MHz L0)   Data rate (4 MHz L0)   Design data rate
             (Gb/s)                 (Gb/s)                 per FE chip (Gb/s)
Layer 0      3.97                   -                      5.12
Layer 1      0.89                   -                      2.56
Layer 2      0.52                   2.08                   5.12
Layer 3      0.32                   1.28                   2.56
Layer 4      0.22                   0.88                   1.28
Ring 0       2.15                   -                      5.12
Ring 1       1.07                   -                      2.56
Ring 2       0.65                   2.60                   5.12
Ring 3       0.39                   1.56                   2.56
Ring 4       0.27                   1.04                   1.28
Table 5.4: List of the data rates of the Phase-II Pixel Detector for some layers and rings.
the data rates for the Pixel Detector for 1 and 4 MHz Level 0 trigger acceptance
rate.
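The headroom left by the design data rates of Table 5.4 can be checked with a few lines of Python. The sketch below compares, for each layer or ring, the highest rate listed in the table with the design rate per front-end chip. Treating the "-" entries as "not applicable" and taking the highest listed rate as the relevant one are assumptions made for illustration; the numbers are copied from the table.

```python
# (rate at 1 MHz L0, rate at 4 MHz L0 or None, design rate per FE chip), in Gb/s
TABLE_5_4 = {
    "Layer 0": (3.97, None, 5.12),
    "Layer 1": (0.89, None, 2.56),
    "Layer 2": (0.52, 2.08, 5.12),
    "Layer 3": (0.32, 1.28, 2.56),
    "Layer 4": (0.22, 0.88, 1.28),
    "Ring 0": (2.15, None, 5.12),
    "Ring 1": (1.07, None, 2.56),
    "Ring 2": (0.65, 2.60, 5.12),
    "Ring 3": (0.39, 1.56, 2.56),
    "Ring 4": (0.27, 1.04, 1.28),
}


def link_utilization(table):
    """Fraction of the design rate used by the highest listed rate of each entry."""
    utilization = {}
    for name, (rate_1mhz, rate_4mhz, design) in table.items():
        worst_case = rate_1mhz if rate_4mhz is None else max(rate_1mhz, rate_4mhz)
        utilization[name] = worst_case / design
    return utilization


utilization = link_utilization(TABLE_5_4)
for name, fraction in utilization.items():
    print(f"{name}: {100.0 * fraction:.1f} % of the design data rate")
```

With these inputs every entry stays below its design rate, with Ring 4 and Layer 0 being the closest to the limit.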
The RD53 collaboration is a project supporting the ATLAS and CMS Pixel
Detector upgrades. It is working on the R&D for the front-end chip requirements
and has already produced the RD53A in small quantities. The RD53B chip is still
under development, with libraries to be applied to the ATLAS or CMS physics case.
The RD53A, schematized in Figure 5.4, currently mounts three technologies of
Figure 5.4: RD53A conceptual scheme.
front-end amplifiers under characterization, and it will communicate with a custom
Aurora 64b/66b protocol on four lanes at 1.28 Gb/s each. A value larger than 100
hits/25 ns was the target. Figure 5.5 shows the floorplan of the functional view of the
Figure 5.5: RD53B conceptual scheme.
RD53B. To communicate off-chip, the front-ends of the whole PD will use optical fibers
and two chips: an Aggregator and an Equalizer. The direct output connection of the
front-ends will be electrical, because it is more radiation tolerant. This connection
will go to the opto-electrical board, which will transmit to and receive from
the future FELiX DAQ cards, where the TDAQ chain starts. Figures 5.6 and 5.7
Figure 5.6: Logical scheme of the interconnection between the front-ends, in two modes
(single module and quad module), the aggregator and the off-detector side, for the
Phase-II Pixel Detector.
show, respectively, the front-end (FEs) output data merging and configuration to-FE
Figure 5.7: Logical scheme of the interconnections between the off-detector and the module
configuration and clock system, for the Phase-II pixel Detector.
links. The chip configuration and the clock will be distributed by the new low-power
GigaBit Transceiver (lpGBTx) chip.
5.2.2 Strip Detector
Together with the Pixel Detector, the other type of infrastructure that will be used
in the ATLAS innermost detector is the Strip Detector [8]. An active silicon sensor
area of 165 m^2 will be connected to the front-end micro-electronics. It
will be formed by several rows of strips with a pitch of 75.5 µm in the barrel region
and from 69 µm to 85 µm in the disk region. Roughly 60 million electronic channels
will acquire and transmit data via the 14 modules per stave per side. The disks
are planned to have 9 modules in a petal arrangement. The sensing elements designed
are high-resistivity n-in-p planar silicon strips, required to withstand a fluence of
1.2 x 10^15 neq/cm^2 and a total ionizing dose of 50 MRad. The front-end technology
targeted is 130 nm CMOS, based on the same architectures as the Inner Detector. In
particular the ATLAS Binary Chip (ABCStar) and the Hybrid Controller Chip
(HCCStar) will be used. Figure 5.8 shows a scheme of the strip module. The petals
and staves will talk to the off-detector area via optical fibers driven by the low-power
GigaBit Transceiver (lpGBTx), a radiation-hard CERN ASIC and an evolution of the GBTx
capable of reaching 10.24 Gb/s.
5.3 High Granularity Timing Detector
Figure 5.9 indicates where the High Granularity Timing Detector (HGTD) [14]
is intended to be placed. This new detector has the fundamental purpose of
increasing the precision of the luminosity measurements. Thanks to the strategic
position of this detector, it will be possible both to measure the online luminosity
bunch by bunch during HL-LHC running and to enhance the high-precision sampling of the
integrated luminosity. This increase in the luminosity resolution is decisive for the
Higgs couplings survey. The HGTD will augment the spatial and time performance
Figure 5.8: 3D view of the Phase-II Strip Detector module.
Figure 5.9: 3D view of the new HGTD detector with its position in the future ATLAS
structure.
of ITk with a 30 ps time resolution for minimum ionizing particles going through
the innermost detector. A 50 mm moderator is planned to be placed between the
HGTD and the end-cap/forward calorimeter region to protect the HGTD and ITk from
back-scattered neutrons. The custom front-end ASIC ALTIROC is being developed
and, according to current plans, will be bump-bonded to the silicon sensor, now in
the R&D phase. This ASIC will provide high time and spatial resolution, radiation
hardness, and important operations such as:
• counting the number of hits registered in the sensor;
• 40 MHz transmission to allow unbiased, bunch-per-bunch measurements of the
luminosity;
• coping with the minimum-bias trigger.
The HGTD end-cap will integrate one hermetic vessel, two instrumented double-sided
layers mounted on two cooling/support disks, and two moderator pieces, inside and
outside the hermetic vessel. The HGTD detecting region will cover
2.4 < |η| < 4.0. Figure 5.10 summarizes the HGTD main parameters. The sensors
Figure 5.10: HGTD main geometrical parameters, hits per track and time resolution.
designed are a pioneering technology developed in Barcelona (Centro Nacional de
Microelectronica) called Low Gain Avalanche silicon Detector (LGAD), an n-on-p silicon
sensor with an extra doped p-layer below the n-p junction, which creates a high field
providing internal amplification. Figure 5.11 shows a scheme of the modules planned for
the detector.
Figure 5.11: 3D view of the HGTD module.
5.4 Calorimeter
The Liquid Argon Calorimeter (LAr) [10] will not undergo a deep upgrade. By
contrast, infrastructure such as the read-out electronics and the low-voltage powering
system will be updated to overcome technology limitations and obsolescence.
Along with the upgrades of Phase-I, the LAr will improve its readout performance
thanks to new boards in the on-detector region: the FEB2 (Front-End Board 2)
and the Calibration Board, which respectively will manage the analog processing and
inject calibration signals. On the off-detector side, a new board, the LAr Signal
Processor Board (LSPB), will be used to transmit data to the DAQ structure. Its
purpose will be to digitize the FEB2 information and to apply digital filtering to the
signals of each LAr calorimeter cell. The technology used will be a custom ASIC for
the FEB and an FPGA for the LSPB.
The Tile Calorimeter (TileCal) [11] will be the central region of the hadronic
calorimeter, and its position and role will be the same as in the previous runs.
Figure 5.12 shows the scheme and position of the Tile Calorimeter in the ATLAS
Phase-II scenario. The sub-detector will capture roughly 30 % of the jet energy
and will remain of crucial relevance in jet and missing energy measurements,
jet substructure, electron isolation and triggering. The TileCal is built with steel
absorbers and 460,000 plastic scintillator plates, read out by wavelength-shifting
fibers. The fibers are bundled in cells and read out by photo-multiplier tubes, which
extract data from the 4670 cells two at a time. The detector is separated into three
sectors, "A", "BC" and "D", with respectively 1.4, 3.9 and 1.8 interaction lengths at
η = 0. The η x φ granularity is approximately 0.1 x 0.1.
Figure 5.12: Scheme showing the Calorimeter system in the ATLAS Phase-II detector.
LB = Long Barrel, EB = Extended Barrel, both divided in A and C.
5.5 Muon Spectrometer
The Muon Spectrometer (MS) [9] of the present experiment has achieved excellent
performance in muon identification, tracking and momentum resolution. The Phase-II
MS aims to preserve this level of performance in higher luminosity conditions, by
designing a more selective trigger while keeping the low pt thresholds necessary for
many physics channels. A further extension of the physics reach could be the study
of muons at large pseudorapidity (|η| < 4), made possible by the extended ITk
coverage. These goals will be reached by upgrading almost all the infrastructure
into something similar to the scheme in Figure 5.13. On the trigger and read-out
side, new technologies and algorithms will be used to completely re-design the
trigger and read-out of the RPCs and TGCs, for example, in addition to those of the
MicroMegas (MM) and sTGCs. The trigger primitives will also acquire data from the
TGC hits to provide the end-cap trigger candidates. The MDT electronics will be
completely replaced in order to sharpen the turn-on curves of the high-pt triggers,
and thus reduce the number of low-pt muons passing the selection. The MDT precision
coordinates will be used in the Level-0 trigger to improve the quality of the TGC,
NSW and RPC trigger candidates. This will be made possible by a new MDT read-out.
The NSW read-out should support a higher (1 MHz) rate at Level-0, requiring an
upgrade in the future. At detector level, the MDTs, RPCs and TGCs will undergo
several changes in technology and parameters.
Figure 5.13: Scheme of the Phase-II MS active areas and layout.
Chapter 6
TDAQ
The ATLAS Phase-II TDAQ [12] will be designed for a HL-LHC configuration capable
of up to 3000/4000 fb−1 integrated luminosity at the end of Run 4. The increased
calorimeter granularity and the efficiency improvement targeted for muon-based
triggers, together with the extended coverage of the ITk coupled with hardware-based
tracking, will allow trigger rates a factor of 10 higher than in Run 3.
Three architectures for the TDAQ structure are under study in order to find the
best solution: the "baseline", "evolved" and "variant" scenarios.
6.1 TDAQ Baseline
Figure 6.1 shows the scheme of the "baseline" scenario. Three levels are used in
this solution:
• the Level-0 Trigger system (L0), shown in Figure 6.2. Its sub-systems are the
L0 Calorimeter Trigger (L0Calo), the L0 Muon Trigger (L0Muon), the Global
Trigger and the Central Trigger Processor (CTP). The L0Calo is composed of
the electron Feature EXtractor (eFEX), the jet Feature EXtractor (jFEX) and
the global Feature EXtractor (gFEX), complemented by the forward Feature
EXtractor (fFEX). These electronic cards provide the hardware for
electron-based, jet-based, large-R-jet-based and, again, electron-based triggers
respectively, the latter for the forward region 3.2 < |η| < 4.0. The L0Muon
sub-system comprises the barrel (RPC), end-cap (TGC), NSW and MDT trigger
processors. The Global Trigger uses the high-granularity calorimeter data to
run offline-like algorithms, refining the L0Calo and L0Muon information,
calculating event-level quantities and applying topological algorithms. The
CTP drives the Trigger, Timing and Control (TTC) network, forms the trigger
decision based on the Global Trigger and other sources, applies pre-scale
factors and introduces dead-time when necessary to avoid readout and
front-end saturation;
• the Data Acquisition (DAQ), shown in Figure 6.3. The L0 output is sent to
Figure 6.1: ATLAS Phase-II TDAQ system scheme.
Figure 6.2: Level-0 trigger system of the ATLAS Phase-II TDAQ.
Figure 6.3: Scheme of the Data Acquisition system of the ATLAS Phase-II TDAQ.
all detectors at 1 MHz. The Readout subsystem (Readout) uses the FELIX
environment to interface with the Data Handler components, while the Dataflow
subsystem (Dataflow) contains the Event Builder, the Storage Handler and the
Event Aggregator engines. A PC farm is used for the DAQ;
• the Event Filter (EF), shown in Figure 6.4, composed of a CPU processing
Figure 6.4: Scheme of the ATLAS Phase-II TDAQ Event Filter system.
farm and of the Hardware-based Tracking for the Trigger (HTT). The EF refines
the trigger selection to reach the final supported rate, with a target of 10 kHz.
The HTT includes a regional HTT (rHTT) and a global HTT (gHTT) for tracking.
The EF is the final stage before the storage operations.
The hardware technology planned for the baseline scenario is mostly based on the
ATCA (Advanced Telecommunications Computing Architecture) bus. The cards
will be controlled and configured by a System-on-Chip over a standard network,
with a DCS (Detector Control System) interface. Table 6.1 lists the size of the
ATLAS Phase-II TDAQ hardware. The Level-0 has specific requirements for each
detector. The DAQ specifications can be summarized as:
• capability to handle data input from the detectors of up to 5.2 TB/s;
• Readout and Dataflow shall handle an average data rate of 5.2 TB/s at the
detector input and up to 2.6 TB/s towards the EF, while always being able to
sustain high peaks of throughput without large congestion or data loss;
Table 6.1: ATLAS TDAQ Phase-II size. GEP = Global Event Processor, MUX = Multi-
plexer Processor, SL = Sector Logic, CTPMI = CTP Machine Interface, CTPIN = CTP
Inputs, LTI = Local Trigger Interface.
• Readout links should handle the various protocols and bandwidths through
a single common interface with the front-ends. Control and configuration data
shall travel through the radiation-hard low-power GigaBit Transceiver (lpGBT)
protocol, with a 10.24 Gb/s bandwidth;
• the common ATLAS DCS subsystem should be interfaced with the
detector-specific control and configuration procedures;
• it should be able to sustain a 48-hour run without saving data to offline
storage. This means that, at an average throughput of 60 GB/s of uncompressed
data, a storage volume of about 10 PB is required.
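The storage figure in the last requirement follows from simple arithmetic. The short Python sketch below reproduces it, assuming decimal units (1 PB = 10^6 GB), which the text does not state explicitly; the function name is purely illustrative.

```python
def buffer_volume_pb(throughput_gb_per_s: float, hours: float) -> float:
    """Storage needed to buffer `hours` of data at a constant throughput.

    Decimal units are assumed (1 PB = 1e6 GB); this is an assumption of the
    sketch, not a statement from the TDAQ specification.
    """
    seconds = hours * 3600.0
    return throughput_gb_per_s * seconds / 1.0e6


volume = buffer_volume_pb(60.0, 48.0)
print(f"{volume:.2f} PB")  # 60 GB/s sustained for 48 h gives about 10.37 PB
```

The result is slightly above 10 PB, consistent with the rounded value quoted in the requirement.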
The EF will be required to:
• sustain a maximum input rate of 1 MHz and select a 10 kHz output;
• reconstruct vertices and tracks at will, unlike the Run 1 and Run 2 TDAQ;
• have reconstruction algorithms whose processing time scales linearly with
pile-up, up to <µ> = 200, achievable with today's technologies;
• reduce the Dataflow and Storage Handler rate for the entire event to 400 kHz.
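The rates listed above imply a chain of reduction factors. As a minimal sketch (the constant and function names are illustrative, and the split into two stages simply follows the 1 MHz, 400 kHz and 10 kHz figures quoted in the requirements), the factors can be computed as:

```python
EF_INPUT_HZ = 1.0e6      # Level-0 accept rate into the EF
REGIONAL_HZ = 400.0e3    # rate after the first reduction
EF_OUTPUT_HZ = 10.0e3    # final EF output rate


def reduction_factor(rate_in_hz: float, rate_out_hz: float) -> float:
    """Ratio between the input and output rates of a selection stage."""
    return rate_in_hz / rate_out_hz


first_stage = reduction_factor(EF_INPUT_HZ, REGIONAL_HZ)
second_stage = reduction_factor(REGIONAL_HZ, EF_OUTPUT_HZ)
overall = reduction_factor(EF_INPUT_HZ, EF_OUTPUT_HZ)
print(first_stage, second_stage, overall)  # 2.5 40.0 100.0
```

The overall factor of 100 is the product of the two stages, so most of the rejection is left to the second, full-event stage.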
In the TDAQ structure, complete control of the latency is crucial. The bottom-up
estimate of the full Level-0 Trigger latency is dominated by the L0Muon, since it
will perform precise tracking using MDT data, which have long drift times. A
latency budget of 10 µs is estimated for the overall TDAQ at the output of the
FELIX optical links. Assuming the worst case of the longest possible fiber length
at the FELIX input, the Current Best Estimate (CBE) of the TDAQ latency is
∼6.9 µs, within the Maximum Possible Value (MPV) of 10 µs.
The EF will have to accept 1 MHz of data and output 10 kHz of selected events;
the first reduction, from the 1 MHz L0 hardware trigger rate to 400 kHz, is made
with the regional HTT. Then the EF processors reconstruct the event with algorithms
similar to the offline ones, exploiting the global HTT capabilities. The rHTT will
make its decision with ∼10 % of the ITk data, corresponding to the most relevant
Regions of Interest (RoIs) related to the L0 trigger. If this stage is passed and
global tracking is required, all ITk layers will be read out. The architecture
chosen to implement this strategy is based on two specific technologies: a 28 nm
Application Specific Integrated Circuit (ASIC) and latest-generation (14 nm
technology) Field Programmable Gate Arrays (FPGAs). The first, for which prototypes
already exist, is a custom chip developed as an Associative Memory for fast pattern
recognition, while the second is used for track reconstruction and fitting. However,
the HTT is still under study, and the possibility of using other technologies, such
as a CPU farm, a GPGPU farm or a machine learning infrastructure, is not excluded.
Together with the HTT, the EF will include, as shown in Figure 6.1, a Processor
Farm which will reduce the data rate from 1 MHz to 400 kHz with the help of the
rHTT. Figure 6.5 shows a scheme of the HTT units. The first stage is occupied by
the Associative
Figure 6.5: Phase-II TDAQ HTT logic scheme.
Memory ASICs, which store pre-computed patterns of clusters (a cluster being a
group of hits from the detector) used to find candidate tracks (called "roads" at
this stage, as explained later). The input data from the ITk layers will first go
through a clustering operation, which finds the center of gravity of a set of pixels
or strips, generating clusters from the hits. The clusters selected by the AM ASICs
will then go through a linearized track-fitting algorithm to extract the track
parameters. The fit processes the selected clusters, using the same ITk layers as
the AM ASICs. Each candidate track is then extrapolated, and finally a complete
fit using all the ITk layers is performed to improve the track parameters. This
second stage is FPGA-based. The HTT and the Processor Farm communicate via the
HTT InterFace (HTTIF), which interfaces with the HTT main board, the Tracking
Processor (TP). The HTTIF receives information and data from the EF and the ITk
and sends back the final tracks of the second stage using the ATCA bus. The TP is
an ATCA card which will be assembled as one of two different boards: either the
Associative Memory Tracking Processor (AMTP) or the Second Stage Tracking
Processor (SSTP). Different firmware will handle the first and the second stage.
Table 6.2 shows a summary of the HTT project.

Parameter                              Value
rHTT minimum track pt                  2 GeV
rHTT input rate                        1 MHz (∼10 % of ITk data)
gHTT minimum track pt                  1 GeV
gHTT input rate                        100 kHz
Number of HTTIF                        48
Number of ATCA shelves for AMTPs       48
Total number of AMTPs                  576
Total number of AM chips               18432
Number of ATCA shelves for SSTPs       8
Total number of SSTPs                  96
Power estimate per TP                  300 W

Table 6.2: Summary of the HTT project for the baseline scenario.

The AMTP card will be an ATCA board equipped with mezzanine cards called Pattern
Recognition
Mezzanine (PRM). These will host 20 AM ASICs each. The SSTP core will host
two Track-Fitting Mezzanines (TFMs), which perform the core second-stage fitting
algorithm. The PRM will mount only the AM ASICs, with a direct connection from
the TP to the ASICs, while the TFM will be based only on FPGAs. The baseline
scenario proposes a solution including six AMTPs and one SSTP covering one RoI of
the detector. The detector volume studied by a single HTT block, consisting of seven
ATCA cards, is divided into RoIs. Each RoI is composed of the ITk elements crossed
by tracks with parameters η and φ within a bin of ± 0.2 in η and φ, pt above 2 GeV
for the rHTT and above 1 GeV for the gHTT, |z0| < 15 cm and |d0| < 2 mm (where d0
is the transverse impact parameter, given by the distance of closest approach in the
transverse plane, and z0 is the z-position of the track's closest approach to the
beam pipe). The processing in the HTT block starts with the hits from
the pixels, which are turned into clusters by finding the center of gravity of the
charge released in a set of pixels or strips. These clusters are then mapped onto
entities called superstrips, groups of consecutive silicon strip or pixel channels,
each labeled with a SuperStrip IDentifier (SSID). After this, the pattern
recognition operation compares the eight superstrips, one from each layer, with sets
of pre-defined patterns derived from the simulation of training muons. Each pattern
is composed of eight superstrips, one per layer. If a set of SSIDs matches a pattern
in a specific pattern bank (a set of eight or fewer patterns), and a pattern bank
represents a so-called "road", then this road and the SSIDs are sent to the last
stage of selection.
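The chain described so far (clustering by center of gravity, mapping to superstrips, comparison with stored patterns) can be illustrated with a small software sketch. It is only a toy model under stated assumptions: the superstrip width, the channel numbering, the function names and the requirement of an exact match on all eight layers are hypothetical choices made for illustration, not the AM ASIC behaviour.

```python
from typing import Dict, List, Set, Tuple

N_LAYERS = 8
SUPERSTRIP_WIDTH = 16  # channels per superstrip (hypothetical value)


def center_of_gravity(hits: List[Tuple[int, float]]) -> float:
    """Charge-weighted mean channel of a group of (channel, charge) hits."""
    total_charge = sum(charge for _, charge in hits)
    return sum(channel * charge for channel, charge in hits) / total_charge


def superstrip_id(position: float) -> int:
    """Coarse identifier (SSID) of the superstrip containing a cluster."""
    return int(position // SUPERSTRIP_WIDTH)


def matching_patterns(event: Dict[int, Set[int]],
                      bank: Dict[int, Tuple[int, ...]]) -> List[int]:
    """IDs of the patterns whose SSID is present in the event on every layer."""
    return [pid for pid, pattern in bank.items()
            if all(pattern[layer] in event.get(layer, set())
                   for layer in range(N_LAYERS))]


# One two-channel cluster per layer (toy data).
raw_hits = {layer: [(16 * layer + 3, 1.0), (16 * layer + 4, 3.0)]
            for layer in range(N_LAYERS)}
event = {layer: {superstrip_id(center_of_gravity(hits))}
         for layer, hits in raw_hits.items()}

bank = {0: (0, 1, 2, 3, 4, 5, 6, 7),   # matches the toy event
        1: (0, 1, 2, 3, 4, 5, 6, 9)}   # differs on the last layer
print(matching_patterns(event, bank))  # [0]
```

In hardware the comparison with all stored patterns happens in parallel inside the associative memory; the loop over the bank here only mimics the result of that comparison.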
A road is a region, selected on the basis of simulations, performance studies and
physics goals, to be further investigated by the tracking algorithms, because it
could contain one of the pre-defined trigger tracks. The track fitting, implemented
on FPGA in the first stage in the PRM (including only a few ITk layers) and in the
second stage in the TFM, is based on the linear relation
\[ p_i = \sum_j C_{ij}\, x_j + q_i \tag{6.1} \]
where $p_i$ represents the track parameters pT, φ0, φ, η, d0 and z0. $p_i$ depends on
the full-resolution local cluster coordinates $x_j$, while the constants $C_{ij}$ and
$q_i$ are specific to each sector. One sector includes one module of each layer,
combined for all eight layers. The eight PD layers are used in the first step of
the process. The quantity
\[ \chi^2 = \sum_i \Big( \sum_j A_{ij}\, x_j + k_i \Big)^2 \]
represents the quality of the fit. It is a χ2 and depends on two further sets of
coefficients, $A_{ij}$ and $k_i$, which are evaluated per sector. To give an idea of
their number, a 0.2 × 0.2 (in η and φ) region requires about 40 million
coefficients. These two quantities, the track parameters and the χ2, represent the
keys to pass through the HTT internal steps and start concluding the TDAQ chain.
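As a minimal sketch of the linearized fit, the Python fragment below evaluates Eq. 6.1 and a χ2 of the form χ2 = Σ_i (Σ_j A_ij x_j + k_i)^2, which is taken here as the assumed form of the quality estimator. The constants in the toy example are invented for illustration; real constants are derived per sector from simulation.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def affine(matrix: Matrix, offset: Vector, x: Vector) -> Vector:
    """Evaluate matrix @ x + offset with plain Python lists."""
    return [sum(m * xj for m, xj in zip(row, x)) + o
            for row, o in zip(matrix, offset)]


def track_parameters(C: Matrix, q: Vector, x: Vector) -> Vector:
    """Linearized estimate p_i = sum_j C_ij x_j + q_i (Eq. 6.1)."""
    return affine(C, q, x)


def chi2(A: Matrix, k: Vector, x: Vector) -> float:
    """Fit quality, assumed to be sum_i (sum_j A_ij x_j + k_i)^2."""
    return sum(residual * residual for residual in affine(A, k, x))


# Toy sector with three cluster coordinates (all numbers are hypothetical).
x = [1.0, 2.0, 3.0]
C = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
q = [0.5, -1.0]
A = [[1.0, -2.0, 1.0], [0.0, 1.0, -1.0]]
k = [0.0, 1.5]

print(track_parameters(C, q, x))  # [1.5, 4.0]
print(chi2(A, k, x))              # 0.25
```

Both operations are sums of products with fixed constants, which is why they map naturally onto the DSP resources of an FPGA.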
6.2 HTT Evolved (L1-Track)
The solution just described can be used in the Run 4 physics environment provided
two conditions hold: the hadronic trigger rate and the occupancy of the inner Pixel
Detector layers must not exceed the expected values. However, the TDAQ
hardware-based trigger infrastructure has been designed to evolve should either
condition fail. The new "evolved" solution is based on the inclusion of a new
trigger level, Level-1 (L1), which shall process regional data from the strips and
the outer Pixel Detector layers. This additional stage would allow a preliminary
vertex selection, translating into a higher hadronic background rejection and
consequently a
lower readout rate for the inner layers. The requirements that would differ
between the two scenarios are the following:
• the regions of interest for tracking shall be provided by the Global Trigger;
• the creation and transmission of regional data should be driven by the Level-1
hardware-based trigger. For the Strip Detector, the regional data are produced
by the front-end and sent by the FELIX system; for the Pixel Detector, they are
extracted from the Level-0 data stream and again sent by the FELIX system;
• the HTT hardware should be reduced;
• depending on the L0 acceptance, regional tracks with pt > 4 GeV shall be
processed by the Level-1 trigger;
• the trigger rates and the rejection factors shall be updated. The processing
blocks should be reconfigured as follows:
− a rejection factor of 2.4/7 to allow an 800/600 kHz L1 rate;
− based on the L1 acceptance, the remaining ITk data should be read out at
800/600 kHz assuming an L0 acceptance of 2/4 MHz;
• the total latency of the L1 hardware-based trigger shall not exceed 30/35 µs in
the 4/2 MHz L0 acceptance scenario.
Figure 6.6 shows the HTT evolved scheme. Some sub-detectors would undergo small
modifications. The ITk PD would require a higher readout rate of 4 MHz for layers
2 to 4, where the data selection for L1 would be done off-detector in the Readout
system, based on Global Trigger information. The ITk Strip Detector should reach 4
MHz at L0 without changes. The NSW behaviour in this scenario is under study, and
one of the most relevant options would be the readout of the sub-detector at L1,
without causing any data loss. The whole of ATLAS would be affected and would need
different solutions in all its parts in the evolved case. In particular:
• the HTT should be the primary rate reduction stage after L0, to keep the EF
farm size affordable;
• a Region of Interest Engine (RoIE) shall be added to the Global Trigger in
order to calculate the RoIs based on L0;
• the RoIs shall activate Regional Readout Requests for the ITk strips and for
the off-detector data of the ITk PD outer layers;
• L1 would be a regional hardware-based tracking system reconstructing tracks
from this RoI information and using the same HTT components in a different
architecture;
Figure 6.6: Schemes of HTT baseline and evolved solutions. The main differences are the
L1Track, L1CTP and the shift of rHTT from the EF to the L1 Track.
• the Global Trigger should receive these regional track data and combine them
with the same calorimeter- and muon-based objects as foreseen in the baseline
scenario;
• the output of the Global Trigger shall be sent to the L1 CTP which, at the end
of its operations, shall send the Level-1 Accept signal to all the detectors to
let them read out the full event data.
Summarizing the difference between the two scenarios, the rHTT would be moved
into the new Level-1 Track structure, while the gHTT would remain in the EF
sub-system as a co-processor.
It should be said that a recent decision by the ATLAS management seems to reject
the L1-Track solution as described in this form.
6.3 HTT Variant
The evolved solution re-organizes some of the internal sub-systems to be able to
accept a higher rate in the first levels, but has the same trigger output in the
last level. The "variant" solution retains the architecture and rate targets of the
evolved scenario, but assumes that the EF output rate and algorithm implementation
can be handled by the PC farm alone, without a hardware-based component coupled
to it. The scheme on the right side of Figure 6.6, without the HTT rectangle inside
the Event Filter, represents the changes with respect to the evolved solution.
Chapter 7
HTT Alternative Solutions
The HTT setup must achieve two specific goals that are rarely required together:
performing a large number of mathematical operations and fully controlling the
system latency. To obtain them, in the most time-critical blocks of the HTT chain,
two technologies are exploited: ASICs and FPGAs, as mentioned above. However,
today's FPGAs, GPUs and CPUs at the 20 nm, 14 nm or even (in the near future) 7 nm
technology node allow testing and applying algorithmic strategies that were not
affordable before. This part of the thesis covers the implementation on an FPGA
device of the Hough Transform algorithm for the ATLAS Phase-II case study,
as an "almost plug-and-play" alternative to the Associative Memory (AM) ASIC
solution within the HTT structure.
7.1 Hough Transform applied to ATLAS Phase-II
The Hough Transform [16] (HT) algorithm is a method already used in general
pattern recognition. In more detail, it is used to extract lines, straight or
curved, usually from digitized images, or in general from granular matrices. This
type of algorithm is well suited to the ATLAS tracking structures and lends itself
to a high level of parallelization. These two reasons are the main motivations for
implementing this algorithm in an FPGA device. The procedure required by the Hough
Transform algorithm is shown in Figures 7.1 and 7.2 using the straight line formula
\[ y = x \cdot m + q. \tag{7.1} \]
The line can be expressed in terms of the slope and offset (m,q), for example
\[ q = y - x \cdot m, \qquad m = \frac{y - q}{x}. \tag{7.2} \]
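A minimal software sketch of this change of variables is given below. Each point (x, y) votes for all the (m, q) pairs compatible with it through q = y − x·m, and collinear points end up voting for the same cell. The set of slopes, the q bin width and the sample points are arbitrary choices made for illustration.

```python
from collections import Counter


def hough_lines(points, slopes, q_step):
    """Vote in (m, q) space: each point (x, y) adds the line q = y - x * m."""
    votes = Counter()
    for x, y in points:
        for i, m in enumerate(slopes):
            q = y - x * m
            votes[(i, round(q / q_step))] += 1
    return votes


# Four points on y = 2x + 1 plus one unrelated point.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (1.0, 0.0)]
slopes = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
q_step = 0.5

votes = hough_lines(points, slopes, q_step)
(m_bin, q_bin), count = votes.most_common(1)[0]
print(slopes[m_bin], q_bin * q_step, count)  # 2.0 1.0 4
```

The most voted cell recovers the slope and offset of the line, while the unrelated point only contributes isolated votes.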
This change of parameter space from x,y to m,q can be used to recognize and extract
which points x,y are part of a line parametrized with m,q. The concept is based
on filling the new parameter space m,q with the lines generated by the x,y values.
These lines will cross in [m:q] points of the space. These [m:q] points will represent
Figure 7.1: Example of Hough transform based on the straight line formula r = x·cos(θ) +
y·sin(θ). The left graphic has the coordinate y in the vertical axis and x in the horizontal
axis, the right graphic (Hough parameter space) has the coordinate r in the vertical axis
and θ in the horizontal. The two white dots in the Hough parameter space represent the
r and θ line values of the straight lines on the left.
Figure 7.2: Example of Hough transform based on the straight line formula y = x·m
+ b (q in the text). The left graphic has the coordinate y in the vertical axis and x in
the horizontal axis, the right graphic (Hough parameter space) has the coordinate b in
the vertical axis and m in the horizontal. The purple dot in the Hough parameter space
represents the m and b line values of the straight line on the left.
the lines generated by the x,y points in the first space. This concept adapts to any
problem involving straight (or curved) lines. The application proposed here is to
use this algorithm for the ATLAS Pixel Detector tracks. To show the path that led
to the final form of the algorithm, parameterized for the experiment requirements
(already demonstrated by Mikael Martensson in his PhD thesis [15]), we start from
the Lorentz force
\[ \vec{F} = q\vec{E} + q\,\vec{v} \times \vec{B} \tag{7.3} \]
where q represents the charge of the particle, E the electric field, v the velocity
of the particle and B the magnetic field. A charged particle generated in the ATLAS
magnetic field moves along a curved track to which the HT algorithm for straight
lines can be adapted. In particular, using circular coordinates
\[ r(\theta) = \frac{1}{2}\,\frac{(y_1^2 - y_2^2) + (x_1^2 - x_2^2)}{(y_1 - y_2)\sin\theta + (x_1 - x_2)\cos\theta} \tag{7.4} \]
where (x1, y1) and (x2, y2) represent two points and (r cos(θ), r sin(θ)) the
coordinates in






(expressing the momentum in GeV/c). The Hough Transform formula becomes
\[ \frac{qA}{p_T} = \frac{\phi_0 - \phi}{r} \tag{7.6} \]
with A = 3 · 10^{-4} GeV c^{-1} mm^{-1} e^{-1} and approximating for small φ0
regions. Figure 7.3 shows this parameter space change graphically, where
Figure 7.3: On the left: simplified example of a track in the ATLAS ITk detector. Y and
X are the ATLAS coordinates, φ0 represents the angle of the track with respect to the x
axis. On the right: Hough parameter space for the HT ATLAS tuned algorithm.
• q is the particle charge;
• pt is the transverse momentum of the particle;
• φ0 is the angle of the track with respect to the ATLAS X axis in the ATLAS
coordinate system;
• r is the radial position of the particle, i.e. of the input hit;
• φ is the angle of the input hit with respect to the ATLAS X axis.
This formulation allows the physical tracks (actually candidate roads at this
stage) to be described as points in the Hough parameter space. These points are the
intersections of the lines representing the clusters in the detector. In a
nutshell, the Hough Transform applies a change of parameter space to make the most
of a parallel procedure that looks for straight lines in a pixel-based data image.
This is just one version of the HT algorithm, which in fact branches into many
variants specific to different applications, depending on the information to extract.
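The way a set of hits populates the Hough parameter space through Eq. 7.6 can be sketched in software. In the fragment below every hit (r, φ) draws the line φ0 = φ + r·(qA/pT) in a two-dimensional histogram, and each cell stores a bitmask of the layers that voted for it. The 64 × 1200 binning is the one quoted in this chapter; the axis ranges, the hit radii, the toy track and the choice of filling all φ0 bins between the two edges of each qA/pT bin are assumptions of this sketch, not a description of the firmware.

```python
from math import floor

A = 3.0e-4  # GeV c^-1 mm^-1 e^-1, value quoted in the text


def fill_accumulator(hits, x_range, n_x, phi0_range, n_phi0):
    """Fill a Hough accumulator from hits given as (layer, r_mm, phi_rad).

    Each cell holds a bitmask of the layers that voted for it. For every
    qA/pT bin the line phi0 = phi + r * (qA/pT) is evaluated at the two bin
    edges and all phi0 bins in between are marked.
    """
    x_min, x_max = x_range
    p_min, p_max = phi0_range
    dx = (x_max - x_min) / n_x
    dphi = (p_max - p_min) / n_phi0
    acc = [[0] * n_phi0 for _ in range(n_x)]
    for layer, r, phi in hits:
        for ix in range(n_x):
            lo = phi + r * (x_min + ix * dx)        # phi0 at the left edge
            hi = phi + r * (x_min + (ix + 1) * dx)  # phi0 at the right edge
            j_lo = max(0, floor((lo - p_min) / dphi))
            j_hi = min(n_phi0 - 1, floor((hi - p_min) / dphi))
            for j in range(j_lo, j_hi + 1):         # empty if out of range
                acc[ix][j] |= 1 << layer
    return acc


# Toy track (hypothetical values): q = +1, pT = 3 GeV, phi0 = 0.4001 rad.
x_true, phi0_true = 1.0 * A / 3.0, 0.4001
radii_mm = [40.0, 100.0, 160.0, 220.0, 300.0, 400.0, 560.0, 760.0]
hits = [(layer, r, phi0_true - r * x_true) for layer, r in enumerate(radii_mm)]

X_RANGE, N_X = (-3.0e-4, 3.0e-4), 64     # qA/pT range for pT > 1 GeV (assumed)
PHI0_RANGE, N_PHI0 = (0.3, 0.5), 1200    # phi0 window (assumed)
acc = fill_accumulator(hits, X_RANGE, N_X, PHI0_RANGE, N_PHI0)

ix_true = floor((x_true - X_RANGE[0]) / ((X_RANGE[1] - X_RANGE[0]) / N_X))
j_true = floor((phi0_true - PHI0_RANGE[0])
               / ((PHI0_RANGE[1] - PHI0_RANGE[0]) / N_PHI0))
print(acc[ix_true][j_true] == 0xFF)  # True: all eight layers vote for this bin
```

The lines of the eight hits cross in the cell that contains the true (qA/pT, φ0) of the track, which is what makes the candidate road visible as a maximum of the accumulator.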
As mentioned above, the AM ASICs are planned to be mounted on a mezzanine
board connected to the TP ATCA card, and their task is to read the clustered
pixel or strip hits and compare the input data to patterns pre-stored in memory.
If the comparison succeeds, the output clusters are called "roads". In short, a
CAM (Content Addressable Memory) procedure is used to extract roads from the input
clusters by matching them against pre-stored roads. The HT algorithm must accept
the same inputs and produce the same outputs as the AM ASIC, using different
hardware. This has been demonstrated (again by Mikael Martensson [15]) to be
achievable by implementing an HT structure with a specific size for the 2D
histogram used as parameter space, in which the input clusters are drawn using
formula 7.6. The histogram has 64 bins in qA/pt and 1200 bins in φ0. These numbers
come from software simulations comparing the road extraction performance with that
of the AM ASIC. The 64 × 1200 bin matrix is called "the accumulator" of the HT
space. Figures 7.4 and 7.5 show the accumulator for a single muon track, alone and
together with a minimum bias event of 200 pp interactions. The size of the HT space
mentioned above covers the required trigger area of one RoI. There is another
concept to be considered in the application of the HT algorithm, which is related
to the ATLAS performance. It is described graphically in Figure 7.6. This image
shows the accumulator with the binning needed for one RoI. Each bin stores the
information of the clusters coming from different layers. The search for a road is
made by checking 5 bins along φ0 concurrently: the central bin must have 8 clusters
from the 8 layers, the left and right bins at least seven, and the left-left and
right-right bins at least six, and all the clusters counted for a specific bin must
come from different layers. For example, three clusters coming from layers 0, 1
and 2 and three clusters all coming from layer 4 give a total count of four, not
six, because the two clusters coming from layer 4 after the first one are not
counted. This feature is necessary to improve signal efficiency and background
suppression.
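The five-bin road condition and the counting of distinct layers can be sketched as follows. The fragment works on one row of the accumulator (fixed qA/pT, varying φ0) in which every bin is a bitmask of the layers that contributed. Function names and the toy row are illustrative, and skipping the two bins at each edge of the row is an assumption of the sketch. In firmware the five bins are checked concurrently, while the loop below is sequential.

```python
THRESHOLDS = (6, 7, 8, 7, 6)  # left-left, left, central, right, right-right


def mask_from_layers(layers):
    """Bitmask with one bit per distinct layer; repeated layers count once."""
    mask = 0
    for layer in layers:
        mask |= 1 << layer
    return mask


def layer_count(mask):
    """Number of distinct layers stored in a bin."""
    return bin(mask).count("1")


def find_roads(row):
    """Indices of the central bins that satisfy the five-bin condition."""
    roads = []
    for c in range(2, len(row) - 2):
        window = row[c - 2:c + 3]
        if all(layer_count(m) >= t for m, t in zip(window, THRESHOLDS)):
            roads.append(c)
    return roads


# Example from the text: layers 0, 1, 2 plus three clusters on layer 4.
print(layer_count(mask_from_layers([0, 1, 2, 4, 4, 4])))  # 4

# Toy row of seven phi0 bins with 0, 6, 7, 8, 7, 6, 0 distinct layers.
row = [mask_from_layers(range(n)) for n in (0, 6, 7, 8, 7, 6, 0)]
print(find_roads(row))  # [3]
```

Storing a bitmask instead of a plain counter is one simple way to guarantee that several clusters from the same layer are counted only once.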
Figure 7.4: HT accumulator with clusters of a single muon track.
Figure 7.5: HT accumulator with clusters of a single muon track and minimum bias events
corresponding to 200 proton-proton collisions.
Figure 7.6: Representation of the bins constraint to declare a bin a candidate road in the
accumulator. The five counters must increase only if clusters from different layers come
across that bin.
As said, the number of bins of the accumulator and the five-bin road search
are the features required to implement the road extraction and make it comparable
with the physics performance of the AM ASIC solution. The next sections describe
the FPGA technology used, together with the hardware features required to achieve
the performance of the AM ASIC structure.
7.2 FPGA Technology
As mentioned in the introduction to this chapter, the technological progress
achieved in transistor manufacturing is fundamental to making alternative
strategies to the HTT AM ASIC worthwhile in terms of performance. The devices
targeted for the implementation of the algorithm described in the previous section
are the Xilinx FPGAs of the UltraScale+ generation. These devices gain their
performance from a 16 nm technology and from the availability of all the necessary
components. The most important feature of this type of FPGA, enabled by the
technology node, is the low intrinsic delay of the internal components, which
allows highly parallelized logic structures to be built. This advanced technology
node allows very high-speed internal logic and a huge number of components to be
hosted on the FPGA. The market for these devices offers many options at different
prices, with hardware variants suited to different tasks. Another important feature
of this generation of FPGAs is the operating speed of many complex internal
components, such as Digital Signal Processors (DSPs), internal buffers and Block
RAM memories. For example, the maximum DSP frequency achievable by the best
performing Xilinx devices is 891 MHz, depending on the "speed-grade" version of
the FPGA (from 1 (slowest) to 3 (fastest)). All these capabilities are important for
matching the latency requirement mentioned earlier by parallelizing the necessary
operations while respecting the FPGA internal signal propagation timing.
Figure 7.7 graphically shows this latter concept. The figure shows that each "path"
Figure 7.7: Logic description of the FPGA timing closure. In the upper case the signal
from the clock-driven component A to the clock-driven component B is moved without
processing operations; in the lower case the signal goes through a series of asynchronous
digital operations.
(the technical name identifying the signal route between two clock-driven
components) in the FPGA, if synchronous to the clock, must have a delay t (in the
figure, the time from component A to component B) lower than the clock period. This
is mandatory because otherwise the receiving synchronous component would sample the
data before it is updated, and this may lead to erroneous behavior. FPGA technology
has another important, recently achieved feature that allows high-bandwidth
communication (over 10 Gb/s) via intellectual property (IP) blocks of the
manufacturer, generally called "transceivers". The term transceiver indicates a
communication block that can act as transmitter and receiver concurrently. It
consists of two simplex data streams using differential signalling, Low Voltage
Differential Signaling (LVDS), in which the "positive" stream, carrying the data to
transmit, is sent together with its "negative" counterpart (this is done to clean
the data stream). At the receiving end, the transceivers use IP Phase-Locked Loops
(PLLs) and other components to allow a throughput of up to 10, 16, 24, or even 32
Gb/s (depending on the device). FPGAs now have the capability to multiply a
reference clock by up to a factor of 80. The logic 1 and 0 levels used by this fast
technology have a swing (difference between the voltage of 1 and the voltage of 0)
lower than 0.5 V, and the serialization-de-serialization technique is exploited to
exchange data safely, using a clock and data recovery method to synchronize the
phases of the transmitter and receiver. A vast range of issues and roadblocks can
occur when this type of FPGA is pushed to its limits, problems that the compiler of
the tool cannot always solve using default configurations. Many different
strategies can be applied to guide the compiler in solving timing issues and
congestion. A common problem in highly parallelized logic is the fan-out violation,
which occurs when a single signal must drive too many components, creating too
large a routing. This can be solved by forcing the compiler to limit the fan-out to
a defined value. Another important strategy is to direct the compiler to place and
route the more complex structures of the firmware before the others, so that it
focuses on the more intricate structures first. Appendix B describes the FPGA
functionalities in more detail.
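The timing constraint described in this section, a path delay lower than the clock period, can be made concrete with a small numerical sketch. The path names and delays below are invented for illustration, and the check is deliberately simplified: a real static timing analysis also accounts for setup and hold times, clock skew and jitter.

```python
def clock_period_ns(clock_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / clock_mhz


def failing_paths(path_delays_ns, clock_mhz):
    """Names of the paths whose delay is not lower than the clock period."""
    period = clock_period_ns(clock_mhz)
    return [name for name, delay in path_delays_ns.items() if delay >= period]


def max_clock_mhz(path_delays_ns):
    """Highest clock frequency allowed by the slowest path (idealized)."""
    return 1000.0 / max(path_delays_ns.values())


# Hypothetical delays for the two cases of Figure 7.7.
paths = {"A_to_B_direct": 0.8, "A_to_B_through_logic": 3.1}

print(failing_paths(paths, 250.0))      # [] : both delays are below 4 ns
print(failing_paths(paths, 400.0))      # ['A_to_B_through_logic'] : 3.1 ns > 2.5 ns
print(round(max_clock_mhz(paths), 1))   # 322.6
```

The slowest path sets the maximum clock frequency, which is why long chains of combinational logic are usually split with intermediate registers at the cost of additional clock cycles of latency.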
7.3 FPGA Implementation of HT Algorithm
The most suitable way to demonstrate the validity of the tuned HT FPGA structure
under development is to set up a demonstrator as close as possible to the final con-
figuration, including the same technologies required. However, the implementation
path to reach the demonstrator is surely long and full of other important steps to
follow. First of all the code must be validated in deeper simulations and, because of
the scalability of the design, at the full size. The modern tools allow simulating all
the FPGA blocks instantiated with the estimated travel time of the signals inside
the device (times with a 1 ps resolution). Additionally there are third-party sim-
ulation tools with the important feature of being able to simulate different FPGA
families and even different device vendors. The latter is the last step to follow be-
fore programming the real device and running the setup, to avoid all the possible
issues that these software simulations could explore and find. The tracking algo-
rithm that we are studying, in fact, would benefit from simulating the latency, the
power consumption, the temperature resistance, the data flow stability as BER and
so on. In terms of the performance for the road extraction, clusters results, fake road
and cluster rejection, different software studies are also fundamental. Therefore, in
parallel with the firmware design, a software research to fine-tune the parameters
shown in the previous paragraphs is crucial. Finally, the input clusters to use dur-
ing the simulation or demonstrator (which will represent the emulation of the final
system) and the validation of the firmware outputs will have to undergo a suffi-
ciently long run to have enough statistic and study the performance. The official
ATLAS tools for the data simulation and analysis will be exploited not only for
the physics studies mentioned above to fine-tune the algorithm, but also to produce
the test-vectors required for the demonstrator and analyze them. The setup of the
demonstrator planned to be built in the next months is shown in Figure 7.8. The
final hardware implementation plan at present is to substitute the AM ASIC with
the FPGA(s) required to implement a ”near plug-and-play” swap between the
AM ASIC mezzanine (PRM) and an alternative FPGA mezzanine. This means that
the ”Hough Transform” block in Figure 7.8 will have to implement all the logic
reasonably needed to emulate the final version in the most realistic way. The idea might
be to start from the ”Test-vector generator”, an ATLAS software tool that produces a set
of test-vectors to be used in the ”Data generator” device. These vectors will be
Figure 7.8: Logic description of the proposal of the ATLAS Hough Transform demonstra-
tor.
stored on the internal Block RAMs of the FPGA. Then, the firmware planned to be
implemented in a commercial evaluation board will work as an emulator of the final
data transmitter. This board could also be used to receive the output data from the
HT hardware emulator. These results might then be transmitted to a PC (via file or
PCI Express). The fine structure of the emulator, such as the number of input and
output lanes and the choice of the evaluation boards, will be defined in the near
future. In fact, on the market there is a large variety of demo-boards in principle
compatible with our requests. The setup preparation is still in its first steps: the
central firmware has passed several tests, and the remaining steps are to finalize
the design, to perform more realistic simulations with clusters generated by the
ATLAS software tool, and to reach the timing closure of the logic structure on the FPGA,
which allows simulations with the estimated internal delays (called timing simulations).
7.3.1 Logic Structure of the HT implementation on FPGA
The focus of this thesis is the design and implementation of the firmware for the
”Hough Transform” board shown in Figure 7.8, which is one part of the collaborative
effort to build the demonstrator. Besides the constraints shown in
section 7.1, there are other restrictions related to the goal of comparing the firmware
and the results of the tests with the AM ASIC hardware design and performance.
First of all, the clock frequency for the logic should be the same as that of the AM
ASIC, 250 MHz, which is a very high target for an FPGA implementation. The input/output
configuration, which should include transceiver technology because of the speed
required to communicate off-board in the range of 10 Gb/s, consists of eight inputs,
one for each of the eight layers of ITk, and sixteen outputs, related to the road
parallelization. The presumed number of clusters per road is up to sixteen. The last
94 CHAPTER 7. HTT ALTERNATIVE SOLUTIONS
important constraint to respect is the latency. In this case there are two different
latency values to consider: the latency shown above in the entire HTT description
and the value related to the AM ASIC hardware performance. The latter is the time
that elapses between the last set of eight clusters entering the logic and the
first road sent to the output. This latency should not exceed the one achieved by the AM
chip, that is 175 ns. Before going into the detailed description of the firmware design,
the basic idea of the Hough Transform is recalled here.
In our case the HT is based on a formula that generates straight lines in the
Hough parameter space (accumulator) starting from the physical cluster space (r,φ).
In particular, for each input cluster originating from the eight ITk layers, 1200
calculations are performed, corresponding to the 1200 φ0 bins. So, every clock period,
1200 x 8 = 9600 values must be created and the accumulator must be updated with
them. These values represent the points which form the eight lines, one per
cluster. As the entire input event is built of 500 sets of eight clusters, altogether
the system must load 4000 clusters distributed over eight layers. When this process
finishes, the event is loaded and the system latency counting starts. In more detail,
once the accumulator is loaded with the event converted into the Hough space, the
process of searching for the roads starts. This consists of applying the procedure shown
in Figure 7.6. For the central bin, the corresponding coordinates φ0,road and qA/pt,road
are extracted. These coordinates are the ones of the ”road candidate”. From here,
the process of searching back for the input clusters that contributed to that road starts.
This further process is performed by running the Hough formula again, testing,
for all the 4000 input clusters with (r, φ) parameters, which ones generate the
qA/pt,road given the φ0,road. These clusters must be sent out, as said, in parallel on
up to 16 high-speed lanes (transceivers).
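The counts quoted in this paragraph follow from a few multiplications. The short Python sketch below only reproduces that bookkeeping; the constant names are illustrative and are not firmware identifiers, and the clock period assumes the 250 MHz target given earlier in this section.

```python
# Illustrative bookkeeping of the event-loading figures quoted in the text.
# All names are hypothetical; only the numbers come from the text.
PHI0_BINS = 1200        # phi0 bins of the accumulator
LAYERS = 8              # ITk layers, one cluster per layer per clock period
SETS_PER_EVENT = 500    # sets of eight clusters forming one event
CLOCK_MHZ = 250         # target clock frequency of the logic

values_per_clock = PHI0_BINS * LAYERS             # points computed every clock period
clusters_per_event = SETS_PER_EVENT * LAYERS      # clusters to load per event
clock_period_ns = 1000 / CLOCK_MHZ                # 4 ns at 250 MHz
load_time_ns = SETS_PER_EVENT * clock_period_ns   # time needed to load one event

print(values_per_clock, clusters_per_event, load_time_ns)
```

Under these assumptions the loading phase lasts 500 clock periods. It precedes the latency interval defined above, which starts only once the last set of clusters has entered the logic.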
The design chosen to implement this within the listed constraints is shown in
Figure 7.9, which gives an overview of it as a set of several blocks.
Figure 7.9: Overview of the HT firmware logic.
Starting from the left blue rectangle, eight clusters enter concurrently every 4 ns.
Each cluster goes through 1204 Hough operations (formula 7.6) to calculate all the
points needed to draw the line representing the Hough transformation. The number
of operations is the number of bins in φ0 plus 4. This is because,
as shown in Figure 7.6, five bins along φ0 must be checked at the same time to
decide whether a bin [qA/pt:φ0] is a candidate road or not. So, to be
able to check also the first two and the last two bins along φ0, two bins in this
coordinate must be added before the start and two after the end of the range.
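The padding rule can be written down in a few lines. This is only a sketch of the counting argument, with a hypothetical helper name; it assumes an odd window centred on the bin under test, as the five-bin check implies.

```python
def padded_phi0_bins(n_bins: int, window: int) -> int:
    """Number of phi0 bins to compute so that a centred window fits on every real bin."""
    if window % 2 == 0:
        raise ValueError("window must be odd to have a central bin")
    half = window // 2           # guard bins needed on each side of the range
    return n_bins + 2 * half


print(padded_phi0_bins(1200, 5))
```

With 1200 bins and a window of five, two guard bins per side are needed, which gives the 1204 operations per cluster quoted above.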
The results of these operations are then used to activate specific bins in the Hough
transform parameter space and draw the eight Hough lines, one per cluster. The
image of the accumulator differs from the one shown in Figure 7.3 because the
firmware uses one Hough accumulator per layer, a choice driven by
resource usage. In fact, recording in a single accumulator all the clusters
that passed through each bin would require more resources than using
eight accumulators. The chain of qA/pt calculations and accumulator updates
runs until all the clusters are loaded into the HT parameter space. After that,
the road search shown in Figure 7.6 is used to extract one candidate road every
two clock periods. Then, by applying the Hough formula 7.6 with the road
value of φ0 and the values of the clusters r and φ as inputs, the qA/pt is calculated
and compared with the qA/pt of the road. If they satisfy a pre-defined
comparison rule, it means that the cluster [r:φ] is part of that road. This operation
is done simultaneously over all the clusters, now planned to be 4000. Lastly, there
is a block of multiplexers which selects the clusters that passed the comparison and
sends them to the output, sixteen concurrently.
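The search-back step can be illustrated with a small software model. This is a sketch under stated assumptions, not the firmware: the function qapt is a linearised stand-in for formula 7.6 with all constants omitted, the absolute-value comparison and the tolerance are assumed forms of the "pre-defined comparison rule", and all names are hypothetical. The firmware performs the comparisons in parallel, while the model simply loops.

```python
from typing import List, Tuple

Cluster = Tuple[float, float]  # (r, phi) of one input cluster


def qapt(phi0: float, r: float, phi: float) -> float:
    """ASSUMPTION: linearised stand-in for formula 7.6, constants omitted."""
    return (phi0 - phi) / r


def clusters_on_road(clusters: List[Cluster], phi0_road: float, qapt_road: float,
                     tol: float = 1.0, max_out: int = 16) -> Tuple[List[bool], List[int]]:
    """Return the per-cluster comparison mask and the indices sent to the output."""
    # One comparison per stored cluster (parallel in hardware, a loop here).
    mask = [abs(qapt(phi0_road, r, phi) - qapt_road) < tol for r, phi in clusters]
    # At most `max_out` matching clusters are forwarded, one per output lane.
    selected = [i for i, hit in enumerate(mask) if hit][:max_out]
    return mask, selected


clusters = [(10.0, 0.0), (10.0, 50.0), (20.0, 0.0)]
mask, selected = clusters_on_road(clusters, phi0_road=100.0, qapt_road=10.0)
print(mask, selected)
```

Only the first toy cluster reproduces the road value, so it is the only one selected. The cap of sixteen mirrors the number of output lanes.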
A more detailed scheme of the firmware design is shown in Figure 7.10 describing
Figure 7.10: Detailed scheme of the logic blocks interconnections of the HT firmware.
how all the smaller logic blocks are interconnected. This structure will be connected
to the transceivers of the FPGA: eight transceivers will be instantiated to manage the
eight lanes receiving the SSIds in input, with a frequency up to 250 MHz, depending
on the final configuration for HTT, while the output will be structured in sixteen
lanes that simultaneously transmit up to the same number of SSIds (SuperStrip
IDentifiers), again at up to 250 MHz. The encoding protocol has not been decided yet;
the one currently used, since a protocol is required to have a complete timing analysis,
is Aurora 64b/66b (Appendix A). It is used because it is a standard protocol that is not
particularly complex in terms of FPGA resources. These chosen parameters result
in a final bandwidth of 16 Gb/s per lane (64 payload bits per word at 250 MHz), in input
and output, for a total of 128 Gb/s in input and 256 Gb/s in output. The number of bits
actually used to transmit an SSId value is 18, so a different coding protocol should be
used in the final version. A good candidate could be Interlaken (45,46), which has
the interesting feature of being available for both the FPGA vendors targeted by the
HTT project, XILINX and Intel-Altera. However, the FPGA resource usage and the
timing restrictions for these transceiver parameters are mostly negligible. Inside the
transceiver shell there is the core of the firmware, the Hough Transform code shown
in Figure 7.10. It starts with eight SSIds, coming from the high-speed lanes, entering
an engine which applies the coordinate transformation from SSId to radius (r)
and angle (φ) of the cluster. This operation is still to be implemented. After this,
the eight clusters, now in [r:φ] format, are used by the logic which generates the
9600 qA/pt values to update the Accumulator. For digital optimization reasons it
is necessary to apply a set of operations to avoid losing information. Figure 7.11
shows this set. The two r, φ parameters undergo two different processes: φ0,n − φ
and 2^20/r, where n represents the φ0 binning.
Figure 7.11: Scheme of the calculation of the qA/pt for the first sector in the HT firmware.
The second operation is applied because r is encoded with 10 bits (as of today; this may
change in the final configuration), from 0 to 1023, while φ and φ0 use 16 bits, meaning
that, to avoid losing information from the resolution, the numerator must be at least
2^20. Moreover, because the
next operation is a division applied with a digital truncation, the numerator must










The results of these two calculations are then multiplied using a DSP. The result
of this multiplication then follows two paths: one goes to a digital truncation of 20 bits,
which applies a division by 2^20 to remove the numerator, and the other goes to a set of
operations described in Figure 7.12. The latter is part of one of the approaches used to keep
Figure 7.12: Scheme of the calculation of the qA/pt in all the sectors except the first.
The first box on the left states 1200/#sectors.
the FPGA resources under control, called the ”sectors” method. Figure 7.13 shows it,
Figure 7.13: Logic representation of the sector method applied at the Accumulator. The
line in the first sector (on the left) is translated in the others using ∆φ0,n of the sector for
the operation.
where the accumulator is divided in n sectors along φ0. For the first sector the qA/pt
values are calculated using the Hough formula 7.6, while the others are extracted




value (l represents the number of sectors), where ∆φ0,l is constant for each sector
and represents the value of φ0 covered by all the bins of that sector. This reduces
the number of multiplications to process and consequently the DSPs to use. For
example, without using the sectors we have 9600 multiplications, with 2 sectors we
have 4800 operations, with 10 sectors we have 960 operations and so on. In Figure
7.12, the formula 7.8 is always calculated by applying the multiplication by 2^20/r,
for the same reasons as above, and then by adding the result to the qA/pt value previously
calculated. The last operation is again a digital truncation to extract the qA/pt of
the sectors. In other words, a translation of the first part of the line is applied to all
the sectors. All these operations are applied for each φ0 bin and to the eight clusters
concurrently, resulting in 9600 total operations. This concludes the first rectangle on the
left of Figure 7.9. All the qA/pt values, each associated with a specific layer and a
specific φ0 by array addressing, are used as addresses to decide which qA/pt bin to
activate, as shown in Figure 7.14. The number of bins describing the momentum, in today’s
Figure 7.14: Logic description of the uploading of the accumulator.
configuration, is 64, meaning that a vector of six bits describes all possible cases.
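The first-sector arithmetic described above (difference, scaled reciprocal, multiplication, 20-bit truncation) and the use of the result as a bin address can be modelled with integers. The sketch below is an assumption-laden illustration: the scaling constants of formula 7.6 are omitted, the function and variable names are hypothetical, and Python's right shift rounds negative values toward minus infinity, which may differ from the truncation implemented in the firmware.

```python
SHIFT = 20          # numerator 2^20, removed again by the final truncation
QAPT_BINS = 64      # bins along qA/pt, addressed with six bits


def qapt_fixed(phi0_n: int, phi: int, r: int) -> int:
    """Integer model of (phi0_n - phi) * (2^20 / r) followed by a 20-bit truncation."""
    if r <= 0:
        raise ValueError("r must be a positive integer")
    inv_r = (1 << SHIFT) // r            # 2^20 / r, truncated to an integer
    product = (phi0_n - phi) * inv_r     # the multiplication mapped onto a DSP
    return product >> SHIFT              # dropping 20 bits divides by 2^20


def activate_bin(column: list, qapt_value: int) -> None:
    """Use the computed value directly as the index of the bin to set."""
    if 0 <= qapt_value < QAPT_BINS:
        column[qapt_value] = 1


column = [0] * QAPT_BINS                 # one phi0 column of one layer accumulator
activate_bin(column, qapt_fixed(1000, 200, 128))
print(column.index(1))
```

In this toy model, r = 128 reproduces the exact quotient, while r = 100 yields 7 where the exact quotient is 8. This illustrates the kind of information loss from truncation that the choice of the numerator is meant to limit.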
The qA/pt bin width is defined as 1 (0,1,2,3,....,63), which allows this value to be used
directly as the index of the bin to be activated. The accumulator update occurs every clock
cycle thanks to its structure made of Flip Flops, which makes it possible to access all of it
at the same time. This filling of the Hough parameter space continues for all the
clusters, now defined as 4000, eight at a time, requiring 500 clock cycles. After this,
the accumulator is filled and ready to be analysed using the bin clustering shown in
Figure 7.6 (shown in a 3D view, as implemented in the firmware, in Figure 7.15) to extract
the candidate roads, one at a time. Here the second structural approach is used, called
”sliding windows”, shown in Figure 7.16. The concept is to search for roads not in the
whole accumulator at once, but only in smaller windows, portions
spanning ∆qA/pt by ∆φ0 bins. For each of these ”windows”, the firmware
counts, for each bin [qA/pt:φ0], how many layers are active, or in
other words how many clusters crossed that bin, counting only clusters from different
layers. This results in an array of Flip Flops with the size of the window in qA/pt x φ0
bins, in which each 2D bin is represented by two bits instead of eight (one per
layer, as mentioned before). These two bits are used to describe the four
possible cases searched for: 8, 7, 6 or fewer than 6 layers. Figure 7.17 describes it.
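The reduction from eight layer bits to two bits per bin can be sketched as follows. The numeric codes assigned to the four cases are an assumption made for illustration (the actual encoding is the one in Figure 7.17), and the function names are hypothetical.

```python
def layer_code(layer_bits: int) -> int:
    """Map an 8-bit layer mask (one bit per layer) to a 2-bit class.

    ASSUMED encoding: 3 -> 8 layers, 2 -> 7 layers, 1 -> 6 layers, 0 -> fewer than 6.
    """
    active = bin(layer_bits & 0xFF).count("1")   # number of distinct active layers
    if active == 8:
        return 3
    if active == 7:
        return 2
    if active == 6:
        return 1
    return 0


def encode_window(window):
    """Apply the 2-bit encoding to every bin of a window given as rows of layer masks."""
    return [[layer_code(mask) for mask in row] for row in window]


window = [[0xFF, 0x7F], [0x3F, 0x0F]]
print(encode_window(window))
```

Each bin is counted once per layer regardless of how many clusters of that layer crossed it, which matches the statement that only clusters from different layers are counted.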
Figure 7.15: 3D view of the binning constraint to extract the candidate road as imple-
mented in the firmware.
Figure 7.16: Image showing the Sliding Windows approach, where the red rectangle rep-
resents the window moving all over the accumulator.
Figure 7.17: Scheme of the Sliding Window Tower Finder firmware block, with on the
right an example of the 2-bits matrix with the four possible cases.
This second matrix is then used to extract the road candidates by looking for the
bin clustering shown in Figure 7.6, checking five two-bit values at a time,
for the entire matrix at each clock cycle. The road extraction logic is built to lose
at most two clock cycles in the case of a window with zero roads. The
road extractor sends to the next block a pair of values [qA/pt,road:φ0,road] every
two clock cycles. The next block applies the Hough formula as shown before in Figure 7.11,
using as input all the r and φ of the clusters and the φ0,road, so executing 4000 processes
concurrently to calculate qA/pt,calc. The results are then compared with qA/pt,road.
For example, if qA/pt,road - qA/pt,calc < 1, the clusters are considered to have generated
that road, and the results are saved in a 4000-bit vector in which ’1’ indicates that the
comparison was successful. The < 1 comparison has been chosen for simplicity. The
4000-bit vector is finally used as input of a multiplexer to extract the SSIds from
the Memory Bank. This multiplexer has an unusual structure, implemented in such
a way as to reduce the resources used. Figure 7.18 shows this structure. The results
of the 4000 comparisons output by the block Computations and Comparators
are used as selectors of the first multiplexer, in the center of the Figure. With them,
16 values are extracted from a storage of 4000 data (SSID positions in the Figure).
These data represent the positions in the SSID storage of the SSIDs which generated
that road. The SSID position storage is a set of 4000 numbers from 0 to 3999 in 12-bit
format. The output of the first multiplexer is then used as selector of the second,
which extracts the SSIDs which generated the road. Even if at first glance it would
seem resource- and timing-consuming, Table 7.1 shows the difference in estimated
resources compared with the method of using the Comparisons Results as selector
Figure 7.18: Scheme of the multiplexer system for the Cluster Extractor block.
Methods          Look-Up Tables (40 clusters)   Flip Flops (40 clusters)
1 multiplexer    6331                           288
2 multiplexers   1070 + 3504                    112 + 288
Table 7.1: Tests results of the clusters extraction methodologies tested.
of the second multiplexer, bypassing the first. The resources increase approximately
linearly with the number of clusters. Considering the targeted number of total clusters,
4000, roughly 170’000 LUTs are saved. This method increases the latency by 4 clock cycles
with respect to the one with 1 multiplexer. Finally, referring to Figure 7.10, the
area called ”latency zone” is the one including the blocks which can drive the total
latency of the entire event processing. For example, if a certain number of sliding
windows were empty of roads, no roads would be sent to the output during the clock
periods spent processing them, increasing the total latency but not the amount of
information. All the logic is now tuned to be able to extract up to sixteen clusters.
Several parameters have been defined here, from the data format to redundancy
cycles of logic. These numbers have been set at safe values to cover the worst-case
scenario, for example the 4000 clusters per event. An important task for the next
months will be to finalize these parameters.
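The LUT saving quoted for the two-multiplexer Cluster Extractor can be cross-checked from the values of Table 7.1. The sketch below assumes a strictly linear scaling from the 40-cluster test to 4000 clusters, whereas the text only states that the growth is approximately linear; the variable names are illustrative.

```python
# Values from Table 7.1 (measured with 40 clusters).
CLUSTERS_TESTED = 40
CLUSTERS_TARGET = 4000
lut_one_mux = 6331
lut_two_mux = 1070 + 3504        # first plus second multiplexer

# ASSUMPTION: resources scale exactly linearly with the number of clusters.
scale = CLUSTERS_TARGET // CLUSTERS_TESTED
saved_luts = (lut_one_mux - lut_two_mux) * scale

print(saved_luts)
```

The linear extrapolation gives a value of the same order as the roughly 170’000 LUTs stated above; the small difference is compatible with the scaling being only approximately linear.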
7.3.2 Implementation Techniques
With the term implementation I mean all the steps from the firmware code design
in VHDL up to the place and route of the hardware logic developed in a specific
FPGA. This includes the list of components and how they are connected inside the
FPGA. The two general rules to implement any firmware on an FPGA are to not
exceed the available resources and to meet the timing. However, the performance in
these two respects depends drastically on the device itself; even within the same
family the timing analysis results can change. Moreover, the resource
usage is related to the timing closure of the infrastructure and vice versa, so it is
a general convention not to exceed 50 % of the device resources. The FPGA
resources required to implement the full-size HT firmware logic are still to be
defined. The complete scalability of the design allows different sizes of it to be
implemented at different frequencies. Techniques of timing control and place-and-route
management can help to extend the firmware to meet the requests in all of their
forms. From the development and study of this design, the main problem
encountered, together with finding a well-suited architecture, is the timing closure.
In the default design this fails because of the so-called ”congestion”. Firmware
congestion is a phenomenon in which the resources and their interconnections require
too large an amount of physical space in the FPGA, forcing the hardware compiler to
meet the timing for some paths while leaving others behind. This happens because there
is not enough space and configurable hardware density to implement all of it. Figure
7.19 describes it. The first image shows the firmware parameterized with 40 % of the
accumulator implemented, meaning 40 % of the initial qA/pt calculations, 100 % of the
clusters, excluding the Computation and Comparator and Cluster Extractor
blocks (Figure 7.10), and running at 250 MHz. The second image shows the same firmware,
but the two are implemented in different devices, an Alveo U250 evaluation board and a
XCVU19P-FSVB3824-2-e. Table 7.2 shows the difference in resources and timing
analysis results. The timing performance shows that more area to organize the paths
Figure 7.19: The same version of the HT firmware implemented on an Alveo U250 (left)
and on a VU19P (right).








Alveo u250 1’720’000 3’450’000 12’280 - 3.8 ns 70’000 590’000
VU19P 4’085’00 8’175’00 3’800 - 2.3 ns 32’000 590’000
Table 7.2: Results of timing analysis of the same HT firmware version implemented in two
different FPGAs.
and resources allows for better hardware design. This effect is known, and the
development tool gives the possibility to force the compiler to apply different
strategies based on the firmware structure and issues. The Vivado strategy for congestion
is called Congestion SpreadLogic, with three different levels of spreading, depending on
the amount of area that can be occupied. Another technique used against timing errors is
the standard approach of implementing pipelines in strategic points of the logic. The
term pipeline refers to inserting registers (D-Type Flip Flops) within a path, to
reduce the logic distance between two clock-driven components, and so the travel
time of the signal, by splitting the total combinatorial path. The simplicity of this
technique is matched by its versatility, but each pipeline increases the use of
the resources, so there is a limit to its usefulness. Many pipelines placed in several
parts of the firmware showed either a worsening of the timing performance or an
improvement too small compared to the increase in resources. A specific problem
of this firmware is the fan-out, the capability of an output signal to drive many inputs
in parallel, a characteristic also linked to the amount of current the output signal
can supply. Inside an FPGA the signals are copied by ”crossroads” which multiply them
and send them to the inputs of the driven components. The FPGA fan-out limit set
by default is 10’000, which proved to be a good compromise. In the list of all the
possible constraints that can be forced on the compiler there is also the maximum accepted
fan-out. Along with highly specific strategies there are more general approaches,
and two of them have been studied: Performance Retiming, which applies more
algorithms than the default in the routing and physical optimization steps (which
exploit the specific features of the targeted device), and Performance Explore, which
tests many non-default algorithms in all the firmware implementation processes in









Alveo u250 Default 70’000 590’000 - 3.5 ns
Alveo U250 PerformanceExplore 82’000 590’000 - 3.2 ns
Alveo U250 HighCongestion 75’000 630’000 - 3.0 ns
Table 7.3: Results of the timing analysis of the same HT firmware version with different
implementation strategies applied to the firmware compiler.
the same design. Table 7.4 shows some examples of timing analysis results. The
techniques applied here are the placing of pipelines in different positions in the
firmware and forcing the compiler to reduce the allowed fan-out. These were done to
help the compiler by forcing it to build a better-performing distribution of the
components in the FPGA area. The firmware used in all the tests shown in Table 7.4, apart
from the pipelines applied, is the same: 480 φ0 bins, 10 sectors, 250 MHz of frequency
























S. W. Tower Finder
VU19P -2.3 ns 32000/590000 none
VU19P -1.9 ns 25000/880000
pipeline of the
Accumulator to the
S. W. Tower Finder and
qA/pt values (x10) to
the Accumulator




Table 7.4: Several tests done with the pipeline method and the forced fan-out
limitation. Here pipeline means reducing the intrinsic travel time of the paths
between two blocks of the HT firmware (Figure 7.10). For example, pipeline of the
Accumulator means adding a second Accumulator as a buffer, and x10 qA/pt values pipeline
means adding 10 times more Flip Flops for the temporary storage of the qA/pt results,
again as a buffer.
and without the ”Computation and Comparator” and the ”Cluster Extractor” im-
plemented. The resource usage affected by these pipelines is mostly the Flip Flop
amount. Because of the high parallelization of the HT logic, the insertion of pipeline
stages and the reduction of the limit on the fan-out didn’t improve the performance
as much as expected.
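To put the negative slack values of this section in perspective, a common rule of thumb (not stated in the text, so an assumption here) converts the Worst Negative Slack into a rough estimate of the clock frequency the same placed-and-routed design could sustain: the slowest path needs the clock period minus the slack. The function name below is illustrative.

```python
def fmax_estimate_mhz(period_ns: float, wns_ns: float) -> float:
    """Rough maximum clock frequency from the constrained period and the WNS.

    ASSUMPTION: rule-of-thumb estimate; a new implementation run with a different
    clock constraint can place and route the design differently.
    """
    slowest_path_ns = period_ns - wns_ns     # negative WNS lengthens the needed period
    if slowest_path_ns <= 0:
        raise ValueError("slack cannot exceed the clock period")
    return 1000.0 / slowest_path_ns


# 250 MHz constraint (4 ns period) with the -2.3 ns WNS quoted for the VU19P test.
print(round(fmax_estimate_mhz(4.0, -2.3), 1))
```

Under this estimate a WNS of -2.3 ns at a 4 ns period corresponds to a slowest path of 6.3 ns, well below the 250 MHz target, which gives a feeling for how far such a configuration is from timing closure.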
7.3.3 Current Status
The logic described above has been finalized and tested, including at hardware level,
up to the finding of the roads after the accumulator is filled. The cluster extraction
operations are still being optimized. Figure 7.20 shows the latency of the
blocks of the firmware. Here the values are the clock periods required to go from
the inputs of a block to its outputs. For example, in the block ”qA/pt values
calculation” the value ”10” means that the time interval from the entry of the 8 clusters
in the block to the 9600 qA/pt values calculated from them amounts to 10 clock periods.
The blue blocks have been tested with hardware validation; the ”Computation and
Comparator” block reached good performance in latency and must be optimized
at resource level. The ”Cluster Extractor” block is going through the process of
resource and timing optimization, with an expected latency of 32 clock periods in
the worst case. The expected total latency of the whole firmware is 55 clock cycles
in the worst case, which amounts to 220 ns with a clock frequency of 250 MHz. The
Figure 7.20: Detailed scheme of the HT firmware with the latency of each block, in clock
periods.
HT firmware has several parameters which determine whether it can be implemented
on current FPGAs, because they can increase the resources and/or prevent
the timing closure of the system. Furthermore, the variety of devices allows
the same task to be accomplished in different ways. All these create a large parameter space
with different effects in the logic implementation. So an important task together
with the finalization of the firmware design is to study this space and find the best
configuration. Starting with the FPGA family and version, three devices have been
targeted based on their size: the Alveo U250 Acceleration Platform, the VU19P,
and the KU085. Table 7.5 shows their features. The first test done was to place
Device       Look-Up Table   Flip Flop    DSP      Price      Transceivers available
Alveo U250   1’728’000       3’456’000    12’288   ∼ 4000 $   24
VU19P        4’085’760       8’171’520    3’840    ∼ 7000 $   80
KU085        497’520         995’040      4’100    ∼ 2000 $   48
Table 7.5: Features of the three FPGAs targeted in the HT firmware proposal.
and route the same logic in the FPGAs to see the compiler behavior with three
different areas available for the paths. The firmware was the same HT code in all
three, configured with 20 % of the required φ0 bins, 4000 clusters managed in
groups of eight entering in 500 clock cycles, applying 20 sectors, using 40
sliding windows, running at 250 MHz and without the ”Computations and
Comparators” and ”Cluster Extractor” blocks. The devices in Table 7.5 were chosen
also because they are in the average, high and low cost ranges respectively, but with
the same technology, XILINX Ultrascale +. The results showed that only the KU085
device wasn’t able to meet the timing, with a Worst Negative Slack (WNS, the
slack of the slowest path) of -6 ns and 110’000 failing paths out of 300’000 paths
in total. The other two devices were able to meet the timing, with WNS values of
0.077 ns for the ALVEO U250 and 0.014 ns for the VU19P. These results
demonstrate that larger FPGAs allow better path distribution and higher
performance even with the same resources used. These and other similar results with
higher values of the configuration parameters suggested directing the development
toward the VU19P. After the choice of the device, the placement in the FPGA
of the transceivers to be used was studied, showing different results depending on
the case. The transceiver lanes are placed on both the right and left sides of a device
or only on one, depending on their number and on the FPGA size. Each transceiver
has an I/O structure that can be used as input or output, through two
physically different connections. Configuring the firmware so that an input and an
output share the same transceiver can change the timing results depending on the
size of the logic implemented. More resources used require more space and make
it harder to meet the timing. If, for example, the firmware
needs to spread over a certain area of the FPGA but the groups
of inputs and outputs implemented are too close to each other, the
compiler is forced to place the resources in a poorly performing way, without spreading
the architecture enough. Table 7.6 shows these occurrences. As for the number of sectors
into which the accumulator is divided, the value used is 20, even if the 40 sectors case
results in the lowest number of multiplications. Not all numbers
can be used to divide the accumulator into sectors. For example, 40 sectors means 30














Alveo U250 40 10 250 -3.5 80’000 590’000 different I/O
Alveo U250 40 10 250 -3.5 70’000 590’000 I/O shared
Alveo U250 40 10 125 0.077 0 550’000 different I/O
Alveo U250 40 10 125 0.017 0 550’000 I/O shared
Table 7.6: Table showing the results of the tests of different placing of the transceivers.
bins per sector, and consequently 30 multiplications, based on the structure
shown above, for each layer. Then there are the multiplications to calculate
the ∆φ0/r, which are 39 per layer. This results in 69 multiplications per layer for 40
sectors implemented. In the case of 20 sectors there are 79 multiplications per layer,
in the case of 24 sectors 73, and in the case of 50 sectors again 73. However,
fewer multiplications do not mean better timing results, as shown in Table 7.7.
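The multiplication counts quoted above follow a simple pattern: the bins of the first sector plus one ∆φ0/r product for each of the remaining sectors. The sketch below reproduces them; the function name is hypothetical, and requiring the number of sectors to divide the 1200 φ0 bins exactly is an assumed reading of the remark that not every number of sectors can be used.

```python
def multiplications_per_layer(sectors: int, phi0_bins: int = 1200) -> int:
    """Multiplications per layer with the sector method, as counted in the text."""
    if sectors < 1 or phi0_bins % sectors != 0:
        # ASSUMPTION: sectors must contain the same whole number of bins.
        raise ValueError("sectors must divide the number of phi0 bins")
    first_sector = phi0_bins // sectors    # one product per bin of the first sector
    offsets = sectors - 1                  # one delta-phi0 / r product per other sector
    return first_sector + offsets


for s in (20, 24, 40, 50):
    print(s, multiplications_per_layer(s))
```

The counts for 20, 24, 40 and 50 sectors reproduce the values 79, 73, 69 and 73 given in the text.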














VU19P 40 10 250 -2.3 32’000 590’000 Type A
VU19P 40 20 250 -3.6 26’000 590’000 Type A
VU19P 40 20 250 -2 8’000 550’000 Type B
VU19P 40 40 250 -2.6 37’000 570’000 Type B
Table 7.7: Table showing the timing analysis results with different amount of sectors used
and with two different versions of the firmware (differing in the clock managing).
components and configure them manually. This is because the compiler, if it
instantiates the DSPs by default, forces them to complete the mathematical operation in
one clock cycle, whereas the manually configurable mode uses four clock cycles. Less
time to do an operation forces the compiler to build a more complex structure, worsening
the timing results. However, the multiplications can also be implemented using LUTs
instead of DSPs, and this test has been done. The results show that implementing
the multiplications before the accumulator operations with LUTs instead of
DSPs significantly reduces the paths failing the timing and leads to timing closure
without failing paths. The best results achieved until now are shown in Table 7.8.
The version without DSPs implies an increase in LUT usage of about 20 %,
from 480’000 to 590’000, for the number of φ0 bins used. As already mentioned,
the effects of a parameter on the firmware change according to the correlations
with the others, so the campaign to find the best configuration must be continued.
The main options, as of today, driving the hardware logic, excluding the ATLAS
requests, are:
• number of sectors;



















VU19P 40, 20, 250 480 390 -2 8’000 550 multiplications before accumulator with DSPs
VU19P 40, 20, 250 590 420 0.022 0 590 multiplications before accumulator with LUTs
VU19P 20, 20, 250 240 210 0.014 0 260 multiplications before accumulator with DSPs
VU19P 40, 20, 125 480 390 0.077 0 550 multiplications before accumulator with DSPs
Table 7.8: Results of tests with and without DSPs and with different parameters (% φ0
and clock frequency).
• usage of DSPs or LUTs for the multiplications;
• amount and configuration of transceivers lanes available in the FPGA;
The next steps are: to finalize the design so as to include in these tests the last two
blocks of Figure 7.10 (“Computations and Comparators” and “Cluster Extractor”),
to continue the campaign to find the best-performing configuration of the firmware,
and to test the architecture in further depth, for example with more realistic
test-vectors. A technique under development to use a high percentage of the
FPGA logic is to configure the FPGA with several HT firmware instances that are
completely separated, without shared paths. This would make it possible to take
advantage of the timing results of the individual firmware and then replicate it on
the FPGA as many times as the device resources allow. This would exploit the Super
Logic Regions (SLRs) of the FPGA, the largest areas into which the device is divided.
The interconnections between different SLRs perform worse and, as shown in
Figure 7.21, the compiler tries, if possible, to implement a single design in a single
SLR. Here the implementation was for two separate HT logic blocks with 40 % of φ0,
a frequency of 125 MHz and 20 sectors, without the last two blocks. Considering the
VU19P device, the goal is to implement four separate HT firmware blocks in it,
without shared paths, at the highest possible percentage of φ0. For example, this
architecture with 50 % of φ0 bins could fit two Hough Transform firmware blocks,
up to the Sliding Windows Road Finding block, in one device. This could be very
helpful to make the solution more versatile with respect to the ATLAS requests. An
implementation with two completely separated HT firmware blocks, without
“Computation and Comparator” and “Cluster Extractor”, was achieved by testing two
versions configured with 40 % of φ0 bins and 4000 clusters managed, using 20 sectors,
running at 125 MHz, with a WNS of 0.018 ps.
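The WNS values quoted here and in Table 7.8 can be read with a minimal sketch of how a worst slack and a count of failing paths are obtained. It assumes the usual static-timing definition (slack equal to clock period minus path delay) and deliberately ignores clock skew, clock uncertainty and setup time, which a real timing analysis also includes. The function names and the path delays are hypothetical and are not taken from the firmware.

```python
from typing import List, Tuple


def slack_ns(clock_period_ns: float, path_delay_ns: float) -> float:
    """Positive slack: the path meets timing. Negative slack: it fails."""
    return clock_period_ns - path_delay_ns


def timing_summary(clock_mhz: float, path_delays_ns: List[float]) -> Tuple[float, int]:
    """Return (worst slack, number of failing paths) for one clock domain."""
    period_ns = 1000.0 / clock_mhz
    slacks = [slack_ns(period_ns, d) for d in path_delays_ns]
    failing = sum(1 for s in slacks if s < 0)
    return min(slacks), failing


# Hypothetical path delays in ns, chosen only for illustration.
delays = [3.2, 3.9, 4.5]
wns_250, failing_250 = timing_summary(250.0, delays)  # 4 ns clock period
wns_125, failing_125 = timing_summary(125.0, delays)  # 8 ns clock period
print(wns_250, failing_250)
print(wns_125, failing_125)
```

In this toy case the same paths fail at 250 MHz and close at 125 MHz, which is the qualitative trend visible in the table when only the clock frequency is changed.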
Figure 7.21: Placing and routing schemes of two HT logic blocks in the same
FPGA (VU19P), completely separated at timing level. The left image shows the placing,
the right image shows the placing and routing. Because the transceivers used are all in
the upper right part of the FPGA, some paths go from the lower to the upper
part of the FPGA.

To evaluate the correct functionality of the firmware it is necessary to simulate it in
software. The Xilinx tool allows the implementation on the FPGA to be simulated with
the real travel time of the paths, as mentioned above. The complete HT firmware is
currently at the RTL (Register Transfer Level) stage of simulation, meaning
that only the behavior of the code has been tested, not its real performance on the
device. Simulations with the real travel time of the paths have been done for the
design up to the results of the road finding, showing the same output roads found
in the RTL case. Dummy roads are used with a small set of clusters (40, five groups
of eight). Figures 7.22, 7.23, 7.24, 7.25, 7.26 and 7.27 show screenshots of the
simulations done, with the captions describing what is shown. The simulations are
currently done separately on the HT core logic and the transceiver structure. This is
because the latter requires a specific time and operation chain before being able to
acquire and transmit the HT inputs and outputs (calibration of the lane), and this
chain is not finalized yet.

Figure 7.22: Screenshot of the RTL simulation tool showing a line (representing a cluster)
in one layer of the accumulator. The ’1’s represent the bins activated by the cluster. The
vertical axis is qA/pt while the horizontal one is φ0.
Figure 7.23: Screenshot of the RTL simulation tool showing a line (representing a cluster)
in one layer of the accumulator. The ’1’s represent the bins activated by the cluster. The
vertical axis is qA/pt while the horizontal one is φ0. Here the value of r is quite different
from before, showing the different slope of the line.
Figure 7.24: Image showing a window of the accumulator after counting the number
of layers activated per bin. The vertical axis is qA/pt while the horizontal one is φ0.
Figure 7.25: Screenshot showing in the signal cnt_SWRoads the moment a road is found,
in cnt_qApt_out and cnt_phi_out the sliding of the window through the accumulator and
in qApt_end and phi0_end the values of qA/pt,road and φ0,road found.
Figure 7.26: Image showing the signal ResultCompar, which represents the vector listing
the clusters which passed the comparison and so are part of a road. The ’1’s in the
white box represent the number and position in the cluster storage of the
clusters which generate that road.
Figure 7.27: Image showing the clusters in output from the logic. The 16 signals in the
lower-center part of the Figure are the 16 clusters of the dummy test-vectors.

Regarding the first hardware tests, Figures 7.28 and 7.29 show the
results of the HT firmware up to the road finding. The first image shows a road
found continuously every 80 clock cycles (c.c.). This latency comes from the fact
that the 40 windows are checked one every 2 clock periods and there are two roads in
the whole accumulator. The second image shows a probe checking for specific values of
the road. The probe used does not show the values of the Accumulator bins of
the road because that would require too many bits to monitor and the timing of the
firmware implementation would not be met. The signal lasts 4 clock periods
because there are two roads and each lasts two clock cycles. The hardware used is
the Alveo U250, with a firmware configured as: 60 φ0, 250 MHz of frequency, 40
clusters, 40 windows and 20 sectors.

Figure 7.28: Image showing the road found by the logic during the hardware test. Here it
is shown that the road is found every 80 clock periods.
Figure 7.29: Image showing the road found by the logic. Here it is shown that the road
lasts 2 clock periods (there are two roads).
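The 80 clock cycle period of the road signal follows directly from the numbers given above (40 windows, one window checked every 2 clock periods). The short sketch below reproduces that arithmetic and converts it to time for the 250 MHz configuration used in the hardware test. The function names are hypothetical, and the conversion to nanoseconds is a derived figure rather than a measurement from the thesis.

```python
def road_period_cycles(n_windows: int, cycles_per_window: int) -> int:
    """Clock cycles needed to slide once over all the windows."""
    return n_windows * cycles_per_window


def cycles_to_ns(cycles: int, clock_mhz: float) -> float:
    """Convert a number of clock cycles to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


cycles = road_period_cycles(n_windows=40, cycles_per_window=2)
print(cycles)                       # 80
print(cycles_to_ns(cycles, 250.0))  # 320.0
```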
As mentioned before, the transceiver development is still separated from the
HT firmware core. The implementation of the high-speed lanes has been carried out
in collaboration with a Master student (Giacomo Levrini). The scope of our work
was to establish communication between two XILINX Ultrascale+ demo boards
with customized and controlled data generation. The structure of the setup, as
explained in the colleague’s Master thesis [44], is meant to emulate the setup
of the real demonstrator as far as the transceiver communication is concerned. The
tests done concern the implementation of transceiver communication in loop-back mode
using a VCU1525 evaluation board. After successful communication in loop-back,
the plan is to move to two boards sending data to each other. Eight lanes are used
concurrently in the communication, with the Aurora 64b/66b data protocol at
16 Gb/s per lane and the whole system running at 250 MHz.
These emulate the conditions for the HT firmware inputs. For the tests done, a set
of data generated inside the FPGA has been used, a counter from 0 to 2^64-1.
The scope of the tests is to ensure that, with eight transceivers implemented in input
and output on the board, the system completes the calibration of the chosen protocol.
This was achieved during the tests. After the transceiver calibration the generated
data are sent through the high-speed lanes. Finally the data are checked in the
receiver logic. Figures 7.30 and 7.31 show the results. The development is currently
at the loop-back version of the firmware, which still has to be optimized. At present
the data communication achieved allows a set of 32 data to be transmitted. This allows
32 sets of eight clusters, 256 clusters in total, to be transmitted serially to the
HT firmware. The work to stabilize the data stream and move to a two-board setup is
still ongoing.
Concluding, the other parts of the demonstrator are still under development,
with the data generator and the Hough Transform work in progress at the University
of Bologna.

Figure 7.30: Picture of the Vivado interface showing the behavior of the loop-back mode
data stream receiver. The signal diff_1 goes up if the difference between data_out
at timestamp n and data_out at timestamp n-1 is 1. diff_0 goes up if the difference is not
equal to 1. diff_0 should go up every 2^64 clock periods, but at the moment the instability
of the data stream activates it every 32 clock periods.
Figure 7.31: Picture of the Vivado interface showing the behavior of the loop-back mode
data stream receiver. Here there is a zoom of the same data stream to show the values
transmitted.
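The receiver check described in the caption of Figure 7.30 can be sketched in software as follows. This is only a behavioral model written for illustration, not the firmware: the function name is hypothetical, and the flags are modelled after the description of diff_1 and diff_0 (difference between consecutive received words equal or not equal to 1, with a plain, non-modular difference so that the counter wrap-around also raises diff_0).

```python
from typing import Iterable, List, Tuple


def check_counter_stream(words: Iterable[int]) -> List[Tuple[int, int]]:
    """Return one (diff_1, diff_0) pair per received word after the first."""
    flags: List[Tuple[int, int]] = []
    prev = None
    for word in words:
        if prev is not None:
            diff_1 = 1 if word - prev == 1 else 0
            flags.append((diff_1, 1 - diff_1))
        prev = word
    return flags


# A clean counter stream only raises diff_1.
print(check_counter_stream([0, 1, 2, 3]))
# At the wrap-around of a 64-bit counter diff_0 is raised once.
print(check_counter_stream([2**64 - 2, 2**64 - 1, 0, 1]))
# A stream that restarts every few words raises diff_0 periodically.
print(check_counter_stream([0, 1, 2, 0, 1, 2]))
```

With an ideal link only the wrap-around case would appear. A diff_0 pulse with a short period, as in the caption, points to the stream being interrupted or restarted at that period.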
Conclusion
The next years will be crucial for the HEP experiments. The technological advancement
in many fields, from sensors to data processing, opens a vast range of
possibilities. The largest circular hadron collider in the world, the LHC, and the physics
experiments connected to it will exploit this technological evolution. They will also
gain performance from innovative strategies made possible by the new instrumentation
features. The LHC accelerator, together with the four major experiments, ATLAS,
ALICE, CMS and LHCb, has achieved extremely important goals, the best known being
the first detection and measurements of the Higgs boson.
The work presented in this thesis concerns my contribution to the two upgrades
which the ATLAS detector will undergo in the near future and in the next years.
These upgrades address the increasing luminosity of the LHC and the new physics goals.
During the Long Shutdown 2, currently ongoing, I collaborated with the FELIX group on
the commissioning of the FPGA-based FLX-712 readout cards. These boards will
be part of the upgrade of the ATLAS Phase-I TDAQ system, as new data acquisition
technology and methodology. The FLX-712, shown in Figure 7.32, has been
mostly tested in a commercial 4U PC setup. The goal of the FELIX acquisition
system is to combine high-end FPGA technology, exploiting the latest XILINX
generation Ultrascale+, with the commodity architecture of a PC. The commissioning
campaign concluded with 254 cards manufactured and tested. My contribution to
this collaboration has been to develop and follow part of the functionality tests of
the cards. The purpose of the tests is to check the basic functionalities of the FLX-712
immediately after power-up. My work in this collaboration has also been to
provide technical support to the contractor which produced the cards. In 2019
my job was mainly the preparation and tuning of the tests, the solution of
software bugs that occurred during the production and tests, and following with the
FELIX developers the upgrades described in section 4.2. After the first 20 cards were
shipped to CERN in the middle of 2019 and the acceptance tests were done there, the
production continued. In January 2020 I stopped working on it. During the whole period
I assisted the manufacturer remotely and joined it to follow the first tests on the first
cards. I resolved the first issues encountered and reported the status of the production
to the FELIX group. At full regime the contractor used two PCs provided by CERN
to do the tests, with the capability to handle up to 8 FLX-712 cards at a time. The main
problems that occurred in the first 20 cards produced were PCB warping and the
snap-out of the on-board fibers. The other cards produced until now, apart from two
with problems still under investigation and fewer than ten with minor issues,
are working well. This commissioning will enable the LAr and NSW upgrades
for the ATLAS Phase-I.

Figure 7.32: Picture of the FLX-712. 1) FPGA Kintex Ultrascale, 2) MiniPOD, 3) PCIe
Gen 3 connector, 4) MTP connector, 5) TTC mezzanine, 6) PCIe Switch, 7) Power
manager.

After January 2020 I gradually left the FLX-712 production and started a collaboration
with the Hardware Tracking for the Trigger (HTT) group for the ATLAS
Phase-II TDAQ upgrade. Here my contribution has been, and still is, the development
of a firmware design for a proposal for the ATLAS Phase-II TDAQ tracking
trigger. One of the several proposals made as alternatives to the AM ASICs is to
exploit the Hough Transform (HT) tracking algorithm, tuned for the ATLAS
requirements, by implementing it on commercial FPGAs. The HT is a tracking algorithm
principally used to search for straight lines. By applying a parameter-space
transformation, it can be tuned to have, at software level, road finding and fake
rejection performance comparable with the AM ASIC. My work started in February 2020,
with the development of a firmware design to implement the Hough Transform. The
research involved all the aspects of the implementation, from the choice of the FPGA
family to the design concept and the various logic and hardware studies needed to reach
the targeted performance. At the end of May 2020 the design concept, shown in Figure
7.33, was decided. In the following months the concepts were translated into
VHDL code, showing the first hardware results in the last month. Figure 7.34 shows
the setup used for this test. The firmware architecture has been, and is being,
developed using the lowest abstraction level of programming, controlling the number
of resources and which ones to use for each task. This is necessary to avoid a steep
increase in resource usage while maintaining good performance. The
firmware size reaches the limit of two million components used in the FPGA.
The success of the hardware test suggests that the methodologies applied until
now are working well and can be handled by the device. This architecture is part
of a proposal that will be finalized with a demonstrator in the next months, in
late spring 2021. The HT firmware is a completely scalable architecture driven by
many parameters, which determine the resources required and the timing closure
of the system. The next steps in this project are many and of crucial importance.
The two most important are to finalize the implementation of the final part
of the firmware logic and to complete the definition of all the parameters. A big
recent achievement was that the ATLAS management decided in December 2020 to
form two task forces to investigate two different solutions for the HTT project, one
based on custom hardware and one based on commodity hardware. Because of the nature
of my personal contribution to the HT firmware, it fits both scenarios,
not being strictly tied to custom hardware, and its scalability gives it a vast
range of potential applications.

Figure 7.33: Overview of the HT firmware logic.
Figure 7.34: Setup of the first hardware test of the HT firmware.

Appendix A
Aurora 64b/66b
The Aurora 64b/66b is a digital communication protocol using the 64b/66b coding
system, developed for high-speed point-to-point communication. The coding mechanism
of this protocol has a high payload efficiency, 97 %. The protocol specifies the
characteristics of: the electrical connections, the PMA (Physical Medium Attachment),
the PCS (Physical Coding Sub-layer), the Channel Control and the Cyclic Redundancy
Check (CRC). Figure A.1 describes the main Aurora interconnections, which are the
Aurora lane and the Aurora channel, both capable of simplex, half-duplex and
full-duplex operation. The Aurora data stream is managed in frames, which can act as
data, idles or flow control messages.

Figure A.1: Aurora 64b/66b conceptual scheme.

Figure A.2 describes in detail the data flow
chain. Starting from the upper right, the user data enter the engine and the Aurora
Encoding applies the flow control and data management. Before reaching the GearBox,
the 64-bit word from the encoding undergoes a scrambling operation. These words
can carry data or management information. The scrambling is done to distribute more
evenly the 0s and 1s of the word and consequently of the serial data stream. This
operation lets the receiver use the randomized stream as a “simil-clock” stream
and use it to synchronize with the receiver clock source. The scrambled data are
then categorized as data or flow control. In addition to the data to
transmit, 15 types of control blocks are available: the direct commands of idles, channel
bonding, not ready and clock compensation. The others are two types of separators,
two types of flow control, one reserved block and 9 blocks called user k-blocks.
These management data are recognized by the GearBox, which applies to them
the header 10, while the user data have the header 01. Going through some of
them quickly:
• clock compensation is an application command used to prevent data errors
due to too large a difference between the clock recovered from the data and the
local reference clock;
• channel bonding is a block used to synchronize and align to the same phase
different lanes of the same channel;
• separators indicate the end of the current frame;
• user k-blocks are customizable stream control blocks;
• flow control blocks can be “native” or “user” and decide the priority of all the
management blocks transmitted and received.

Figure A.2: Aurora 64b/66b data stream chain in both directions.

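The scrambling and header steps described above can be illustrated with a minimal bit-level sketch. It assumes the self-synchronous scrambler polynomial x^58 + x^39 + 1 commonly associated with 64b/66b coding, and it uses the header convention given in the text (10 for management blocks, 01 for user data). The function names, the bit ordering and the zero initial state are choices made for this example and are not taken from the Aurora specification or from the firmware.

```python
from typing import List, Tuple

STATE_BITS = 58
STATE_MASK = (1 << STATE_BITS) - 1


def scramble(bits: List[int], state: int = 0) -> Tuple[List[int], int]:
    """Self-synchronous scrambler: the state holds the last 58 scrambled bits."""
    out = []
    for b in bits:
        # Taps at delays 39 and 58 (assumed polynomial x^58 + x^39 + 1).
        s = b ^ ((state >> 38) & 1) ^ ((state >> 57) & 1)
        out.append(s)
        state = ((state << 1) | s) & STATE_MASK
    return out, state


def descramble(bits: List[int], state: int = 0) -> Tuple[List[int], int]:
    """Inverse operation: the state holds the last 58 received bits."""
    out = []
    for s in bits:
        b = s ^ ((state >> 38) & 1) ^ ((state >> 57) & 1)
        out.append(b)
        state = ((state << 1) | s) & STATE_MASK
    return out, state


def build_block(payload: List[int], is_control: bool, state: int) -> Tuple[List[int], int]:
    """Scramble a 64-bit payload and prepend the unscrambled 2-bit header."""
    if len(payload) != 64:
        raise ValueError("payload must be 64 bits")
    scrambled, state = scramble(payload, state)
    header = [1, 0] if is_control else [0, 1]
    return header + scrambled, state


payload = [i % 2 for i in range(64)]
block, tx_state = build_block(payload, is_control=False, state=0)
recovered, rx_state = descramble(block[2:], state=0)
print(len(block), block[:2], recovered == payload)
```

The header always contains one 0 and one 1, so every 66-bit word carries at least one transition regardless of the payload, which is what the receiver later relies on to find the word boundaries.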
After the 66-bit word is built, the PCS operations are complete and the PMA
functionalities start by serializing the data (handled by a low-speed or high-speed
transmitter, such as LVDS or transceiver LVDS). At the receiver the PMA applies a clock
data recovery to reconstruct the clock from the data. This is possible thanks to the
scrambling operation done by the transmitter. After the phase alignment the data are
de-serialized and enter the Block sync, an IEEE standard state machine that aligns the
66-bit words. After the alignment a GearBox removes the header and, communicating
with the Channel Bonding and Clock Compensation engines, applies the two
“calibration” functions mentioned above to extract the correct 64-bit user data word.
These two functions are the frame recognition and the clock alignment:
• the frame recognition is used to find the exact position in the serial stream
at which the 66-bit word starts. To do that, the value of the
header of the blocks, which can be 10 or 01, is used. The headers are checked for the
66-bit de-serialized input words. If their value is 10 or 01 for a certain number of
consecutive occurrences (usually 66), then the starting bit of the
frame is considered found. If not, a shift is applied, called in this
case bit-slip, moving by 1 bit towards the Most Significant Bit side or the Least
Significant Bit side. If more than 66 bit-slip operations are done, an error
signal is sent. Figure A.3 shows, on the left side, an example of bit-slip;
• the clock alignment is used to match the phase of the two clock sources of
transmitter and receiver. The Aurora 64b/66b, thanks to its low overhead, is
usually used for high-bandwidth data streaming. The stability of the flow is
crucial. To ensure this stability a rule is applied, described, together with a
bit-slip example, in Figure A.3. The operation ensures that the edge of
the clock used by the receiver (one edge for single data rate, both edges
for double data rate) is placed as close as possible to the center of the serial
bit at the receiver input. This prevents a start or end of the serial data that is too
close to the clock edge from causing a wrong reading at the receiver (if a data
transition is too close to a clock edge, the jitter could cause the receiver to read
the previous bit instead of the next one, behaving randomly). This
phase calibration is applied using de-serialized data. After completing the
frame recognition, it is checked that a pre-defined word is received for a defined
amount of time to ensure that the data stream is stable. If this defined word
is not received for long enough, a delay (a few ps) is applied to the received
serial data or to the clock, to search for the correct phase and achieve the
configuration shown in the lower right part of Figure A.3.

Figure A.3: Left: example of frame searching operation through bit-slip. Right: before
(upper part) and after (lower part) the clock alignment operation.
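The frame recognition through bit-slip described above can be sketched offline on a recorded bit stream. This is a simplified software model under stated assumptions: the real block synchronization is a state machine working on a live stream, while here the whole stream is available as a list of bits, and the function names and default thresholds (66 consecutive valid headers, at most 66 bit-slips) simply follow the numbers quoted in the text.

```python
from typing import List, Optional

BLOCK_BITS = 66


def header_is_valid(b0: int, b1: int) -> bool:
    """Valid headers are 01 and 10; 00 and 11 indicate a wrong alignment."""
    return b0 != b1


def find_block_start(bits: List[int], n_good: int = 66,
                     max_slips: int = 66) -> Optional[int]:
    """Return the number of bit-slips needed to align, or None (error signal)."""
    for slip in range(max_slips + 1):
        positions = [slip + BLOCK_BITS * k for k in range(n_good)]
        if positions[-1] + 1 >= len(bits):
            return None  # not enough data to confirm any further alignment
        if all(header_is_valid(bits[p], bits[p + 1]) for p in positions):
            return slip
    return None


# Toy stream: 5 stray bits followed by ten data blocks with header 01.
block = [0, 1] + [0] * 64
stream = [0] * 5 + block * 10
print(find_block_start(stream, n_good=8))          # 5
print(find_block_start([0] * 66 * 80, n_good=8))   # None
```

An all-zero payload is used here only to keep the example deterministic. With scrambled payloads a wrong alignment can show valid-looking headers by chance, which is why many consecutive valid headers are required before the alignment is accepted.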
To conclude, two types of error are defined based on their severity:
hard errors, catastrophic events such as channel disconnection or hardware failure,
and soft errors, such as the reception of a wrong header after a complete calibration
of the system, possibly caused by statistical issues in long runs and solved with a
further calibration. Hard errors cause a system reset, while soft errors are reported
but do not stop the data stream.
Appendix B
FPGA
A Field Programmable Gate Array is an array of configurable digital and analog
components. The interconnections between them are also configurable, allowing
the design of custom circuits with digital and analog input/output capabilities. These
devices have always been associated with DAQ systems, thanks to their capability to
fully control the latency of each circuit developed. They are extremely versatile
in terms of the technologies they can interconnect (optical fibers, coaxial cables,
ADCs, etc.). The major features of this technology are:
• the achievement of the 7 nm technology node (the first commercial
FPGA with it was made available on the market a few months ago);
• the capability to manage up to 3.5 Tb/s of inputs/outputs;
• all the most common I/O technologies implemented by default in a
vast range of commercial evaluation boards, such as Ethernet, optical fibers, coaxial
connectors, LCDs, general purpose I/O, PCI Express, HDMI, DisplayPort and the
general purpose FPGA Mezzanine Connector (FMC);
• two separate technologies of high-speed I/O for different needs: transceiver
LVDS (up to 32 Gb/s) and low-performing LVDS (up to 1.6 Gb/s);
• a very high number of internal components with various functions, such as Digital
Signal Processors (up to 12000), serializers/de-serializers of two technologies
(transceivers and low-performing), several types of memory, both non-volatile
(ROM) and volatile (RAM), high-versatility memory components such as Flip Flops
(up to 8 million), delay generators, clock generators and cleaners such as the
Phase-Locked Loop, and Look-Up Table function generators to implement boolean
processes (up to 4 million);
• a 1 ps resolution in the timing analysis of all the developed circuits;
• an internal stable frequency achievable up to 1600 MHz, with up to 800 MHz
usable also in Double Data Rate.
FPGAs are divided into banks, each with its own supply voltage and digital logic in
input and output. Some banks allow a customization of the voltage swing between 0 and
1 and also of the current:
• high-range bank: I/O from 3.3 V to 1.2 V (not all possibilities available) with
low-performing speed in LVCMOS (single-ended) and LVDS (differential
signal), with bandwidth up to 1.25 Gb/s in LVDS mode;
• high-performance bank: from 1.8 V to 1.2 V, single-ended and differential,
high-speed performing up to 1.6 Gb/s in LVDS mode;
• Gigabit Transceiver (types X, H, Y): highest bandwidth performance with high
customization of the logic swing (even lower than 300 mV), up to 32 Gb/s in LVDS
transceiver mode;
• Clock Management Tiles: clock manager banks with 1.6 GHz maximum frequency
and jitter < 20 % of the input clock period.
Depending on the I/O standard (Low Voltage CMOS, Low Voltage Differential Signal,
Low Voltage TTL (Transistor-Transistor Logic), Differential Signal TTL, etc.),
currents from 4 to 24 mA are configurable. Many of these possible parameters and
configurations are selectable based on the type of FPGA. For the XILINX vendor there
are three types of FPGA, distinguished by the so-called “speed-grade”, in three
different versions.

Figure B.1: Scheme of the internal structure of an FPGA.

Moving to the inside of the device, Figure B.1 describes the general scheme
of the device, where the Configurable Logic Block (CLB) represents the smallest
configurable custom circuit. The Connect Box (CB) and Switch Box (SB) instead
represent the configurable interconnections between the CLBs. The CLBs differ
depending on the vendor. Generally speaking, they contain all the components needed to
implement internal circuits, such as: Look-Up Tables, Carry logic for arithmetic
operations, Flip Flops, latches and multiplexers. The LUTs are the most peculiar
devices. LUTs have from 2 to 6 input ports and 1 output. They are programmed by the
compiler (or directly by the developer) by specifying the operation to apply to the
inputs as a number from a list of defined accepted values. Up to 500000 CLBs
can be provided by today’s FPGAs. Together with the CLBs, other general devices are
distributed in the FPGA to do all the necessary operations:
• Digital Signal Processors are devices capable of processing up to three vectors of
bits in several arithmetic operations. The important feature of these components
is the operational speed: they handle a clock frequency up to 891 MHz and
can carry out all the arithmetic they are capable of in one clock
cycle (even important and useful calculations such as A+B*C);
• Mixed-Mode Clock Managers and Phase-Locked Loops are the clock managers
and cleaners capable of generating from an input clock an output up to 1600
MHz, able to distribute the clock all over the FPGA and handle up to 8
different clocks generated from one input frequency;
• several types of buffers, from the general buffer to delay a signal, to bank-specific
buffers to distribute low-jitter clocks, to three-state managers;
• Double Data Rate outputs and inputs, to distribute clocks
and signals at DDR precision in defined FPGA areas;
• Block RAMs (up to 3000) with up to 72-bit I/O and up to 36 kbits of space
managed in addresses, capable of running at 750 MHz;
• delays in input and output to dynamically change the phase of clocks and data.
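The way a LUT is programmed with a single number, as described before the list above, can be made concrete with a small software model. It is a sketch under stated assumptions: the configuration number is treated as a truth table whose bit at position i gives the output for the input combination encoding i, with the first input as the least significant address bit. The names make_lut and init are chosen for this example only.

```python
from typing import Callable


def make_lut(init: int, n_inputs: int) -> Callable[..., int]:
    """Build an n-input, 1-output LUT from its truth-table number."""
    if not 2 <= n_inputs <= 6:
        raise ValueError("this model covers LUTs with 2 to 6 inputs")
    if not 0 <= init < (1 << (1 << n_inputs)):
        raise ValueError("init does not fit the truth table size")

    def lut(*inputs: int) -> int:
        if len(inputs) != n_inputs:
            raise ValueError("wrong number of inputs")
        index = 0
        for position, bit in enumerate(inputs):
            index |= (bit & 1) << position  # inputs[0] is the lowest address bit
        return (init >> index) & 1

    return lut


and2 = make_lut(0b1000, 2)  # output 1 only for inputs (1, 1)
xor3 = make_lut(0x96, 3)    # output 1 for an odd number of 1s
print(and2(1, 1), and2(1, 0))        # 1 0
print(xor3(1, 1, 1), xor3(1, 1, 0))  # 1 0
```

Any boolean function of up to 6 inputs fits in one such table, which is why the number of LUTs used is a natural measure of the logic occupied by a design.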
The languages used to develop firmware are Hardware Description Languages such
as VHDL and Verilog, with TCL used to script the tools. The compilation chain that
produces the file to upload to the FPGA to configure it is divided into three major
actions: RTL (Register Transfer Level) netlist production, placing of the netlist
components in the chosen device and routing them together. A netlist is the list of all
the components forming the hardware logic developed and of how they are
interconnected. After the HDL code is written, the compiler builds the netlist based on
the technology and type of components that the FPGA vendor offers. This operation does
not consider the number of components of the specific FPGA targeted, the speed type,
the travel time of the signals through the components or between them or, in general,
the target FPGA specifications. The RTL netlist is then used to implement the logic in
the targeted FPGA, now considering all the parameters ignored before. This occurs in
several steps. First, a physical optimization is applied to simplify the netlist based
on the real device used. Then the resulting netlist is placed in the FPGA,
with all the components placed using Intellectual Property (IP) algorithms to enhance
the timing and area performance. After this, another physical optimization is applied
and then everything is routed, again applying IP algorithms. During
these operations hardware constraints, such as the input clock frequency, paths to treat
with higher priority or to delay, I/O mapping etc., are applied to complete the
compilation and generate the binary file with which the FPGA is programmed through the
JTAG protocol. At the end, a list of all the paths and their real travel time is
provided, which can be used by several tools to run software simulations. It is possible
to intervene in all parts of the compilation to drive the compiler. This is necessary
because the algorithms it applies do not guarantee the successful end of the
compilation. Some of them are described in paragraph 7.2. Area occupied,
number of resources, timing closure (explained in paragraph 7.2) and power
supply requested are some of the constraints that a firmware could undergo. FPGAs
can be programmed in different ways by uploading the binary file to reconfigure the
device in SRAM, EEPROM or Flash memory.
A focus is required for the XILINX
IP technology of the transceivers. Figure B.2 shows the block scheme of the data
processing chain to transmit and receive information with the transceiver technology.
Many blocks of this Figure have the same purpose as the ones shown in the previous
chapter but applied to different coding protocols. In this example, which concerns
the GTH Ultrascale transceiver, the 8b/10b and the 128b/130b coding protocols
are handled, including the Pseudo-Random Bit Sequence (PRBS), a data pattern
usually used for tests such as BER and eye diagram of the high-speed lanes. The
PIPE (PHY Interface for the PCI Express) Control is used for the PCI Express
message management. The polarity of the differential signals is cross-checked inside
the same FPGA device, for example for a loop-back data stream. The PMA layer
describes in more detail the hardware components that control the serial stream:
PISO and SIPO represent the serialization and de-serialization processes
(Parallel Input Serial Output, Serial Input Parallel Output), the Clock Dividers are the
high-performing PLLs which multiply the reference clock by several tens to reach
the serial speed required, the TX pre- and post-emphasis allow to “open the eye” of
the transmission, the Drivers electrically transmit and receive the signals, the RX
equalizer (EQ) acts as adaptive filtering for input signal cleaning in low power mode
(LPM) and high-performing mode (DFE, Decision Feedback Equalization), and Out-Of-Band
(OOB) is used for the Serial ATA (SATA) protocol.

Figure B.2: Detailed scheme of the transceiver XILINX technology.
Appendix C
Boundary Scan and JTAG
FPGAs are chip able to be exploited in many jobs. The standard to communicate
with them from the outside is made by the infrastructure known as Boundary Scan
(BS), which was then called Joint Test Action Group (JTAG) protocol because a
joint of vendors and companies decided to use it as standard for monitoring and
debugging of digital chip. The Boundary Scan is a scan architecture which connects
a register at each I/O outside and inside a device. This approach lets to test each
component of the implemented logic singularly, for example to isolate faulty parts
of the circuit. It’s purpose goes from testing the firmware of a configurable device
to test the hardware itself, for example for broken circuit traces. These tests can be
single shoot of information in the input and checking for the output or pre-saved set
of patterns for more complex testing needs. The BS structure is placed alongside
with the Integrated Circuit or ASIC design. It is made by Boundary-Scan Cells
(BSC), usually one per IC or ASIC pin. The pin signals go freely through the
BSC if it is not active. In boundary mode the I/O signals of the device under test
are intercepted by the BSC. They are all interconnected by a shift register data
stream structure. The testing operations consist on injecting from the BSC output
to the device input the test patterns and read the device outputs by the BSC inputs.
Then the test results are sent to the analysis tool. This single cell structure then is
multiplied to monitor all the I/Os and internal components I/Os. Figure C.1 shows
the scheme of the BSCs interconnections, where on the left the black lines represent
the device under test signal path in case on normal mode, with the BS inactive. On
the right, in BS mode the Boundary Scan Path represents the shift register stream
which lets all the BSCs communicate. The BS structure can handle more ICs or
ASICs concurrently. The Figure C.1 shows also signals as TDI, TCK, etc. They are
related to the data protocol used to manage all the structure, which are described
below. The architecture is based on different types of registers sharing information:
• Test Access Port (TAP) controller, a 16-state finite state machine that controls
the system;
• Bypass register, a 1-bit register used to bypass the test structure;
• Identification register, which is hardwired to identify the device under test;
• User Defined registers, optional, for user custom tests.
Figure C.1: Scheme of the Boundary Scan structure.
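The capture, shift and update mechanism of the Boundary-Scan Cells described above can be illustrated with a minimal behavioural sketch. It is not the logic of any specific device: the class and method names are hypothetical, and each cell is reduced to one shift stage plus one latch.

```python
class BoundaryScanCell:
    """Hypothetical model of one cell placed between a pin and the core logic."""

    def __init__(self):
        self.shift_stage = 0  # stage belonging to the boundary-scan shift register
        self.latch = 0        # value driven towards the device in test mode

    def output(self, pin_value, test_mode):
        # Normal mode: the pin signal passes through untouched.
        # Test mode: the latched test value replaces the pin signal.
        return self.latch if test_mode else pin_value


class BoundaryScanChain:
    """Cells connected in series: TDI -> cell 0 -> ... -> last cell -> TDO."""

    def __init__(self, n_cells):
        self.cells = [BoundaryScanCell() for _ in range(n_cells)]

    def capture(self, pin_values):
        # Sample the current pin values into the shift stages.
        for cell, value in zip(self.cells, pin_values):
            cell.shift_stage = value

    def shift(self, tdi_bits):
        # One clock per bit: the last stage leaves through TDO,
        # every stage moves one position, the new bit enters from TDI.
        tdo_bits = []
        for bit in tdi_bits:
            tdo_bits.append(self.cells[-1].shift_stage)
            for i in range(len(self.cells) - 1, 0, -1):
                self.cells[i].shift_stage = self.cells[i - 1].shift_stage
            self.cells[0].shift_stage = bit
        return tdo_bits

    def update(self):
        # Copy the shifted-in pattern to the latches that drive the device.
        for cell in self.cells:
            cell.latch = cell.shift_stage

    def drive(self, pin_values, test_mode):
        return [c.output(v, test_mode) for c, v in zip(self.cells, pin_values)]
```

In this sketch the first bit shifted in ends up in the last cell, so a pattern read back from TDO appears in reverse pin order.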
The operation signals are:
• Test Data Input (TDI), which carries the test patterns and protocol instructions
into the BS;
• Test Data Output (TDO), which carries the test results and instructions out of
the device under test;
• Test Mode Select (TMS), which drives the 16-state finite state machine of the
TAP controller;
• Test Clock (TCK), which synchronizes all the BS operations;
• Test Reset (TRST), an optional signal which resets the test logic.
The BS or JTAG testing system is now implemented by default in many commercial
devices such as FPGAs, and is accessible through USB or 5-pin connectors.
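The way TMS steers the 16-state TAP controller can be sketched as a lookup table evaluated once per TCK cycle. The state names and transitions below follow the IEEE 1149.1 standard; the Python names (`TAP_TRANSITIONS`, `tap_step`, `tap_walk`) are hypothetical and serve only as an illustration.

```python
# For each state: (next state if TMS = 0, next state if TMS = 1).
TAP_TRANSITIONS = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle": ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan": ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR": ("Shift-DR", "Exit1-DR"),
    "Shift-DR": ("Shift-DR", "Exit1-DR"),
    "Exit1-DR": ("Pause-DR", "Update-DR"),
    "Pause-DR": ("Pause-DR", "Exit2-DR"),
    "Exit2-DR": ("Shift-DR", "Update-DR"),
    "Update-DR": ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan": ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR": ("Shift-IR", "Exit1-IR"),
    "Shift-IR": ("Shift-IR", "Exit1-IR"),
    "Exit1-IR": ("Pause-IR", "Update-IR"),
    "Pause-IR": ("Pause-IR", "Exit2-IR"),
    "Exit2-IR": ("Shift-IR", "Update-IR"),
    "Update-IR": ("Run-Test/Idle", "Select-DR-Scan"),
}


def tap_step(state, tms):
    """Return the next TAP state after one TCK rising edge with the given TMS."""
    return TAP_TRANSITIONS[state][tms]


def tap_walk(state, tms_bits):
    """Apply a sequence of TMS values, one per TCK cycle."""
    for tms in tms_bits:
        state = tap_step(state, tms)
    return state
```

A useful property of this table is that holding TMS high for five TCK cycles brings the controller to Test-Logic-Reset from any state, which is why TRST can be optional.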
Appendix D
Vivado Eye Diagram
Figures D.1 and D.2 show the logic scheme of the Physical Medium Attachment (PMA)
and Physical Coding Sublayer (PCS) structures that the Vivado tool uses to build
the eye diagram. The path starts with the input to the PMA layer. The Equalization
block applies the decision feedback equalization (DFE) or the low-power mode (LPM)
to reduce the attenuation and distortion of the signal caused, for example, by the
transmission cable. DFE mode gives the best performance but is not tuned for low
power consumption. The other inputs, which are then compared with the equalizer
output, are the horizontal offset (horz offset), which delays the sampling time, and
the vertical offset (vert offset), which raises or lowers the differential voltage
threshold against which the equalization output is compared. vert offset is converted
from its digital value to an analog signal before the comparison. The clock recovered
from the data (rec clock) drives the two synchronous Capture Flip Flops. Their
outputs are used to build the parallel data (rdata) after the de-serialization done in
the pink block in the upper right part of Figure D.1. Concurrently, the serial data
(sdata) are also sent as input to this block. rdata and sdata are then transmitted to
the PCS structure, where rdata is used to search for desired patterns and sdata for
error detection.
Figure D.1: Scheme of the blocks and interconnections that make up the PMA layer of
the Vivado eye diagram building structure.
Figure D.2: Scheme of the blocks and interconnections that make up the PCS layer of
the Vivado eye diagram building structure.
The components and signals operate as follows:
• “count qualifier” compares rdata with the values from es_qualifier (masked if
necessary by es_qual_mask), sending the results to the sample counter (via a
prescaler) or to the error counter;
• “error counter” records the number of errors over time;
• “sample counter” counts the total number of sampling cycles;
• “prescaler” scales down the “count qualifier” output, so that each increment of
the “sample counter” corresponds to a fixed multiple of “count qualifier”
outputs;
• “state machine” controls the system, driving it to perform an eye scan or to
capture a snapshot of rdata and sdata.
All of this information is then sent as input to the Dynamic Reconfiguration
Port (DRP) system. This is a processor-friendly synchronous interface for changing
and monitoring primitive parameters dynamically, used to update and check the
transceiver parameters of the HDL code.
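How the two counters and the prescaler combine into a bit error rate for one point of the eye diagram can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, and the prescale factor and the number of bits per sampling cycle are left as parameters because their values depend on the transceiver configuration, which is not given here.

```python
def eye_scan_ber(error_count, sample_count, prescale_factor, bits_per_sample):
    """Estimate the bit error rate for one (horizontal, vertical) offset point.

    Assumption: every increment of the sample counter stands for
    `prescale_factor` sampling cycles of `bits_per_sample` bits each.
    Returns (ber, is_upper_limit).
    """
    total_bits = sample_count * prescale_factor * bits_per_sample
    if total_bits <= 0:
        raise ValueError("no samples were accumulated")
    if error_count == 0:
        # No error seen: only an upper limit of one error in total_bits can be quoted.
        return 1.0 / total_bits, True
    return error_count / total_bits, False
```

Repeating this estimate over a grid of horizontal and vertical offsets gives the two-dimensional map that is displayed as the eye diagram.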
After the PhD period
In the first months of 2021 the development of the firmware went forward. Some of
its features underwent conceptual modifications, but thanks to its versatility the
structure of the firmware stayed almost the same. New simulation studies using
the updated detector geometry have produced new sizes of the Accumulator used
in the Hough Transform, together with new constraints for extracting valid roads
from it. Furthermore, a large number of new ideas, such as newly requested features,
new possible physics studies and alternative solutions with or without the Hough
Transform, have been presented as case studies and developments in progress. These
few pages give a brief summary of the status of the Hough Transform firmware
presented in this thesis, now known as the Bologna/Uppsala version.
An important achievement was the completion of the hardware tests of the firmware,
obtained by implementing the whole firmware on an FPGA, the Alveo U250, in separate
small parts, just to test the functionalities. The tests went well. One of the most
important conceptual changes applied to the firmware, which does not affect the code
or the FPGA resource usage by a relevant amount, is the use of the information
produced by the clustering of the ITk detector hits instead of the SSIDs, in order to
avoid redundancy in the data flow. With this change the firmware is built as a possible
replacement for the AM ASICs and part of the Data Organizer. A further conceptual
update comes from the HTT simulation group, through new simulation studies with
the up-to-date detector geometry. These new analyses produced new statistics, from
which new sizes of the Accumulator have been extrapolated. The simulations are still
being fine-tuned; however, two major cases, including the Accumulator size shown in
the thesis, are now under development in the Hough Transform firmware:
• 216 φ0 bins by 216 qA/pt bins;
• 64 φ0 bins by 216 qA/pt bins.
These sizes have a positive effect on the firmware resources, since they involve a
smaller number of φ0 bins and, in general, a smaller Accumulator. Further results of
these simulations concern a newly requested firmware feature: the discrimination of
the roads and of the related clusters in slices along the Z coordinate. The slicing of
the detector in Z has a peculiar structure, so that hits/clusters can belong to more
than one Z slice. This in turn causes a steep increase (even by a factor of 20) in the
number of roads that could be found in the Accumulator. For this reason a
discrimination in Z must also be applied. The solution currently adopted is to inject
into the firmware clusters that have already been separated according to the Z slice
in which they lie, performing the separation before the input of the Hough Transform.
Different solutions are not excluded in the future. Two versions of the Z slicing,
produced by the new simulations, are under study so far: 6 or 18 slices. Combined
with the two Accumulator sizes, they give three cases for the development of the
processing of a complete event:
• 216 φ0 bins by 216 qA/pt bins and 6 slices in Z: up to 278 clusters in one layer
and on average 361 roads per event;
• 216 φ0 bins by 216 qA/pt bins and 18 slices in Z: up to 98 clusters in one layer
and on average 162 roads per event;
• 64 φ0 bins by 216 qA/pt bins and 6 slices in Z: up to 98 clusters in one layer
and on average 302 roads per event.
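The idea of separating the clusters by Z slice before filling one Accumulator per slice can be sketched in software. This is a simplified illustration, not the firmware logic: it assumes the linearised relation φ0 = φ + r · (qA/pt) for the Hough Transform, and the function names, the parameter ranges and the `min_layers` road threshold are hypothetical.

```python
from collections import defaultdict


def bin_center(index, lo, hi, n_bins):
    return lo + (index + 0.5) * (hi - lo) / n_bins


def bin_index(value, lo, hi, n_bins):
    """Return the bin containing value, or None if it is out of range."""
    if not lo <= value < hi:
        return None
    return int((value - lo) / (hi - lo) * n_bins)


def split_by_z_slice(clusters):
    """clusters: iterable of (layer, r, phi, z_slices).

    A cluster listed in several Z slices is copied into each of them.
    """
    per_slice = defaultdict(list)
    for layer, r, phi, z_slices in clusters:
        for z in z_slices:
            per_slice[z].append((layer, r, phi))
    return per_slice


def find_roads(clusters, n_phi0, n_qpt, phi0_range, qpt_range, min_layers):
    """Fill one Accumulator and return the bins hit by enough distinct layers."""
    accumulator = defaultdict(set)  # (phi0 bin, qA/pt bin) -> layers seen
    for layer, r, phi in clusters:
        for j in range(n_qpt):
            q_a_over_pt = bin_center(j, *qpt_range, n_qpt)
            # Assumed linearised Hough relation: phi0 = phi + r * qA/pt.
            i = bin_index(phi + r * q_a_over_pt, *phi0_range, n_phi0)
            if i is not None:
                accumulator[(i, j)].add(layer)
    return sorted(b for b, layers in accumulator.items() if len(layers) >= min_layers)
```

Each Z slice is then processed independently, for example with `find_roads(per_slice[z], 64, 216, ...)` for the 64 by 216 case, so a cluster shared by two slices contributes to both Accumulators.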
These Hough Transform versions are now under validation by the two Task Forces
that work in parallel in the HTT project and must, by the end of May, write two
reports on the two alternative solutions: one based on custom hardware and one
based on commodity commercial hardware. The status of the firmware development,
for both the first version studied and the new ones, has been presented and is
discussed in the weekly meeting ‘Hough Transform Discussion’, with the firmware
stored and managed in a GitLab repository (Hough Transform repository). The work
goes on, now focused on the timing closure of the three new versions.
