Development of the FPGA-based Raw Data Preprocessor for the TPC Readout Upgrade in ALICE by Klewin, Sebastian
Dissertation
submitted to the
Combined Faculties of the Natural Sciences and Mathematics
of the Ruperto-Carola-University of Heidelberg, Germany
for the degree of
Doctor of Natural Sciences
put forward by
Sebastian Klewin (M.Sc. Physics)
born in Friedrichshafen, Germany
Oral examination: April 17th, 2019

Development of the FPGA-based Raw Data Preprocessor
for the TPC Readout Upgrade in ALICE
Referees: Prof. Dr. Johanna Stachel
Prof. Dr. Peter Fischer

Development of the FPGA-based Raw Data Preprocessor for the TPC
Readout Upgrade in ALICE
ALICE is one of the four major experiments at the Large Hadron Collider (LHC). It is
the dedicated heavy-ion experiment and therefore primarily examines the Quark–Gluon
Plasma. To prepare for the running conditions of 50 kHz lead-lead interactions at
the LHC after the Long Shutdown 2 (2018–2021), an extensive upgrade program is being
carried out. The goal of the upgrade is a continuous readout of the TPC without the need
for a trigger. It is essential to reduce the enormous data rate of 3.7 TB/s generated by
the upgraded detector by a factor of about 60 already during data taking. Otherwise, the
data volume would exceed the expected available bandwidth and storage capabilities.
In this thesis, an online Cluster Finder (CF) was developed and implemented for FPGAs,
which processes the whole data volume in real time during the readout. This is the first
step in the data reduction sequence and already achieves a factor of about 5 by keeping
only physically relevant information and using a better-suited data format. In
addition to the CF, the whole data preparation chain was designed and implemented
to decode the input data stream, to re-sort the individual channels to allow for cluster
finding, and to correct the detector effects in the input signals. All implemented
modules were extensively simulated to verify their proper functionality. With this, the
complete processing chain within the FPGAs was prepared and validated.

Contents
1 Introduction and Motivation
1.1 The Large Hadron Collider
1.2 ALICE at the LHC
1.2.1 Inner Tracking System
1.2.2 Time-Projection Chamber
1.2.3 Transition Radiation Detector
1.2.4 Time-Of-Flight detector
1.3 Detector and Readout Upgrades for the LHC Run 3
2 TPC with a Continuous Readout
2.1 Working principle of Time-Projection Chambers
2.2 The ALICE Time-Projection Chamber
2.2.1 The Field Cage
2.2.2 The MWPC Readout Chambers
2.3 Readout Upgrade of the ALICE TPC
2.3.1 Upgrade Goals
2.3.2 GEM based Readout Chambers
2.3.3 Upgrade of the Front-End Electronics
2.3.4 Towards an Online Track Reconstruction
2.4 Timeline of the TPC Upgrade
3 Readout Strategy for the ALICE TPC
3.1 The TDR Baseline
3.1.1 The Common Mode Effect
3.1.2 Application of a loss-less Data Compression
3.1.3 Change of the Readout Scheme
3.2 The SAMPA Chip
3.3 The GBT System
3.3.1 The Transmission Protocol
3.3.2 The SAMPA Data within the GBT Frame
3.4 The Common Readout Unit
4 The CRU Firmware
4.1 A Modular Firmware Concept
4.2 Interfaces to the User Logic
4.2.1 GBT Wrapper
4.2.2 DMA Engine
4.2.3 TTS Interface
5 A 2D Cluster Finder for the TPC
5.1 Overview of the TPC User Logic
5.2 Data Preparation
5.2.1 Decoding of the GBT Frames
5.2.2 A Two-Stage Sorting
5.2.3 Common Mode Calculation
5.2.4 Baseline Correction
5.3 Cluster Reconstruction
5.3.1 Determining the Cluster Size
5.3.2 The Concept of the Cluster Finder
5.3.3 The Peak Finding
5.3.4 Width of the Cluster Finder Instances
5.3.5 The Cluster Finder Module
5.3.6 Optimising the Cluster Finder Grid
5.3.7 Calculating the Cluster Properties
5.3.8 Cluster Merging Network
5.4 Resource Consumption
6 Performance and Validation
6.1 Performance of the User Logic Modules in Simulation
6.1.1 Decoding of the GBT Frames
6.1.2 Sorting Algorithm
6.1.3 Baseline Correction
6.1.4 The individual Cluster Reconstruction Modules
6.1.5 The complete Clusteriser
6.2 Validation during Test Beam Data Taking
6.2.1 The C-RORC as Readout Card
6.2.2 The Triggered Readout Mode
6.2.3 Validation of the GBT Decoder
6.3 Cluster Reconstruction in Software
7 Conclusion and Outline
Appendix
A Acronyms
B Pad Plane Mapping
C The Raw Data Header
D Bibliography
E Acknowledgments
List of Figures
1.1 The CERN accelerator complex
1.2 A schematic layout of the ALICE detector
2.1 Schematic view of the ALICE TPC
2.2 Cross-section of a MWPC based ROC
2.3 Simulation of the gas amplification of two electrons in a GEM hole
2.4 Energy resolution and IBF for different GEM voltage settings
3.1 The main components of the TPC readout chain
3.2 Schematic drawing of the origin of the CM effect
3.3 Remaining bias of the ADC value and additional CM noise after correcting with the averaging method
3.4 Example of a small Huffman tree with only five codes
3.5 The probability spectrum of the ADC values of the ALICE TPC
3.6 Compression factors achieved by the length-limited Huffman
3.7 Compression factor of the Huffman encoding as function of the occupancy
3.8 Block diagram of the SAMPA ASIC
3.9 The SAMPA synchronisation pattern in DAS mode
3.10 The GBT protocol
3.11 Mapping of the SAMPA data into the GBT frame
3.12 Image of the CRU version 1
3.13 Layout of the ALICE underground area
4.1 The main blocks of the CRU FW
4.2 Synchronisation register chain
4.3 The transmission protocol towards the data path wrapper
4.4 Mapping of the data bus into the FLP memory
5.1 A simplified block diagram of the TPC UL
5.2 Detailed mapping of the SAMPA output into the GBT frame
5.3 Block diagram of the GBT decoder
5.4 The four valid ADC clock sequences
5.5 The four possible HW sequences
5.6 Data interface of the GBT decoder
5.7 Excerpt of the IROC pad plane with the SAMPA channels
5.8 Illustration of a sorting network
5.9 A configurable pre-sorting module based on a ping-pong RAM
5.10 Block diagram of the CM calculation module
5.11 Block diagram of the Arria10 native FP DSP IP core in three different configuration modes
5.12 Block diagram of the BLC module
5.13 Extension of the clusters in time direction as a function of the drift length
5.14 General concept of the cluster finding approach
5.15 The definition of a charge peak
5.16 Maximum number of peaks within a CF instance
5.17 Used fraction of available CCs as a function of pads and time bins per CF
5.18 Block diagram of the CF module
5.19 Reading order of the CF
5.20 The CF grid covering all required pads
5.21 The Cluster Data Format
5.22 Block diagram of the CP
5.23 Mapping of the cluster data to the CP input FIFO
5.24 The FIFO merging network
5.25 Visualisation of the utilisation of the FPGA
6.1 Simulation of a complete GBT decoder with ModelSim
6.2 The content of the pre-sorter mapping files
6.3 Setup of the row-segment merger simulation test bench
6.4 Simulation of the BLC module with ModelSim
6.5 Distributions of the cluster properties for the simulation system with 23 CF instances, an occupancy of 1 % and a cut on qmax and the contribution threshold
6.6 Distributions of the cluster properties for the simulation system with 23 CF instances, an occupancy of 30 % and a cut on qmax and the contribution threshold
6.7 Efficiency of the clusteriser
6.8 Input and results of the full clusteriser simulation
6.9 Filling level of the four final FIFO mergers as a function of time
6.10 Image of the C-RORC with its major components
6.11 Spectrum of the ADC values recorded at the test beam
6.12 The Callgrind output of the software CF, visualised with KCachegrind
B.1 The pad plane of the IROC
B.2 The pad plane of the OROC 1
B.3 The pad plane of the OROC 2
B.4 The pad plane of the OROC 3
C.1 The RDH version 3
List of Tables
3.1 The two different transmission modes of the SAMPA in DAS mode
3.2 Bits of the GBT frame of the individual data groups in wide bus mode
3.3 Resources of the Intel Arria10 FPGA of the CRU
5.1 Content of the GBT decoder data output
5.2 Diffusion coefficients and drift velocities for electrons in different gas mixtures
5.3 Geometric parameters of the pad plane of a TPC sector
5.4 Content of the data word stored in the CF memory
5.5 Sequence of the CF output data stream
5.6 Mapping of the cluster data to the DSP ports
5.7 Fit result summary of the complete FW with the TPC UL
5.8 Resource consumption of the individual modules of the TPC UL
B.1 Key parameters of the pad plane regions

Chapter 1
Introduction and Motivation
A few microseconds after the formation of the universe, the medium is assumed to have
been in a state of very high temperature and density [1]. Under such conditions, matter is
expected to exist in a state called Quark–Gluon Plasma (QGP), in which the quarks and gluons
are deconfined [2]. The only known way to generate this state of matter in a laboratory
in order to study it is to collide heavy nuclei at ultra-relativistic energies. One of the few
places where this is possible is the Large Hadron Collider (LHC) at CERN (Conseil Européen
pour la Recherche Nucléaire, European Organisation for Nuclear Research) [3].
During the first years of LHC operation, the ALICE (A Large Ion Collider Experiment)
collaboration was able to confirm the basic picture of the creation of strongly interacting
matter in heavy-ion collisions at temperatures and densities never reached before [3].
In order to continue the studies of the QGP by examining its properties, such as viscosity,
transport coefficients or the temperature evolution, ALICE will focus on rare probes after
the Long Shutdown (LS) 2. These rare probes include heavy-flavour particles and their
coupling to the medium. By measuring the azimuthal anisotropy of charmed mesons and
baryons as well as of beauty particles, one can conclude whether the heavy quarks
thermalised in the system and thus participated in the collective flow. Another probe is
the measurement of jets and their correlation with other probes. The properties of the QGP
can be accessed by investigating the energy loss of hard-scattered partons in the strongly
interacting medium. Measuring the production of different quarkonium states down to
very low transverse momentum will help to understand the underlying mechanisms of the
formation of those states. The bulk properties and the space–time evolution of the hot and
dense medium can be accessed by measurements of low-mass dileptons. It is possible to
detect the electromagnetic radiation that is produced during all stages of the heavy-ion
collision, either as real photons or as dilepton pairs. Since photons and dileptons do not
interact strongly with the medium, they carry information about the entire evolution of the
system [3]. LS 2, which gives the experiments the opportunity to upgrade their detectors,
started in December 2018 and will end in early 2021 [4].
The excellent tracking performance even in a high-multiplicity environment, as well
as the Particle Identification (PID) capabilities over a large momentum range provided by
the individual detector systems of ALICE, are unique at the LHC. However, many of the
physics topics require high event statistics for precision measurements. Since many of
the anticipated measurements involve complex probes at very low transverse momentum, for
which a traditional triggering approach cannot be used to enhance the statistics, the ALICE
experiment will undergo a high-rate detector upgrade. After this upgrade, the experiment
will be able to examine practically all heavy-ion collisions delivered by the LHC at a rate of
50 kHz. This approach differs considerably from the current readout strategy and from that
of the other LHC experiments, where potentially interesting events are selected by means
of a trigger [3]. With the detector upgrade, ALICE will improve its low-momentum vertexing
and tracking capabilities in order to also measure complex probes at low transverse
momentum. The data-taking rate will be increased significantly; in fact, the readout will
become continuous, without the need for a dedicated trigger. Apart from that, the
performance, especially the identification of charged particles, will be preserved [3].
The upgrade program for the ALICE Time-Projection Chamber (TPC) consists of a
new detector readout, including new Readout Chambers (ROCs) based on Gas Electron
Multiplier (GEM) technology and completely new Front-End Electronics (FEE), allowing
for continuous data taking. The continuous readout results in a huge data rate of
3.7 TB/s, which must be reduced by a factor of about 60 already during the readout.
Otherwise, it would exceed the expected available bandwidth and storage capabilities of the
experiment [5]. The reduction will be achieved by reconstructing the detector signals online
and writing only processed data to the permanent storage instead of raw detector
signals. An essential part of the online reconstruction is the cluster finding, in which charge
clusters have to be found in the data stream of the TPC before tracking can be applied.
The development of this Cluster Finder (CF), which will run on Field Programmable Gate
Arrays (FPGAs) as part of the readout chain, was the main topic of this thesis, together
with all necessary data preparation steps that must be applied beforehand.
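As a rough illustration of the numbers quoted so far, the following sketch computes the data rate that remains after the overall reduction by a factor of about 60, and the intermediate rate after the cluster finding alone, for which the abstract quotes a factor of about 5. The variable and function names are hypothetical, decimal units (1 TB = 1000 GB) are assumed, and the factor attributed to the later stages is simply the quotient of the two quoted factors rather than a figure taken from the TDR.

```python
# Back-of-the-envelope data-rate budget for the TPC readout.
# The input numbers are the approximate values quoted in the text; the names
# and the decimal unit convention (1 TB = 1000 GB) are assumptions of this sketch.

RAW_RATE_TB_S = 3.7        # raw data rate of the upgraded TPC
TOTAL_REDUCTION = 60.0     # required overall reduction factor (approximate)
CF_REDUCTION = 5.0         # factor achieved by the cluster finder (approximate)


def reduced_rate_gb_s(raw_rate_tb_s: float, factor: float) -> float:
    """Return the data rate in GB/s after reducing the raw rate by `factor`."""
    if factor <= 0:
        raise ValueError("reduction factor must be positive")
    return raw_rate_tb_s * 1000.0 / factor


rate_after_cf = reduced_rate_gb_s(RAW_RATE_TB_S, CF_REDUCTION)
rate_to_storage = reduced_rate_gb_s(RAW_RATE_TB_S, TOTAL_REDUCTION)
# Reduction that the steps after the cluster finder still have to provide.
remaining_factor = TOTAL_REDUCTION / CF_REDUCTION

print(f"after cluster finding:        {rate_after_cf:.0f} GB/s")
print(f"to permanent storage:         {rate_to_storage:.1f} GB/s")
print(f"factor left for later stages: {remaining_factor:.0f}")
```

Under these assumptions, the cluster finder alone brings the stream down to several hundred GB/s, and the subsequent reduction steps have to contribute roughly another order of magnitude.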
The thesis is divided into seven chapters. The remainder of chapter 1 introduces the LHC
and provides a short description of the ALICE experiment, together with a few examples of
detector upgrades. Chapter 2 is dedicated to the description of the TPC and a
more detailed explanation of the individual upgrade tasks required for this detector to
achieve the continuous readout, together with an estimated timeline. Chapter 3 explains
the readout chain of the upgraded TPC with its individual components and presents the
reasons for a change in the readout strategy with respect to the one described in the
Technical Design Report (TDR) for the upgrade [5]. This is followed in chapter 4 by a
description of the general design concept of the Firmware (FW) for the Common Readout
Unit (CRU). This card will be used to receive the data from the FEE and is responsible for
the first preprocessing steps in the readout chain. Chapter 5 describes the
individual modules that are needed in the FW to successfully reconstruct two-dimensional
charge clusters in the FPGA. This includes all data preparation steps, such as decoding,
sorting and a Baseline Correction (BLC), as well as the peak finding and the calculation of
the cluster properties. The chapter concludes with the overall resource consumption of the
logic in the chip. Chapter 6 presents the validation of the previously described modules
in simulation and, wherever possible, additionally in a real readout system that was
used during a test beam campaign in 2017. The CF, meaning the peak finding and the
calculation of the cluster properties, was also implemented in software, which is described
in the last part of that chapter. The final chapter 7 summarises the achievements of this
thesis and gives a short outlook on possible future developments.
Figure 1.1: The CERN accelerator complex. The LHC is the last step in a cascade of
accelerators, starting with the LINAC 2 for protons and LINAC 3 for ions. Protons are
further accelerated by the PSB while the ions go to the LEIR. Both then continue with
the PS and the SPS before finally reaching the LHC. Taken from [7].
1.1 The Large Hadron Collider
ALICE is one of the four major experiments at the LHC, which is located at CERN near
Geneva, Switzerland. Founded in 1954, CERN has been pursuing its mission to enable research
in fundamental physics and to push the frontiers of science and technology. The LHC is the
world's largest collider and reaches the highest collision energies achieved so far: 13 TeV in
proton-proton (pp) and 5.02 TeV per nucleon pair in lead-lead (PbPb) collisions. It is a
superconducting hadron accelerator and collider and was installed in the existing 26.7 km
tunnel of the previous Large Electron Positron (LEP) machine. The LEP tunnel lies between
45 m and 170 m below the surface, and its plane has an inclination of 1.4 %, sloping towards
Lake Geneva. The ring actually consists of eight straight sections and eight arcs, in which
1 232 superconducting dipole magnets are installed, providing a nominal magnetic field of up
to 8.33 T to keep the beams on track [6]. The LHC is the last element in a series of successive
accelerators, each increasing the energy of the particles further. This accelerator chain is
shown in figure 1.1.
Protons originate from a simple hydrogen gas bottle; an electric field is used to
strip the electrons from the hydrogen atoms, leaving bare protons. These are then
accelerated by a linear accelerator (LINAC 2) to 50 MeV, after which they are injected into
the Proton Synchrotron Booster (PSB), which increases the energy to 1.4 GeV. The PSB is
followed by the Proton Synchrotron (PS), which accelerates the protons to 25 GeV. The last
step before the LHC is the Super Proton Synchrotron (SPS), which increases the energy
further to 450 GeV. In the LHC, the protons are then accelerated to the final energy of up to
6.5 TeV per beam. The ions take a slightly different path. The lead ions are generated
from a vaporised solid lead block and enter LINAC 3 for a first acceleration, followed by the
Low Energy Ion Ring (LEIR), which takes them from 4.2 MeV to 72 MeV per nucleon. From
there, they are injected into the PS and follow the same path as the protons to end in the
LHC [8].
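The proton chain described above can be summarised in a small data structure. The following sketch lists the energy reached after each stage as quoted in the text, derives the gain contributed by every accelerator, and checks that two 6.5 TeV beams colliding head-on give the 13 TeV centre-of-mass energy stated for pp collisions. The list layout and all names are assumptions made for illustration; the lead-ion chain is not modelled.

```python
# Proton injector chain as described in the text, with the beam energy reached
# at the exit of each stage (in GeV). The layout and all names are assumptions
# of this sketch and do not correspond to any CERN software.

PROTON_CHAIN = [
    ("LINAC 2", 0.050),   # 50 MeV
    ("PSB", 1.4),
    ("PS", 25.0),
    ("SPS", 450.0),
    ("LHC", 6500.0),      # 6.5 TeV per beam
]


def energy_gains(chain):
    """Return (stage, factor) pairs: energy gain relative to the previous stage."""
    gains = []
    for (_, e_in), (name, e_out) in zip(chain, chain[1:]):
        gains.append((name, e_out / e_in))
    return gains


def centre_of_mass_energy_tev(beam_energy_gev: float) -> float:
    """Centre-of-mass energy in TeV of two equal-energy beams colliding head-on.

    For ultra-relativistic beams this is twice the energy of a single beam.
    """
    return 2.0 * beam_energy_gev / 1000.0


for stage, factor in energy_gains(PROTON_CHAIN):
    print(f"{stage:>7}: energy increased by a factor of {factor:.1f}")

print(f"pp centre-of-mass energy: {centre_of_mass_energy_tev(PROTON_CHAIN[-1][1]):.0f} TeV")
```

Each stage raises the energy by roughly one order of magnitude, which is why a cascade of machines is used rather than a single accelerator.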
The particle beams are brought to collision at four interaction points in the LHC, around
which the four major experiments are built to study the reaction products of the collisions.
These four experiments are:
ATLAS (A Toroidal LHC ApparatuS): A general-purpose detector located at interaction
point 1. The detector was designed for the search for the Higgs boson, which has
since been discovered [9, 10]. The search for hints of new physics beyond the standard
model, such as decays of supersymmetric particles and extra dimensions, is also part
of the physics program [11].
ALICE (A Large Ion Collider Experiment): The dedicated heavy-ion experiment at the
LHC. The detector is installed at interaction point 2 and was designed for the detection
and the investigation of the QGP [12]. A more detailed description is given in the
following section.
CMS (Compact Muon Solenoid): Also a general-purpose detector, installed at interaction
point 5. The objectives are similar to those of ATLAS, but different detector
techniques are used for a complementary measurement [13].
LHCb (LHC beauty): The dedicated experiment for heavy-flavour physics, located at
interaction point 8. The experiment is built to study CP violation and rare decays
of bottom and charmed hadrons in order to find indirect evidence of new physics. Since
these hadrons are predominantly produced at small angles with respect to the
beam pipe in pp collisions, the construction of the experiment differs strongly from
that of the other major experiments: LHCb is built as a single-arm spectrometer [14].
Three further, smaller experiments can be found at the LHC. LHCf (LHC forward) is
located on both sides of interaction point 1 at a distance of approximately 140 m from the
collision point. It measures neutral particles emitted in the very forward direction [15].
TOTEM is located along the beam pipe at distances of ±147 m and ±220 m from interaction
point 5. It measures the total cross-section in pp collisions and studies the elastic and
diffractive scattering independent of the luminosity [16]. The last one is MoEDAL
(Monopole and Exotics Detector At the LHC), whose task is a direct search for magnetic
monopoles. The experiment is built at interaction point 8 [17].
1.2 ALICE at the LHC
As the dedicated heavy-ion experiment at the LHC, ALICE is designed to cope with the
high particle densities expected in such collisions. The high granularity allows
charged-particle multiplicities of up to dNch/dy ≈ 8000 to be measured, starting from a
minimum transverse momentum of pT ≈ 0.15 GeV/c. Owing to the individual sub-detector
concepts, PID is possible up to 20 GeV/c [18].
Figure 1.2: A schematic layout of the ALICE detector, showing the individual detectors.
The central barrel, surrounded by the red L3 solenoid, is on the left side and the muon
spectrometer is on the right side. Taken from [19].
The ALICE detector is divided into the central barrel, fully contained within a solenoid
magnet, and the muon arm on the C-side. This layout is shown in figure 1.2. The A-side is
the half of the detector towards which the anti-clockwise circulating beam moves, whereas
the C-side is the opposite half, towards which the clockwise circulating beam moves. The
solenoid provides a magnetic field of 0.5 T and was inherited from the L3 experiment at LEP.
The muon spectrometer starts inside the central barrel with the hadron absorber, followed
by the first two muon tracking stations. A third tracking station is enclosed in a dipole
magnet next to the solenoid. This is followed by two more tracking stations and an iron
absorber wall. Behind this absorber are the two muon trigger chambers. The beam pipe,
which has to pass through the whole spectrometer, is surrounded by low-angle absorbers.
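To give a feeling for the scales involved, the radius of curvature of a charged particle in the solenoid field can be estimated with the standard relation r [m] = pT [GeV/c] / (0.2998 · |q| · B [T]). The sketch below evaluates it for the 0.5 T field and for the transverse momenta quoted at the beginning of this section. The function name and the choice of example momenta are assumptions made for illustration; energy loss and multiple scattering are ignored.

```python
# Radius of curvature of a charged particle moving in a homogeneous solenoid
# field, using r [m] = pT [GeV/c] / (0.2998 * |q| * B [T]).
# Function name and example momenta are assumptions of this sketch.

SPEED_OF_LIGHT_FACTOR = 0.299792458  # converts GeV/c per (T * m) for unit charge
SOLENOID_FIELD_T = 0.5               # field of the L3 solenoid quoted in the text


def curvature_radius_m(pt_gev_c: float, b_field_t: float, charge: int = 1) -> float:
    """Return the bending radius in metres in the plane transverse to the field."""
    if b_field_t <= 0:
        raise ValueError("magnetic field must be positive")
    if charge == 0:
        raise ValueError("neutral particles are not bent")
    return pt_gev_c / (SPEED_OF_LIGHT_FACTOR * abs(charge) * b_field_t)


for pt in (0.15, 1.0, 20.0):
    radius = curvature_radius_m(pt, SOLENOID_FIELD_T)
    print(f"pT = {pt:5.2f} GeV/c -> r = {radius:7.2f} m")
```

A particle at the lower momentum limit of about 0.15 GeV/c bends with a radius of roughly 1 m, comparable to the dimensions of the central barrel detectors, whereas a 20 GeV/c track is almost straight on that scale.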
The detectors of the central barrel are dedicated to tracking and PID. From the inside out,
the Inner Tracking System (ITS) comes first, surrounding the interaction point. The
ITS is enclosed by the TPC and the Transition Radiation Detector (TRD), followed by the
Time-Of-Flight (TOF) detector, all with full azimuthal coverage. Further out, with
reduced coverage, there are three calorimeters, the ElectroMagnetic CALorimeter (EMCal),
the Di-jet CALorimeter (DCal) and the PHOton Spectrometer (PHOS), and a ring-imaging
Cherenkov detector (HMPID, High Momentum Particle IDentification). Further detectors
are used for the event characterisation: T0, V0 and the Zero Degree Calorimeter (ZDC).
The Photon Multiplicity Detector (PMD) and the Forward Multiplicity Detector (FMD) are
used for particle multiplicity measurements. The ALICE COsmic Ray DEtector (ACORDE)
is located on top of the L3 magnet and allows triggering on muons from cosmic ray
showers. In the following, the four main detectors of the central barrel are described in
more detail.
1.2.1 Inner Tracking System
The its is the innermost detector of alice. It combines three different technologies of
silicon detectors in six cylindrical layers covering a radius from 4 cm to 44 cm. From the
inside out there are first two layers of Silicon Pixel Detector (spd), then two layers of
Silicon Drift Detector (sdd) and then two layers of Silicon Strip Detector (ssd). Its main
task is to provide the primary vertex reconstruction with a resolution of better than 100 µm.
In addition, the sdd and ssd can be used for pid via the dE/dx signal of the traversing
particles due to the analog readout of the four outer layers [12].
1.2.2 Time-Projection Chamber
The tpc is the main tracking and pid device of alice. It consists of a cylinder whose
axis is aligned with the beam axis and ranges from an inner radius of about 85 cm to an
outer radius of about 250 cm. The cylinder has a length in z-direction of 500 cm, giving an
active volume of ∼87 m³. The tpc covers the full azimuth and a pseudorapidity range of
|η| ≲ 0.9 for full tracks. The pseudorapidity is defined as
\[ \eta = -\ln\left[\tan\left(\frac{\theta}{2}\right)\right], \tag{1.1} \]
with θ being the polar angle with respect to the positive z-axis. The z-axis points from the
collision point towards the a-side of the detector, while the y-axis points upwards, to the
surface. For completeness, the x-axis points horizontally towards the centre of the collider.
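Equation (1.1) can be evaluated numerically to translate the quoted acceptance limits into polar angles. The following is a minimal sketch, assuming θ is taken with respect to the beam (z) axis; the function names are illustrative and not part of any alice software.

```python
import math


def pseudorapidity(theta_rad: float) -> float:
    """Pseudorapidity eta = -ln(tan(theta / 2)) for a polar angle in radians."""
    if not 0.0 < theta_rad < math.pi:
        raise ValueError("theta must lie strictly between 0 and pi")
    return -math.log(math.tan(theta_rad / 2.0))


def polar_angle(eta: float) -> float:
    """Inverse relation theta = 2 * atan(exp(-eta)), returned in radians."""
    return 2.0 * math.atan(math.exp(-eta))


if __name__ == "__main__":
    # Acceptance limits quoted in the text for the central barrel detectors
    for eta in (0.0, 0.84, 0.9):
        theta_deg = math.degrees(polar_angle(eta))
        print(f"eta = {eta:4.2f}  ->  theta = {theta_deg:6.2f} deg")
```

A particle emitted perpendicular to the beam has η = 0, and |η| = 0.9 corresponds to a polar angle of roughly 44° with respect to the beam axis.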
The volume of the tpc is split by a central electrode at 100 kV into two drift regions.
The resulting drift field of 400 V/cm leads to a drift time of 94 µs for electrons traversing
the maximum drift length. The rocs of the tpc are installed on both end plates of the cylinder. During
Run 1 (2009–2013) and Run 2 (2015–2018) of the lhc their design was based on Multi-Wire
Proportional Chambers (mwpcs) with a cathode pad readout. The replacement of those
chambers by new ones based on gem technology is the essential part of the upgrade program
and described in chapter 2. The end plates are subdivided into 18 trapezoidal sectors, each
covering 20° in azimuth. Since the track density depends on the radius, the requirements
for the rocs differ as a function of the radius. In order to meet these different requirements,
each sector is divided into two separate chambers, an Inner Readout Chamber (iroc)
which extends from 84.1 cm to 132.1 cm and an Outer Readout Chamber (oroc) from
134.6 cm to 246.6 cm [20]. The basic working principle of a tpc is described in section 2.1.
1.2.3 Transition Radiation Detector
The trd is the next detector of the central barrel in radial direction. It covers the full
azimuth as well and a pseudorapidity range of |η| < 0.84. The subdivision into 18 sectors,
the so called Super-Modules (sms), was done to follow the segmentation of the tpc. Each
sm consists of five stacks in z-direction, each stack of six layers of trd chambers. This
sums up to a total number of 522 individual chambers (18 · 5 · 6 = 540 minus 3 · 6 = 18), since
the central stacks of three sms were kept empty to reduce the material budget in front of the phos detector. The layers
are arranged at a radial distance between 2.90 m and 3.68 m from the beam axis. Each
chamber consists of a foam/fibre radiator, followed by a 3 cm drift region and a Xe-CO2
filled mwpc. The electronics to read out the individual pads is directly mounted on top of
the chambers. The radiator is used to generate the Transition Radiation (tr) photons,
which are produced when a particle crosses the boundary between two media with different
dielectric constants. Since the production probability of the tr at a single boundary is
very low (of the order of the fine structure constant α = 1/137), many boundaries are
needed for a significant photon yield. To absorb the tr photons efficiently, which extend
into the x-ray domain due to the high γ factor∗ of the highly relativistic particles, the
chamber is filled with a high-Z gas (Xe).
By having a sufficiently high sampling rate, temporal information is extracted that
represents the depth in the drift volume at which the signal is created. With that, the
signal of the tr photons, which are preferentially absorbed at the entrance of the chamber
close to the radiator, can be distinguished from the signal resulting from the specific
energy loss of the charged particle, which is uniformly distributed along the particle
trajectory.
To separate electrons from other charged particles, in particular pions, the presence of
the tr signal for electrons, or its absence for other particles owing to their lower γ
factor, can be used. An additional criterion for the separation is the higher dE/dx
signal for electrons due to the relativistic rise of the specific energy loss. The overall
momentum resolution of the tracking in the alice central barrel is improved by including
the six extra space points of the trd at larger radii. Thanks to the fast readout and the
online reconstruction capabilities, the trd has also been used to trigger on electrons with
a high transverse momentum and on jets [21].
1.2.4 Time-Of-Flight detector
The tof detector also adopts the segmentation of the tpc into 18 sectors and covers
the full azimuth. It is located adjacent to the trd at radii of 370 cm to 399 cm from the
beam axis. Like the other central detectors of alice, it covers a pseudorapidity range of
|η| ≲ 0.9. The detector technology employed is the Multi-gap Resistive-Plate Chamber
(mrpc). The individual chambers consist of resistive glass plates with a high and uniform
electric field, due to an applied voltage of 13 kV. Any initial ionisation generated by a
crossing charged particle immediately starts an avalanche, leading to the observed signal
on the pick-up electrodes. Since there is no electron drift involved, the time jitter is
caused only by the fluctuations in the growth process of the avalanche [12]. With this
mrpc technology, a time resolution of better than 50 ps is achieved [22]. The time of
flight is calculated as the difference between the tof signal and the collision time (either
measured by the t0 detector, taken from the lhc central timing, or determined by tof itself
if enough tracks are present). With it, pid can be performed, since the flight time is
characteristic for each particle species at a given momentum.
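To illustrate why the flight time separates particle species at a fixed momentum, the sketch below evaluates t = L / (βc) with β = p / √(p² + m²). It rests on assumptions that are not taken from the text: a straight flight path of 3.7 m (roughly the inner tof radius, ignoring the track curvature), a momentum of 1 GeV/c, and rounded literature masses.

```python
import math

C_M_PER_NS = 0.299792458  # speed of light in m/ns

# Rest masses in GeV/c^2 (rounded literature values, used for illustration only)
MASSES = {"pion": 0.13957, "kaon": 0.49368, "proton": 0.93827}


def flight_time_ns(p_gev: float, mass_gev: float, path_m: float) -> float:
    """Time of flight t = L / (beta * c) with beta = p / sqrt(p^2 + m^2)."""
    beta = p_gev / math.hypot(p_gev, mass_gev)
    return path_m / (beta * C_M_PER_NS)


if __name__ == "__main__":
    path_m = 3.7  # assumed straight path length in m
    p_gev = 1.0   # assumed momentum in GeV/c
    t_pion = flight_time_ns(p_gev, MASSES["pion"], path_m)
    for name, mass in MASSES.items():
        t = flight_time_ns(p_gev, mass, path_m)
        delay_ps = 1e3 * (t - t_pion)
        print(f"{name:6s}: t = {t:7.3f} ns, delay w.r.t. pion = {delay_ps:7.1f} ps")
```

Under these assumptions a kaon arrives roughly 1.3 ns after a pion of the same momentum, far more than the quoted time resolution of 50 ps. The separation shrinks quickly towards higher momenta as β approaches 1 for all species.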
∗The Lorentz factor is defined as $\gamma = \sqrt{1 + \left(\frac{p}{m_0 c}\right)^2} = \frac{E_\mathrm{tot}}{m_0 c^2}$, where $c$ is the speed of light, $p$ the
momentum, $m_0$ the mass and $E_\mathrm{tot}$ the total energy of the particle.
1.3 Detector and Readout Upgrades for the LHC Run 3
To be able to read out all PbPb events at an interaction rate of 50 kHz, as foreseen
for lhc Run 3, and in order to improve the vertexing and tracking capabilities of alice,
several detector upgrades are planned. These upgrades will be carried out during ls 2,
lasting from December 2018 to spring 2021. The upgrade program starts with a new
beam pipe with a smaller diameter, continues with a completely new, high-resolution and
low-material its and a major upgrade of the tpc by replacing all mwpc based rocs with
new ones employing the gem technology. The readout of trd, tof, phos and the muon
spectrometer are upgraded and optimised for the very high rate. The forward trigger
detectors are upgraded as well, and the software framework for the online systems, the
offline reconstruction and the physics analysis is completely rebuilt [3]. It is also foreseen
to add the new Muon Forward Tracker (mft) to the acceptance of the muon spectrometer.
This is a silicon pixel detector placed in front of the absorber which increases the pointing
accuracy of the extrapolated muon tracks to reliably measure their offset with respect to
the primary vertex of the collision [23].
One of the key ingredients in improving the general impact parameter resolution of the
experiment is the new beam pipe, whose radius is reduced from previously 29 mm to 18.2 mm.
This allows the innermost layer of the new its to be moved closer to the interaction point,
to a radius of 22.4 mm [24]. In addition, the material budget is strongly reduced by utilising
the Monolithic Active Pixel Sensors (maps) technology (radiation length of only 1.1 % X0
per layer). The new all-pixel its has seven layers at radial distances of 22.4, 30.1 and
37.8 mm for the inner barrel and 194.4, 243.9, 342.3 and 391.8 mm for the outer barrel. Together,
these changes will improve the track position resolution at the primary vertex by a factor of three or better [25].
The exchange of the rocs for the tpc allows for a continuous readout. This will make it
possible to record all events at the 50 kHz PbPb collision rate, but requires new
electronics as well. The upgrade is described in more detail in chapter 2.
The trd fee can only be operated in a triggered, single event readout mode. Implementing
a continuous readout would require exchanging these electronics, which means a
complete disassembly and reconstruction of all 18 sms to replace the front-end boards
mounted on each chamber. Since this is not realistically feasible, a different approach will
be implemented. A reduction in event readout time by reducing the data volume will
increase the possible trigger rate of the trd. By transmitting only the so called tracklets
(preprocessed track segments in a single detector chamber) instead of the raw data, it will
be possible to record more than 70 % of the events in the 50 kHz PbPb collision scenario.
Besides that, no issues concerning the detector stability or space charge effects due to the
increased multiplicity are expected [26].
The tof upgrade program likewise foresees changes only in the readout tree. Measurements
have shown that the intrinsic rate capability of the mrpcs is not an issue and that they
can easily sustain the expected particle fluxes of 100 Hz/cm². The current tof readout
is already able to work with a trigger rate of tens of kHz. However, the upgrade aims to
further increase this limit for both PbPb and pp interactions [3].
Chapter 2
TPC with a Continuous Readout
This chapter covers first the basic working principle of tpcs. Afterwards the alice tpc is
introduced as it was used during Run 1 (2009–2013) and Run 2 (2015–2018) of the lhc.
The readout upgrade of the tpc is presented together with the timeline at the end of this
chapter, to put the developments of this thesis into context.
2.1 Working principle of Time-Projection Chambers
A charged particle with high enough momentum traversing the active volume of a tpc will
ionise the atoms and molecules on its path. Due to an applied electric field, the drift field
$\vec{E}_d$, the electrons will move towards the Readout Chambers (rocs) and the ions in the
opposite direction, towards the central cathode. After arrival in the rocs the electrons
are amplified and the generated signal can be read out. The x and y-coordinates of the
arriving electrons are then determined by the location of arrival, e.g. via a segmented pad
plane. The z-coordinate is calculated from the arrival time $t_a$. By knowing the time $t_0$
when the charged particle crossed the tpc, for example from an external trigger detector,
and by knowing the drift velocity $v_d$ of the electrons in the medium used, the coordinate in
z-direction can be calculated as $z = (t_a - t_0) \cdot v_d$. If a sufficient number of space points
is obtained, the 3-dimensional trajectory of the incident particle can be reconstructed.
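The reconstruction of the z-coordinate from the arrival time can be sketched in a few lines. This is only an illustration of the relation z = (t_a − t_0) · v_d; the function name and the example numbers are assumptions, and the returned value is the drift distance measured from the readout plane.

```python
def z_from_drift(t_arrival_us: float, t0_us: float, v_drift_cm_per_us: float) -> float:
    """Drift distance z = (t_a - t_0) * v_d in cm, measured from the readout plane."""
    drift_time_us = t_arrival_us - t0_us
    if drift_time_us < 0:
        raise ValueError("arrival time precedes the reference time t0")
    return drift_time_us * v_drift_cm_per_us


if __name__ == "__main__":
    # Hypothetical example: signal arrives 40 us after the trigger time,
    # drift velocity of 2.65 cm/us (the value quoted for the alice tpc)
    print(f"z = {z_from_drift(50.0, 10.0, 2.65):.1f} cm")
```

The sketch makes the dependence on the reference time explicit: without a known t_0, only differences in z between signals can be determined from the arrival times.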
A tpc can also have Particle Identification (pid) capabilities via the signal amplitude.
If the read-out signal is proportional to the originally deposited charge, the particle can be
identified via the specific energy loss dE/dx, which is described by the Bethe formula [27],
in combination with the momentum information. Therefore, it is beneficial to place such a
tpc in a magnetic field $\vec{B}$. First, the trajectories of the charged particles are curved due
to the Lorentz force. By measuring the curvature of the trajectory, the particle's momentum
can be obtained directly with the tpc. Second, a magnetic field parallel to the drift
field, $\vec{B} \parallel \vec{E}_d$, reduces the transverse diffusion of the electron cloud along the
drift path and therefore has a focusing effect [28]. Orienting the two fields parallel to each
other also has the advantage that the Lorentz force does not act in the drift direction,
only perpendicular to it.
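The link between track curvature and momentum follows from the standard relation p_T = |q| · B · R for a charged particle in a homogeneous field, which in practical units reads p_T [GeV/c] ≈ 0.3 · |q| · B [T] · R [m]. The sketch below is a generic illustration of this relation, not the alice track fit; the field value of 0.5 T is the solenoid field quoted earlier in the text.

```python
# Conversion constant: p_T [GeV/c] = K * |q| * B [T] * R [m], with K = c * 1e-9
K_GEV_PER_T_M = 0.299792458


def pt_from_radius(b_tesla: float, radius_m: float, charge: int = 1) -> float:
    """Transverse momentum in GeV/c from the bending radius of the track."""
    return K_GEV_PER_T_M * abs(charge) * b_tesla * radius_m


def radius_from_pt(pt_gev: float, b_tesla: float, charge: int = 1) -> float:
    """Bending radius in m for a given transverse momentum."""
    return pt_gev / (K_GEV_PER_T_M * abs(charge) * b_tesla)


if __name__ == "__main__":
    b_field = 0.5  # solenoid field in T
    for pt in (0.1, 1.0, 10.0):
        print(f"pT = {pt:5.1f} GeV/c  ->  R = {radius_from_pt(pt, b_field):7.2f} m")
```

A singly charged particle with p_T = 1 GeV/c bends with a radius of about 6.7 m in a 0.5 T field. Only the momentum component transverse to the field is measured this way; the longitudinal component follows from the polar angle of the track.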
A disadvantage of tpcs for high rate experiments is the long drift time of the electrons
in combination with a possible additional dead time imposed by an ion capturing process.
Figure 2.1: Schematic view of the alice tpc, taken from [5].
The electrons which were generated close to the cathode must drift through the complete
volume to the end caps. Depending on the length and the used material, this can easily
take tens of µs. The ions which are created in the electron amplification process may
have to be captured to avoid an accumulation of charges in the drift volume. This can
e.g. be done with a gated readout, which introduces a non-negligible detector dead time of tens
to hundreds of µs due to the low ion mobility. Crucial for the operation of a tpc is the
control of the environmental conditions. It must be ensured that the drift velocity stays
constant to be able to reconstruct the z-position with a high resolution. If the pressure or
the temperature changes, this must be taken into account. Also the electric field must be
uniform over the full volume of the tpc.
2.2 The ALICE Time-Projection Chamber
A layout of the alice tpc is shown in figure 2.1. As already mentioned in subsection 1.2.2
the general structure is a hollow cylinder with an outer radius of about 250 cm and an inner
one of about 85 cm. The overall length in z-direction is 500 cm, evenly split into two drift
regions by the central electrode, each with a length of about 250 cm. The two endplates
are divided into 18 sectors, each equipped with an Inner Readout Chamber (iroc) and an
Outer Readout Chamber (oroc), indicated by the trapezoidal shapes well visible on the
right hand side of the figure.
2.2.1 The Field Cage
To provide a uniform electrostatic field in the drift volume, a field cage is used. In addition
the cage provides a stable mechanical structure for the individual detector elements, like
the rocs, forms a gas-tight envelope and ensures an electrical insulation from the other
detectors of the experiment. The insulation is also achieved by the CO2 filled gas gap
in-between the field cage and the containment vessel. The field cage consists of two
parts, the vessel and the actual field cage, consisting of 165 strips (in both drift volumes)
surrounding the drift volume on the inside and outside. The vessel has a set of coarsely
segmented guard rings which help to avoid the build-up of charges on the surface. They are
made out of 13 mm wide strips of aluminium tape on both sides of the vessel, spaced by
92 mm. The two corresponding rings are electrically connected through small drilled holes
with an aluminium foil feed-through, which are sealed with epoxy. The potential of the
individual rings is defined by a resistor chain, which is connected on one side to the central
electrode and on the other side to ground, to match the field gradient of the inner, finer
segmented field cage. The potential on the field cage strips is likewise defined by a separate
resistor chain. With this design it is possible to have the central electrode at a potential
of 100 kV to generate a drift field of 400 V/cm. This leads to a drift velocity of 2.65 cm/µs
for the electrons at nominal temperature and pressure and with that to a maximum drift
time of 94 µs [20].
Figure 2.2: Cross-section of a mwpc based roc, taken from [20].
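The field-cage numbers quoted above are mutually consistent, which a short calculation makes explicit. The sketch assumes an idealised uniform field over a drift length of 250 cm per side; the function names are illustrative.

```python
def drift_field_v_per_cm(voltage_v: float, drift_length_cm: float) -> float:
    """Uniform drift field E = U / L in V/cm."""
    return voltage_v / drift_length_cm


def max_drift_time_us(drift_length_cm: float, v_drift_cm_per_us: float) -> float:
    """Maximum drift time t = L / v_d in microseconds."""
    return drift_length_cm / v_drift_cm_per_us


if __name__ == "__main__":
    length_cm = 250.0    # drift length of one half of the tpc
    voltage_v = 100e3    # central electrode potential
    v_drift = 2.65       # drift velocity in cm/us
    print(f"E     = {drift_field_v_per_cm(voltage_v, length_cm):.0f} V/cm")
    print(f"t_max = {max_drift_time_us(length_cm, v_drift):.1f} us")
```

With these inputs the field comes out at 400 V/cm and the maximum drift time at about 94 µs, matching the values in the text.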
2.2.2 The MWPC Readout Chambers
The rocs are mounted at the end of the drift regions to amplify the incoming
electrons and read out the signals. Different technologies have been considered and tested
as amplification structures, such as Ring Cathode Chambers (rccs) or Gas Electron
Multipliers (gems) [29]. However, at the time only the Multi-Wire Proportional Chambers
(mwpcs) met the requirements of the alice tpc and were in an R&D state to be readily
adopted in such a large detector project. Therefore mwpcs were employed as the rocs [20].
The mwpcs of the tpc have a commonly used scheme of wire planes. In the cross-section
shown in figure 2.2 one can see (from the bottom up) the segmented cathode pad plane
for the readout, then the anode wire plane for the amplification and next the cathode
wire plane. On the very top there is another wire plane, the gating grid. This is used to
shield the amplification region from the drift volume. In the absence of a valid trigger
the gate can be closed to prevent further electrons from entering the amplification region and
also to prevent ions created in the avalanche process from drifting back into the drift volume. If
they were not stopped, they would accumulate in the tpc volume and cause severe
distortions of the drift field, leading to deflections of the drifting electrons and thus to a
degradation of the space-point resolution [20].
2.3 Readout Upgrade of the ALICE TPC
The necessity of the gating grid mentioned above imposes an intrinsic upper limit on the
readout rate of the tpc of ∼3.5 kHz. The gate must be opened for about 100 µs to collect
all electrons even from the maximum drift distance. Afterwards the gating grid must
remain closed for about 180 µs, due to the low ion mobility of $\mu_\mathrm{ion} = \mathcal{O}(10^{-3}) \cdot \mu_\mathrm{electron}$,
to stop all ions generated during the amplification process. It must be noted that the
data rate of the present readout system is an even more limiting factor: in central PbPb
collisions, a readout rate of only ∼300 Hz is possible [5]. Such a system cannot be used
in lhc Run 3, where alice plans to inspect all events at the foreseen interaction rate of
50 kHz in PbPb collisions. The whole readout system, i.e. the rocs and the subsequent
readout chain, must therefore be upgraded in order to overcome this limitation.
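The rate limit follows directly from the duration of one gating cycle. The sketch below reproduces the estimate from the open and closed times given in the text and, as an additional illustration not stated there, the mean number of interactions that would occur during one such cycle at 50 kHz.

```python
def max_gated_rate_khz(open_us: float, closed_us: float) -> float:
    """Maximum readout rate in kHz for one gate-open plus gate-closed cycle."""
    return 1e3 / (open_us + closed_us)


def interactions_per_cycle(rate_khz: float, open_us: float, closed_us: float) -> float:
    """Mean number of interactions during one full gating cycle."""
    return rate_khz * 1e3 * (open_us + closed_us) * 1e-6


if __name__ == "__main__":
    open_us, closed_us = 100.0, 180.0
    print(f"max. rate        = {max_gated_rate_khz(open_us, closed_us):.2f} kHz")
    print(f"events per cycle = {interactions_per_cycle(50.0, open_us, closed_us):.0f}")
```

One cycle of about 280 µs gives a limit of roughly 3.6 kHz, in line with the quoted ∼3.5 kHz. At 50 kHz about 14 interactions would fall into a single cycle on average, which shows why a gated readout cannot record all events.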
2.3.1 Upgrade Goals
The goal of the tpc upgrade program is to employ a continuous readout system to overcome
the intrinsic rate limitation. At the same time the upgraded tpc must preserve its current
performance in momentum and dE/dx resolution. This means that the distortions must stay at
a tolerable level, so that they can be corrected to a few hundred µm, which is the intrinsic
spatial track resolution of the alice tpc [5]. The local energy resolution must satisfy
$\sigma_E / E < 12\,\%$ for an energy deposition of 5.9 keV, evaluated with a radioactive
$^{55}$Fe source [5].
To keep the distortions small enough, the Ion Back-Flow (ibf), which is defined as
\[ \mathrm{IBF} = \frac{1 + \varepsilon}{G_\mathrm{eff}}, \tag{2.1} \]
must be less than 1 %, where $G_\mathrm{eff}$ is the effective gas gain of 2 000 and $\varepsilon$ is the number of
ions drifting back into the drift region per incoming electron [5].
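Equation (2.1) can be inverted to see what the 1 % requirement means in terms of back-drifting ions. The following sketch only evaluates the definition; the function names are illustrative.

```python
def ion_backflow(epsilon: float, gain_eff: float) -> float:
    """Ion back-flow IBF = (1 + epsilon) / G_eff as a fraction."""
    return (1.0 + epsilon) / gain_eff


def max_epsilon(ibf_limit: float, gain_eff: float) -> float:
    """Largest epsilon compatible with a given IBF limit: epsilon = IBF * G_eff - 1."""
    return ibf_limit * gain_eff - 1.0


if __name__ == "__main__":
    gain = 2000.0
    print(f"epsilon limit for IBF < 1 %: {max_epsilon(0.01, gain):.0f}")
    print(f"IBF for epsilon = 10:        {100 * ion_backflow(10.0, gain):.2f} %")
```

At an effective gain of 2 000, an ibf below 1 % means that fewer than 19 of the ions produced in the amplification stage per incoming electron may drift back into the drift volume.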
2.3.2 GEM based Readout Chambers
The goals can be achieved by using gems for the gas amplification. The new rocs will be
built with a quadruple gem stack (four gem foils on top of each other) for the amplification
and ion blocking, and with an anode pad readout. The working principle and the intrinsic
ion blocking capabilities of a gem foil are most easily explained with a picture like the one shown
in figure 2.3. A gem consists of a thin insulating polymer foil which is metal coated on
both surfaces. A regular pattern of small holes is etched through the metal layers and the
polymer in a conical shape so that the wider radius is on the entry sides. By applying
a potential difference between the metal surfaces, an electric field develops in the holes.
Even a moderate potential difference of 200 V generates high electric fields of 40 kV/cm
in the gaps [31].
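The order of magnitude of the field can be checked with a parallel-plate estimate E ≈ ΔU / d. Note the assumption: the foil thickness d is not given in the text; the sketch uses 50 µm, the thickness of a standard gem polymer foil. The real field inside a hole is not uniform, so this is only a rough cross-check.

```python
def hole_field_kv_per_cm(delta_u_v: float, thickness_um: float = 50.0) -> float:
    """Parallel-plate estimate E = dU / d, returned in kV/cm.

    thickness_um defaults to 50 um, an assumed standard foil thickness.
    """
    thickness_cm = thickness_um * 1e-4
    return delta_u_v / thickness_cm / 1e3


if __name__ == "__main__":
    for delta_u in (200.0, 300.0, 400.0):
        print(f"dU = {delta_u:5.0f} V  ->  E ~ {hole_field_kv_per_cm(delta_u):5.1f} kV/cm")
```

For 200 V across 50 µm the estimate gives 40 kV/cm, consistent with the value quoted from [31].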
Figure 2.3: Simulation of the gas amplification of two electrons in a gem hole with
Garfield/Magboltz. The electron (yellow) and ion (red) paths are projected to the cross-
section plane. Each green marked spot indicates an ionisation process, taken from [30].
Figure 2.4: Energy resolution and ibf for different gem voltage settings, taken from [5].
Figure 2.3 shows a simulation of an amplification process of two incoming electrons in a
gem hole. All the paths are projected to the cross-section plane. Two electrons (the yellow
lines) enter the hole and start an avalanche due to the high electric field. Each green dot
marks a position where an ionisation took place. Finally, a large number of electrons leave
the hole on the lower side after the amplification region, of which some, following the field
lines, end on the bottom side of the foil. The ions (red lines), which are also created at
the ionisation spots, emerge from the hole on the upper side, following the field lines as well.
They mostly end on the top side of the foil because the field above the gem is much lower
than the field inside the hole. The extraction of the electrons can be supported by using
a higher transfer field below the gem [30]. By stacking multiple layers of gem foils, high
total gains can be achieved with only moderate gains at each individual foil.
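The benefit of stacking can be made concrete with a simplified model in which the total effective gain is the product of the effective gains of the individual foils (each taken to include its own collection and extraction losses). This is an assumption for illustration; a real stack need not share the gain equally between the foils.

```python
import math


def total_gain(stage_gains) -> float:
    """Total effective gain as the product of the per-foil effective gains."""
    return math.prod(stage_gains)


def equal_stage_gain(total: float, n_stages: int) -> float:
    """Per-foil gain needed if all foils contribute equally."""
    return total ** (1.0 / n_stages)


if __name__ == "__main__":
    target = 2000.0
    for n in (1, 2, 3, 4):
        print(f"{n} foil(s): gain per foil = {equal_stage_gain(target, n):8.1f}")
```

For a total gain of 2 000, four equally contributing foils would each need an effective gain of only about 6.7, compared with 2 000 for a single foil.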
The necessary local energy resolution can be reached with a stack of four gems, while
utilising the intrinsic ion capturing mechanism to achieve an ibf of less than 1 %, as
shown in figure 2.4. The achieved resolution with the corresponding ibf is plotted for
different voltage settings. The voltage difference on gem 2 is given directly (gem 1 is the
uppermost foil, facing the drift volume, while gem 4 is the last one before the pad plane);
the voltages on gem 3 and gem 4 were adjusted for a gain of 2 000 while keeping their ratio
constant. The voltage on gem 1 increases from left to right from 225 V to 315 V. As can
be seen, there is some parameter space available which meets the requirements [5].
2.3.3 Upgrade of the Front-End Electronics
The new chambers result in three major changes in the requirements for the fee and the
readout systems. First, the polarity of the gem detector signal is opposite to the one of
the mwpc. Second, the continuous readout scheme makes it necessary to develop new
electronics for a concurrent data sampling and data transmission. Third, the increased
interaction rate together with the continuous readout leads to a strongly increased data
rate. This makes a completely new readout chain necessary, which is described in detail in
chapter 3.
2.3.4 Towards an Online Track Reconstruction
The average data size of one minimum bias PbPb event of the upgraded tpc is expected
to be ∼20 MB after Zero Suppression (zs), which by itself reduces the volume
by a factor of about 3 [5]. At an interaction rate of 50 kHz, this would lead to a
data rate of 1 TB/s into the online data system. Integrated over the full Run 3 of
the lhc, an amount of 3 EB (1 EB = $10^{18}$ B) would be collected. Since those numbers exceed both
the predicted available bandwidth and the storage space, additional compression
methods on top of the zs must be applied directly during the readout to reduce the data
volume to less than 1 MB per interaction [5].
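The numbers in this paragraph can be cross-checked with simple arithmetic. The sketch below uses decimal prefixes (1 MB = 10^6 B). The effective data-taking time it derives is only implied by the quoted totals and is not a figure stated in the text.

```python
def data_rate_tb_per_s(event_size_mb: float, interaction_rate_khz: float) -> float:
    """Data rate in TB/s for a given event size and interaction rate."""
    bytes_per_s = event_size_mb * 1e6 * interaction_rate_khz * 1e3
    return bytes_per_s / 1e12


def required_compression(event_size_mb: float, target_size_mb: float) -> float:
    """Compression factor needed to shrink an event to the target size."""
    return event_size_mb / target_size_mb


def implied_live_time_s(total_eb: float, rate_tb_per_s: float) -> float:
    """Data-taking time implied by a total volume and a constant data rate."""
    return total_eb * 1e18 / (rate_tb_per_s * 1e12)


if __name__ == "__main__":
    rate = data_rate_tb_per_s(20.0, 50.0)
    print(f"input rate           = {rate:.1f} TB/s")
    print(f"required compression = {required_compression(20.0, 1.0):.0f}")
    print(f"implied live time    = {implied_live_time_s(3.0, rate):.1e} s")
```

An event size of 20 MB at 50 kHz gives 1 TB/s, and reaching 1 MB per interaction requires a compression factor of 20.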
As a first processing step, a Cluster Finder (cf) is foreseen (the development of the
cf is the main topic of this thesis), which is supposed to run directly in the online farm
on fpgas. After the cluster finding, a Huffman compression can be applied to selected
parameters. Such a compression scheme was already used successfully during the PbPb data
taking of lhc Run 1 in 2011, leading to a compression factor of about 4 [5]. With
further optimisations of the format of the cluster data in the software framework and of the
compression algorithm, a compression factor of five to seven is expected [3].
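Huffman coding assigns short bit patterns to frequent values and longer ones to rare values, which pays off when a cluster parameter has a strongly peaked distribution. The sketch below is a generic textbook construction, not the alice implementation; the toy parameter values are invented for illustration.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix-free code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie breaker, partial code table)
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)  # two least frequent subtrees
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encoded_bits(symbols, code) -> int:
    """Total number of bits needed to encode the sequence with the given code."""
    return sum(len(code[s]) for s in symbols)


if __name__ == "__main__":
    # Invented toy distribution of a cluster parameter, peaked at small values
    values = [3] * 8 + [4] * 4 + [5] * 2 + [6, 7]
    code = huffman_code(values)
    print("code table:", code)
    print("Huffman bits:", encoded_bits(values, code), "vs fixed 3-bit:", 3 * len(values))
```

For this toy input the Huffman code needs 30 bits instead of the 48 bits of a fixed 3-bit encoding. The gain depends entirely on how peaked the distribution is, which is why only selected parameters are candidates for this step.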
In order to further reduce the data volume, online track reconstruction is used, whereupon
the clusters are assigned to the found particle tracks. Thus clusters that do not belong to
physically relevant tracks (e.g. clusters from noise or delta electrons) can be removed. Based
on experience from the Run 1 data taking, a compression factor of two is expected. In
addition, the parameter distributions can be optimised for entropy encoding and eventually
some cluster parameters can be replaced by track-based properties. With that, an additional
compression factor of two to three is expected [5].
This gives a total expected compression of the order of 20, which reduces the average
event size to less than 1 MB. The data rate to the permanent storage is thereby reduced
to ∼50 GB/s, and a total of only around 150 PB is needed for the complete Run 3 data
[5]. The online cf is the very first step in this chain.
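The numbers quoted in this subsection can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch: the variable names are illustrative, and the assumption that the individual compression stages simply multiply is a simplification made here, not a statement from [5].

```python
# Sketch of the data-volume arithmetic; all inputs are the numbers quoted in the text.
EVENT_SIZE_B = 20 * 10**6        # ~20 MB per minimum bias PbPb event after zs
INTERACTION_RATE_HZ = 50_000     # 50 kHz PbPb interaction rate
RUN3_RAW_VOLUME_B = 3 * 10**18   # ~3 EB integrated over Run 3
TOTAL_COMPRESSION = 20           # "of the order of 20"

input_rate = EVENT_SIZE_B * INTERACTION_RATE_HZ        # bytes per second
storage_rate = input_rate // TOTAL_COMPRESSION         # bytes per second to storage
stored_volume = RUN3_RAW_VOLUME_B // TOTAL_COMPRESSION

# Lower bounds of the individual stages (assumption: the factors simply multiply).
stage_factors = {
    "cluster finder + Huffman": 5,
    "removal of clusters not on tracks": 2,
    "track-based encoding": 2,
}
combined = 1
for factor in stage_factors.values():
    combined *= factor

print(f"input rate:     {input_rate / 1e12:.1f} TB/s")
print(f"storage rate:   {storage_rate / 1e9:.0f} GB/s")
print(f"stored volume:  {stored_volume / 1e15:.0f} PB")
print(f"combined lower bound of the stage factors: {combined}")
```

With the lower bounds of the quoted ranges, the product of the stage factors reproduces the overall factor of about 20.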
2.4 Timeline of the TPC Upgrade
The upgrade program and all the individual tasks which are needed to successfully assemble
the upgraded tpc are described in [5]. However, there have been some adaptations and delays
since then. The main milestone that must be kept is the Long Shutdown (ls) 2, which started in
December 2018. During this shutdown the tpc must be taken out of the L3 magnet and
brought to the surface into a clean room, in which the replacement of the rocs takes place.
The program had already started six years earlier, at the beginning of 2013, with the R&D
phase for the rocs, followed closely by the beginning of the design phase of the fee in
mid 2013. The production of the first prototypes could then take place in 2015 until one of
the almost final pre-production irocs was tested with the first Front-End Cards (fecs) in
a test beam campaign in May 2017. This was also used to test the first Firmware (fw)
components, developed in the scope of this thesis, in a real readout system. During the
rework of the tpc in the cleanroom between March 2019 and February 2020 [4] there are
many individual tasks, described in detail in [32]. It is expected that the first side of the
tpc will be equipped with the new rocs and the new electronics in October 2019, so that
the system can then be tested. This will be the first time that realistic signals can be read
out by the new readout system. It will be the first step in the commissioning of the fw
modules developed in the scope of this thesis. After the reinstallation of the tpc in the
experiment which will be finished in May 2020 [4], further validations can be done to be
ready for the restart of the lhc in spring 2021.
Chapter 3
Readout Strategy for the ALICE TPC
The readout scheme for the alice tpc in Run 3 was originally described in the Technical
Design Report (tdr) for the Upgrade of the alice tpc [5] and in the tdr for the Upgrade
of the Readout & Trigger System [26]. The basic idea is that signal pulses induced on the
individual pads of the pad plane of the Readout Chambers (rocs) are amplified, shaped
and digitised continuously by the newly developed sampa asic [26]. Those are placed
on Front-End Cards (fecs), which are located close to the detector. The continuously
sampled data is then transferred via two GigaBit Transceiver (gbt) asics and the versatile
optical link components to the Common Readout Unit (cru). Further processing steps,
especially the clusterisation which is the main topic of this thesis, are then applied in the cru.
The crus are located off-detector, outside of the radiation area, in Counting Room (cr) 1,
see section 3.4. Two crus are always hosted by one First Level Processor (flp) server.
This readout chain is schematically shown in figure 3.1. Two major design changes with
respect to the original concept of the tdrs were implemented. It was observed that due
to the so called Common Mode (cm) effect, no Zero Suppression (zs) can be applied on
the fec, which is why the sampa must be operated in Direct ADC Serialisation (das)
mode to read out unmodified raw data. To be able to transmit this data volume, the
adc sampling frequency had to be reduced from 10 to 5 MHz [33]. This chapter gives a
short introduction to the original strategy and then presents the cm effect. Afterwards, a
study of the possible application of the loss-less Huffman compression is presented, which
was also part of this thesis and which aimed at reducing the data volume without reducing
the sampling frequency. Then the updated readout scheme is introduced and finally the main
components of the readout chain, the sampa, the gbtx and the cru, are described with
the focus on the tpc readout for Run 3.
3.1 The TDR Baseline
The general parameters of the digitisation process, such as the sampling rate and the adc
precision were taken from the currently installed system which was running very successfully
during Run 1 and 2 of the lhc. The individual pads of the tpc will be connected to an
asic, called the sampa, which includes a Charge Sensitive Amplifier (csa), a shaper and
a 10 bit adc, sampling with a frequency of 10 MHz. Each fec contains five sampas with
32 channels each. This gives in total 160 channels per fec. The five sampas will therefore
generate with those parameters a combined continuous data rate of
5 × 32 × 10 MHz × 10 bit = 16 Gbit/s. (3.1)
Figure 3.1: The main components of the tpc readout chain. The fec on the left hosts
five sampa asics which are connected to the two gbtx. The sc, implemented through
the gbt-sca, is indicated with dotted arrow lines. The fec is read out by two crus,
hosted by a flp server. Each cru receives the data from multiple, up to 20, fecs.
Such a fec is schematically shown on the left side of figure 3.1. The sampas on the
left are connected to the two gbtx asics which are responsible for the communication
with the outside world. With 91 fecs installed in each of the 36 sectors of the tpc, a
total of 3 276 fecs will be needed in the final system [34]. The 524 160 individual
readout channels will generate an overall data rate of
3 276 × 16 Gbit/s = 52.416 Tbit/s = 6.552 TB/s (3.2)
for the whole tpc. These numbers have changed slightly with respect to the TDR due to
adaptations of the pad plane layout. To reduce the overall data volume, it was foreseen to
apply a zs to the digitised signals directly within the sampa chip, which contains a Digital
Signal Processor (dsp) that also serves this purpose. In a zs readout, all values
below a certain threshold are dropped. The general assumption for a zs is that a dropped
value is too small to contain any useful information and can therefore be omitted. The
remaining signals are then run-length encoded, which requires additional information
about the start and the length of each sequence that is not zero suppressed [5]. The zs algorithm
can also be more complex and can, for example, require several consecutive time bins above
the threshold or additionally store some pre- and post-samples around the bin of interest.
Applying a fixed threshold on each channel makes it necessary to first restore the baseline.
In a real system, the baseline can be different for each channel, which requires a
channel-wise pedestal subtraction. Further, with the Gas Electron Multiplier (gem) tpc, a large
contribution of the so-called cm effect is expected, which needs to be corrected first. The
cm effect is described in more detail in subsection 3.1.1.
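The basic zs scheme described above (pedestal subtraction, fixed threshold, run-length encoding with start and length) can be sketched as follows. This is only an illustration under simple assumptions: the function name, the data layout of the runs and the example values are hypothetical and do not represent the actual sampa dsp implementation.

```python
def zero_suppress(samples, pedestal, threshold):
    """Return a list of (start_time_bin, length, values) for one channel.

    Illustrative sketch: the pedestal is subtracted first, values below the
    threshold are dropped, and each remaining sequence is stored together
    with its start and length (run-length encoding).
    """
    runs = []
    values = None  # values of the run that is currently being collected
    for time_bin, adc in enumerate(samples):
        amplitude = adc - pedestal
        if amplitude >= threshold:
            if values is None:  # a new run starts here
                values = []
                runs.append((time_bin, values))
            values.append(amplitude)
        else:
            values = None  # the run (if any) has ended
    return [(start, len(vals), vals) for start, vals in runs]


# Hypothetical channel with a pedestal of 50 adc counts and two small pulses.
channel = [50, 51, 49, 60, 75, 58, 50, 50, 53, 50]
print(zero_suppress(channel, pedestal=50, threshold=3))
```

A baseline that is shifted, for example by the cm effect, changes which samples pass the fixed threshold, which is why the baseline has to be restored before such a threshold is applied.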
As an alternative compression method (or one applied in addition), the effect of loss-less
transformations like variable length codes (e.g. Huffman coding) was studied (see
subsection 3.1.2). If the probability distribution of the individual adc values coming from the
detector is not uniform, it may make sense to assign short codes to values which occur
more often and long codes to values which occur rarely. Huffman coding approaches the
theoretical lower limit for the average word size which can be achieved by this concept
(also called the entropy of the data source).
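As a brief illustration of this lower limit, the entropy of a discrete distribution can be computed directly. The two distributions below are made-up examples and not measured tpc spectra.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bit per word for a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# A uniform distribution over all 1024 values of a 10 bit adc cannot be compressed ...
uniform = [1 / 1024] * 1024
# ... while a strongly peaked (hypothetical) distribution needs far fewer bits on average.
peaked = [0.5, 0.25, 0.125, 0.125]

print(f"uniform: {entropy_bits(uniform):.2f} bit per word")
print(f"peaked:  {entropy_bits(peaked):.2f} bit per word")
```

The narrower the distribution, the smaller the entropy and therefore the larger the achievable loss-less compression factor.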
To collect the data for further processing and storage, it is transmitted via two gbtx
asics and the Versatile Link components (a Versatile Transceiver, vtrx, to implement one
bidirectional link and a Versatile Twin Transmitter, vttx, for a second up-link) from the
fecs to the crus, which are located off-detector in the cr. Each cru receives the data
of up to 20 half-fecs, so up to 1600 pads are combined in one cru. The mapping of the
individual pads via the sampas and the fec towards the crus is explained in more detail
in appendix B. It is important to note at this point that the data of a single fec is split and
sent evenly distributed to two independent crus, as indicated in figure 3.1.
The data of sampa 0 and 1 is sent together with half of the channels of sampa 2 to one
cru, while the remaining channels of sampa 2 are sent together with the data of sampa 3
and 4 to another cru. With this scheme, complete pad rows can always be recovered in a
cru, which is very important for the cluster finding later on.
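The splitting of one fec into two halves can be illustrated with a small sketch. Which half of the channels of sampa 2 goes to which cru is not specified in this section; the choice made below (lower 16 channels together with sampa 0 and 1) is purely an assumption for illustration, as is the function name. The actual mapping is given in appendix B.

```python
CHANNELS_PER_SAMPA = 32
SAMPAS_PER_FEC = 5


def fec_half(sampa, channel):
    """Return 0 or 1, the half of the fec (and thereby the cru) a channel belongs to.

    Assumption for illustration only: the lower half of the channels of
    sampa 2 is grouped with sampa 0 and 1, the upper half with sampa 3 and 4.
    """
    if not (0 <= sampa < SAMPAS_PER_FEC and 0 <= channel < CHANNELS_PER_SAMPA):
        raise ValueError("sampa or channel index out of range")
    if sampa < 2:
        return 0
    if sampa > 2:
        return 1
    return 0 if channel < CHANNELS_PER_SAMPA // 2 else 1


channels_in_half_0 = sum(
    fec_half(sampa, channel) == 0
    for sampa in range(SAMPAS_PER_FEC)
    for channel in range(CHANNELS_PER_SAMPA)
)
print(channels_in_half_0)        # channels per half-fec
print(20 * channels_in_half_0)   # pads per cru with 20 half-fecs
```

Each half-fec carries 80 channels, so 20 half-fecs give the 1600 pads per cru quoted above.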
The gbtx can transmit either 3.2 Gbit/s with an automatic error correction enabled,
or 4.48 Gbit/s without (for more details see section 3.3). This implies that, to be able to
transmit the 16 Gbit/s of each fec, a compression factor of
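\[
\frac{16\,\mathrm{Gbit/s}}{2 \times 3.2\,\mathrm{Gbit/s}} = 2.5
\quad\text{or}\quad
\frac{16\,\mathrm{Gbit/s}}{2 \times 4.48\,\mathrm{Gbit/s}} \approx 1.8
\tag{3.3}
\]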
must be achieved, either with a zs, with other compression methods or with a combination of
both. Especially for a Huffman encoded data stream it is important to use such error
correction methods; otherwise whole chunks of data can easily become corrupt due to
a single bit-flip in the transmission. To avoid this, only the gbt mode with error correction,
and therefore the lower bandwidth, should be used for a Huffman compressed readout scheme.
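The arithmetic behind equations (3.1) to (3.3) can be reproduced with a short sketch. The function and constant names are illustrative. The last line evaluates the same ratio for a sampling frequency of 5 MHz; a value below one only indicates that the raw data would then fit into the two links without error correction, which is an observation made here and not a statement taken from the tdrs.

```python
SAMPAS_PER_FEC = 5
CHANNELS_PER_SAMPA = 32
ADC_BITS = 10
GBTX_PER_FEC = 2
FECS_TOTAL = 91 * 36  # 91 fecs in each of the 36 sectors


def fec_rate_gbit(sampling_mhz):
    """Raw data rate of one fec in Gbit/s, as in equation (3.1)."""
    return SAMPAS_PER_FEC * CHANNELS_PER_SAMPA * sampling_mhz * ADC_BITS / 1000


def required_compression(sampling_mhz, link_gbit):
    """Compression factor needed to fit one fec into its two gbtx links, as in (3.3)."""
    return fec_rate_gbit(sampling_mhz) / (GBTX_PER_FEC * link_gbit)


print(f"fec rate at 10 MHz: {fec_rate_gbit(10):.0f} Gbit/s")
print(f"whole tpc at 10 MHz: {FECS_TOTAL * fec_rate_gbit(10) / 8000:.3f} TB/s")
for label, link in (("with error correction", 3.2), ("without error correction", 4.48)):
    print(f"10 MHz, {label}: factor {required_compression(10, link):.2f}")
print(f"5 MHz, without error correction: factor {required_compression(5, 4.48):.2f}")
```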
3.1.1 The Common Mode Effect
It is well known that in rocs like those used for the tpc, the readout pads are coupled
capacitively to the amplification structure. This coupling is independent of whether these
structures are wires, as in the current tpc, or gems, as in the upgraded tpc, and leads
to the cm effect. An avalanche causes a current to flow, which introduces a
voltage drop on the electrode. Due to this voltage drop, a correlated signal with opposite
polarity is induced on all pads. This is shown schematically in figure 3.2. The pads can be
seen as a set of capacitors connected in parallel. When a signal occurs in one pad (red),
all other pads will see an inverted fraction of the signal (blue), where the amplitude of the
cm signal on each pad corresponds to the original signal divided by the total number of pads
facing the same gem electrode (assuming all pads have the same capacitance). For multiple
avalanches occurring at the same time, those cm signals pile up. In a high-multiplicity
environment this piled-up signal leads to a reduction of the baseline and an effective noise
contribution. By adding additional capacitance to the gem one could damp the cm effect.
However, this would counteract the efforts made to minimise the stored energy in order
to improve the stability and safety of the system.
Figure 3.2: Schematic drawing of the origin of the cm effect due to capacitive coupling
of the individual pads of the pad plane to the bottom side of the lowermost gem. A signal
in one pad (red) will induce a correlated signal with
opposite polarity on all other pads (blue) which face the same gem electrode.
Different algorithms which could be applied directly in the sampa dsp have been studied
and were presented in [33], with the goal of restoring the baseline on a single-pad level. It
turned out that those filters could only partially restore the separation
power between electrons and pions of the tpc. A sizeable degradation of roughly 20 % still
remained. The document also states that a better restoration of the baseline could be
achieved by taking the information of a large number of pads into account. The cm signal
should be the same for all pads below the same gem electrode. By averaging over a large
number of pads, excluding the signal peaks, the baseline can be restored sufficiently. The
result of such an averaging approach with three different signal-exclusion mechanisms is
shown in figure 3.3. On the left side the remaining bias of the baseline is displayed, the
effective noise contribution of the cm effect on the right side. The blue points display the
performance with a simple constant threshold of 2.5 adc counts above which the adc
values are assumed to be signals and therefore are excluded. The performance shown by the
magenta and black points is achieved by using a real peak finding algorithm where the peak
is either completely taken out in case of the black points or the last time bin before the peak
is still used for the calculation of the average charge. The black symbols show
that by averaging over more than about 100 pads (without a signal) the baseline can
be restored with a negligible bias and noise contribution. However, each sampa has only
the information of its 32 channels available. This is the reason why this correction method
cannot be applied in the sampa dsp. Since signals from up to 1600 pads are received by
a single cru, this filter can be placed there to restore the anticipated performance of the
tpc. This, however, requires the readout of adc values that are not zero suppressed [33].
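The averaging approach with a constant threshold (the blue points in figure 3.3) can be sketched for a single time bin as follows. This is a simplified illustration and not the filter from [33]: the function name is hypothetical, noise is ignored, and the cm shift is modelled as one common value on all pads, including those with a signal.

```python
def common_mode_correct(adc_row, threshold=2.5):
    """Subtract the cm baseline estimated from all pads without a signal.

    adc_row holds the pedestal-subtracted adc values of all pads facing the
    same gem electrode for one time bin. Values above the threshold are
    treated as signals and excluded from the average.
    """
    baseline_pads = [value for value in adc_row if value <= threshold]
    if not baseline_pads:  # no pad without a signal, nothing can be estimated
        return list(adc_row)
    cm_estimate = sum(baseline_pads) / len(baseline_pads)
    return [value - cm_estimate for value in adc_row]


# Hypothetical example: 200 pads, four of them with a signal of 100 adc counts.
n_pads, n_signals, amplitude = 200, 4, 100.0
cm_shift = -n_signals * amplitude / n_pads  # each signal contributes -amplitude / n_pads
row = [amplitude + cm_shift] * n_signals + [cm_shift] * (n_pads - n_signals)

corrected = common_mode_correct(row)
print(cm_shift, corrected[0], corrected[-1])
```

In this idealised case the baseline is restored exactly. With noise, the quality of the estimate depends on the number of pads entering the average, which is the dependence shown in figure 3.3.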
Figure 3.3: Remaining bias of the adc value (left) and additional noise from the cm
effect (right) after correcting with the averaging method. The result is shown as a function
of the number of pads which were taken into account. The three different colours show the
three different peak exclusion algorithms (fixed threshold in blue, complete peak removal
in black and in magenta where the first time bin of the peak was still included in the
averaging), taken from [33].
3.1.2 Application of a loss-less Data Compression
To reduce the data volume without the need of a zs, the loss-less Huffman compression
method was proposed and studied. It is a well-known technique, proposed by David
Huffman in 1951 [35], that can easily be implemented in hardware and gives the optimal
loss-less compression for a given set of probabilities. The idea behind the Huffman encoding
is that the words which need to be compressed are replaced by codes of a variable length,
where the length depends on the probability of their occurrence. Words that occur more
frequently get shorter codes, while words that are rare get longer ones. By sending a
stream of concatenated Huffman codes, the total volume of the data can be reduced. The
codes are prefix codes, meaning that no code is the prefix of another one. Because of this,
the stream can be decoded by going through it and detecting the individual codes. This is
where the already mentioned possibility of a bit-flip during the transmission becomes relevant. If
one bit is changed, the detected code will be a wrong one but, even worse, the code can be
detected as one with a different length. In that case, the subsequent codes will be decoded
wrongly as well. Therefore it must be ensured that no bit-flip can happen, or that it can be
detected and corrected beforehand. Those codes can also be represented as a binary tree
where the symbols correspond to the leaves of the tree [36]. The code can then be obtained
by traversing the tree from the root node to the leaf node of the desired symbol, adding
the character 0 to the binary codeword whenever the top branch is taken or the character 1
for the bottom branch. This is indicated in figure 3.4. Starting from the root node on the
left, the codewords c0 to c4 can be obtained by following the branches and appending
the corresponding symbol to the code word. With a known probability distribution of the
expected signals, those Huffman codes can be precalculated and stored in a Lookup-Table
(lut) on the Front-End Electronics (fee).
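The construction of such a code table and the effect of a single bit-flip can be illustrated with a compact sketch. It uses the textbook construction with a priority queue; the probabilities are made up, the assignment of 0 and 1 to the two branches is arbitrary, and none of the names below are taken from the fw developed in this thesis.

```python
import heapq
from itertools import count


def huffman_codes(probabilities):
    """Return a dict symbol -> codeword (string of '0' and '1')."""
    tie = count()  # unique tie-breaker so that the dicts are never compared
    heap = [(p, next(tie), {symbol: ""}) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)  # the two least probable subtrees
        p1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]


def encode(symbols, codes):
    return "".join(codes[s] for s in symbols)


def decode(bits, codes):
    """Walk through the bit stream and emit a symbol whenever a codeword is complete."""
    inverse = {code: symbol for symbol, code in codes.items()}
    symbols, word = [], ""
    for bit in bits:
        word += bit
        if word in inverse:
            symbols.append(inverse[word])
            word = ""
    return symbols


# Hypothetical probabilities for five words c0 to c4.
codes = huffman_codes({"c0": 0.5, "c1": 0.25, "c2": 0.125, "c3": 0.0625, "c4": 0.0625})
message = ["c0", "c2", "c1", "c0", "c4"]
bits = encode(message, codes)

# Flip a single bit of the stream and decode again.
flipped = bits[:1] + ("1" if bits[1] == "0" else "0") + bits[2:]
print(codes)
print(decode(bits, codes))
print(decode(flipped, codes))
```

In this small example the single flipped bit changes not only one word but also the number of decoded words, because the boundaries between the codes are shifted.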
Figure 3.4: Example of a small Huffman tree with only five codes (c0 to c4).
Figure 3.5: The probability spectrum of the adc values of the alice tpc for PbPb
collisions at √sNN = 2.76 TeV, shown as probability versus adc signal for (a) the raw tpc
signals (mean 42.24, standard deviation 16.64) and (b) the differential tpc signals (mean
1 024, standard deviation 7.119), with 5.47 × 10^9 entries each. The left side shows the
probability of the pure detector output. The right side shows the probability distribution
for the differential signal,
when the difference of two consecutive signals of the same pad is transmitted, shifted by
1 024 to avoid negative numbers.
For the study of whether the Huffman encoding is applicable, the probability distributions
were generated from the so-called black events which were taken with the tpc during
the PbPb run in the year 2010. Within a short period of a few minutes, around 1 000
events were recorded without a zs applied. This raw data was then used to overlay multiple
collisions inside a timeframe to emulate the occupancy levels expected for Run 3. The
resulting probability spectrum of the tpc raw signals can be seen in figure 3.5a. With
the Huffman coding, an average code length of 3.21 bit was achieved, giving an average
compression factor of 10 bit/3.21 bit = 3.12. Since the spectrum is rather
wide, leading to a broad Huffman tree, the compression ratio can still be improved.
By encoding the differential signal, where the difference of two consecutive time
bins of the same channel is used instead of the raw adc value, a narrower distribution
can be achieved. Such a distribution is shown in figure 3.5b; it results in an average
code length of only 3.01 bit and therefore an improved average compression factor of 3.32.
For the second spectrum, a fixed offset of 1 024 was added to avoid the need of handling
negative numbers in the fee. In the ideal case, both compression factors would fulfil the
requirement of > 2.5 to be able to transmit all the data from a fec. The very narrow
distribution becomes even more relevant when going from the standard Huffman, where the
code words can have an arbitrary length, to the truncated Huffman, where the maximum
code length is limited. Since the encoder has to be implemented in the frontend chip and
the applied Huffman table is stored there in a lut for easy use, the maximum length is
limited to 12 bit. The truncation is usually done by generating a standard Huffman
tree and using only the sufficiently short codes. All other words, which cannot be encoded,
are transmitted as raw values with a special Huffman code prepended as a marker. After the
detection of this special code, the decoder recognises the following bits as a raw word. This
treatment increases the data volume, since in such a case not only the original data needs
to be transmitted, but also the special code beforehand. The compression of the other
words must therefore be efficient enough to compensate for this overhead, which is
only the case if those raw data words need to be transmitted very rarely or, to
put it another way, if the probability distribution is very narrow. In this case, with the
truncation to 12 bit, only a marginally worse average code length of 3.1 bit was observed,
giving an average compression factor of 3.23.
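The two ingredients discussed in this paragraph, the differential signal with a fixed offset and the marker for words without a code, can be sketched as follows. The code table, the marker and the raw word width of 11 bit are hypothetical values chosen for illustration, as is the convention that the first sample is referenced to zero.

```python
OFFSET = 1024  # fixed offset to avoid negative numbers


def to_differential(samples):
    """Replace each adc value by the difference to the previous one, plus OFFSET."""
    words, previous = [], 0  # assumption: the first sample is referenced to 0
    for adc in samples:
        words.append(adc - previous + OFFSET)
        previous = adc
    return words


def from_differential(words):
    """Inverse of to_differential."""
    samples, previous = [], 0
    for word in words:
        previous = previous + word - OFFSET
        samples.append(previous)
    return samples


def encode_word(word, table, marker, raw_bits=11):
    """Use the code from the table if available, else the marker followed by the raw word."""
    if word in table:
        return table[word]
    return marker + format(word, f"0{raw_bits}b")


samples = [42, 43, 41, 120, 300, 95, 44]
words = to_differential(samples)

# Hypothetical prefix-free table for the most frequent differences and a marker code.
table = {1024: "0", 1025: "10", 1023: "110"}
marker = "111"
stream = [encode_word(word, table, marker) for word in words]
print(words)
print(stream)
```

A word that is not in the table costs the length of the marker plus the full raw width, which is why such words must be rare for the scheme to pay off.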
A different way to achieve this truncation is to generate the so-called length-limited
Huffman codes directly. Here, an additional constraint is added to the generation procedure
of the code words: the requirement that the length of all code words has to be less than or equal
to a given maximum length. A widely used algorithm for the construction of the code
table is the Package-Merge algorithm from Larmore and Hirschberg [37]. As written in the
paper, the algorithm actually solves the Coin Collector’s problem, but it is also shown
there that the length-limited Huffman coding problem can be reduced to an instance of
the Coin Collector’s problem. Because of that, the Package-Merge algorithm can also
be applied to find an optimal Huffman table with this additional constraint. With this
algorithm the number of words to be encoded and the maximum length of the codes can
be fixed beforehand to generate the table accordingly. All words which cannot be encoded
are again transmitted as raw values with one of the codes prepended as a marker.
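The principle of the Package-Merge algorithm can be illustrated with a compact sketch that returns only the code lengths. It follows the simple textbook formulation and is not the implementation used for the studies in this thesis; the function name and the example weights are illustrative.

```python
def package_merge_lengths(weights, max_length):
    """Optimal code lengths for the given weights with all lengths <= max_length."""
    n = len(weights)
    if n < 2:
        raise ValueError("at least two words are needed")
    if n > 2 ** max_length:
        raise ValueError("too many words for this maximum code length")
    # Every item is (weight, list of word indices contained in the item).
    leaves = sorted(((w, [i]) for i, w in enumerate(weights)), key=lambda item: item[0])
    items = list(leaves)
    for _ in range(max_length - 1):
        # Package: combine neighbouring items pairwise (a leftover item is dropped).
        packages = [
            (items[k][0] + items[k + 1][0], items[k][1] + items[k + 1][1])
            for k in range(0, len(items) - 1, 2)
        ]
        # Merge: combine the packages with a fresh copy of the leaves.
        items = sorted(leaves + packages, key=lambda item: item[0])
    lengths = [0] * n
    # The code length of a word is the number of selected items containing it.
    for _, contained in items[: 2 * n - 2]:
        for index in contained:
            lengths[index] += 1
    return lengths


weights = [1, 1, 2, 4]
print(package_merge_lengths(weights, 3))  # lengths of the unrestricted Huffman code
print(package_merge_lengths(weights, 2))  # lengths with the limit of 2 bit
```

For a sufficiently large maximum length the result coincides with the lengths of the standard Huffman code; with a tighter limit the long codes are shortened at the expense of the short ones.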
The optimal parameters for the length-limited Huffman were selected with the help of two
criteria, the achieved compression factor and the required buffer size. Since the Huffman
codes have a variable length, a fifo (First In, First Out) must be placed after the encoder
to compensate for the different lengths. Whenever a code occurs which is longer than the
original 10 bit adc value, this fifo is filled. A scan of the resulting compression factors
for a variety of parameters is shown in figure 3.6, relative to the one obtained
from the 12 bit truncated Huffman. As can be seen, the compression factor decreases for
small code lengths and for an increasing number of words to be encoded. This is
intuitive, because if many words must be packed into a limited set of available codes,
the average number of used bits increases and therefore the average compression factor
decreases. For a comparable length of the code words (from 10 bit on), the length-limited
Huffman performs slightly better than the truncated Huffman. This is expected, since the
Package-Merge algorithm finds the optimal code table for a given parameter set, whereas
the table of the truncated Huffman is only optimal if all codes up to an arbitrary length
are taken into account. The compression factor peaks around 60 encoded words for
the 10 bit case. It still increases when both the number of encoded words and the
code length are increased, but not significantly. This means that by encoding only the 60 most probable
words out of the differential spectrum range of 0 to 2 047, the best compression factors can
already be achieved. This is again the reason why the spectrum needs to be so narrow. The
parameters of 60 words and 10 bit were then chosen, despite the slight increase with
a few more words and a longer code length, because of the required buffer size. The needed
size increases dramatically, by almost a factor of two, as soon as the maximum code length
is longer than the width of the input data of 10 bit. When codes are allowed to be longer
than this size, the most probable of them occur rather often and fill the
buffer. Therefore, the slight penalty in compression factor was accepted for the benefit of
a smaller required buffer size. With those settings, an average code length of 3.04 bit was
found when compressing the same data set again, giving an average compression factor of
3.29. This is again in the same range as for the truncated Huffman.
Figure 3.6: Compression factors achieved by the length-limited Huffman (CF_ll), relative to
the ones from the 12 bit truncated Huffman (CF_trunc). The ratios are shown as a function of the
number of encoded words and the maximum code length.
Then the stability of the performance of those Huffman algorithms (the truncated
Huffman and the length-limited Huffman with 60 words and 10 bit) was studied for a variety
of cases which are not unlikely to occur in the final detector. The cm effect was applied,
the noise level was artificially increased by a factor of two and the gain of the detector was varied
by up to 20 %. Every possible combination of these effects was also exercised. For all
cases, the optimal Huffman table was generated and truncated to a maximum length of
12 bit, and the optimal length-limited table was calculated. The resulting compression factors
as a function of the occupancy of the detector are shown in figure 3.7 for the length-limited
Huffman. It can be seen that even in the ideal case, the achieved compression factor falls
below the required factor of 2.5 for an occupancy of more than around 40 %. Although the
maximum occupancy is expected to be not higher than 30 % [5], a bigger safety margin
is needed for a reliable detector operation. It can also be seen that as soon as the noise
increases, the overall compression factor decreases significantly. For a noise increase by a
factor of two, the occupancy limit up to which the compression is still good enough is
reduced to only 20 %, which is definitely too low. This observation is very plausible: when
the fluctuations due to the higher noise get bigger, the average differences are also
larger and the peak in the probability distribution becomes broader. All the other effects
have only a minor impact on the achieved compression. These statements are true for
both the truncated and the length-limited Huffman. Indeed, both show a very similar
compression factor; however, the factor achieved by the length-limited Huffman is up to
3 % better than that of the truncated Huffman, especially in a high-occupancy environment.
Figure 3.7: Compression factor of the length-limited Huffman encoding as a function of
the occupancy for different modifications of the data set (no changes, cm applied, noise
increased, gain variation and all combinations of noise, cm and gain).
These plots were generated for the case that the probability distribution of the
current readout conditions is known beforehand. A more significant problem occurs if the running
conditions change during operation, e.g. if the noise suddenly starts to increase or the
gain changes. The Huffman tables need to be precalculated and loaded into the hardware.
Should a table be loaded which does not exactly fit the current running conditions, the
compression factors will be even worse. This, together with the visible strong decrease
of the compression factor with an increased noise contribution, rules out the option of a
Huffman encoded raw data readout for the alice tpc.
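The penalty of a table that does not fit the actual running conditions can be illustrated with a toy example. The two distributions below are invented and much smaller than the real spectra; the sketch only shows that the average code length for a given data set is larger with a table optimised for a different distribution.

```python
import heapq


def huffman_lengths(probabilities):
    """Code lengths of a Huffman code for the given list of probabilities."""
    heap = [(p, index, [index]) for index, p in enumerate(probabilities)]
    heapq.heapify(heap)
    lengths = [0] * len(probabilities)
    tie = len(probabilities)  # unique tie-breaker for merged nodes
    while len(heap) > 1:
        p0, _, words0 = heapq.heappop(heap)
        p1, _, words1 = heapq.heappop(heap)
        for index in words0 + words1:  # every merge adds one bit to the contained words
            lengths[index] += 1
        heapq.heappush(heap, (p0 + p1, tie, words0 + words1))
        tie += 1
    return lengths


def average_length(probabilities, lengths):
    """Average number of bits per word when data with these probabilities uses this table."""
    return sum(p * length for p, length in zip(probabilities, lengths))


narrow = [0.70, 0.15, 0.10, 0.05]  # hypothetical low-noise distribution
broad = [0.25, 0.25, 0.25, 0.25]   # hypothetical distribution after a noise increase

table_for_narrow = huffman_lengths(narrow)
table_for_broad = huffman_lengths(broad)

print(average_length(narrow, table_for_narrow))  # matched table, narrow data
print(average_length(broad, table_for_broad))    # matched table, broad data
print(average_length(broad, table_for_narrow))   # broad data with the wrong table
```

Already with the matched table the broader distribution needs more bits per word, and the table optimised for the narrow distribution makes this worse.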
3.1.3 Change of the Readout Scheme
After it was found that the zs can not be applied in the sampa dsp due to the cm effect, at
least not without significantly degrading the data quality, and after finding in addition that
the alternative approach of compressing the differential raw data with Huffman encoding
is not robust enough against detector effects such as an increase of the noise contribution
or changing running conditions, the readout scheme had to be reconsidered. There is no
additional option left for a data reduction, so the complete data volume of all sampas of
16 Gbit/s per fec has to be transmitted to the crus. An additional complication is, that
since the volume of the data already exceeds the available bandwidth, there is no additional
25
Chapter 3 – Readout Strategy for the ALICE TPC
Figure 3.8: Block diagram of the sampa asic, showing the main building blocks. For
each of the 32 channels, there is a csa, followed by a shaper, a 10 bit adc and a dsp,
taken from [26].
space for a packaged data format with a header containing timing and status information.
In order to be able to detect phase shifts in the sampa adc sampling clock, caused by
single event upsets in the internal clock divider network, this clock is made available so
that it can be monitored. For this, the 10 bit adc output bus of the sampa was extended by
one further bit to 11 bit. This eleventh bit carries the adc clock so that it is transmitted
together with the data. Because the data of sampa 2 is shipped to two different receivers,
as can be seen in figure 3.1, this eleventh output must also be received by both, giving
an effective width of the output of 12 bit for sampa 2. This results in a new bandwidth
requirement of
(4× 11 bit + 12 bit)× 32× 10 MHz = 17.92 Gbit/s (3.4)
which is by chance exactly the bandwidth which is available by combining 4 gbtx asics of
4 × 4.48 Gbit/s = 17.92 Gbit/s. So, to be able to transfer all data, either the transmission
bandwidth has to be doubled by using twice as many optical components, or the data volume
has to be reduced by a factor of two, e.g. by halving the sampling frequency. Since
doubling the optical components would lead to substantial additional costs in the upgrade,
the reasons for the 10 MHz sampling rate had to be reevaluated. Originally, the alice
tpc was designed for a 5 MHz sampling rate to have roughly the same bin width in all
three spatial directions [29]. This was changed later because the ion tail cancellation filter
performs slightly better with 10 MHz sampling and because the sensitivity to cluster
tails is better with an applied zs. Since neither argument matters anymore in the new system
(no zs can be applied and the gem setup does not generate ion tails), the impact of reducing
the sampling frequency on the physics performance was studied and presented in [33].
The simulations demonstrated that the Particle Identification (pid) performance as well
as the tracking efficiency and momentum resolution are affected only marginally
by the reduction of the sampling rate. Even with a low signal-to-noise ratio both
sampling frequencies behave similarly. Based on this study it was decided to reduce the
                 normal mode                               split mode
cycle   serial links [9:5]  serial links [4:0]   serial links [9:5]  serial links [4:0]
        ch [bits]           ch [bits]            ch [bits]           ch [bits]
0       0 [9:5]             0 [4:0]              16 [4:0]            0 [4:0]
1       1 [9:5]             1 [4:0]              16 [9:5]            0 [9:5]
2       2 [9:5]             2 [4:0]              17 [4:0]            1 [4:0]
3       3 [9:5]             3 [4:0]              17 [9:5]            1 [9:5]
4       4 [9:5]             4 [4:0]              18 [4:0]            2 [4:0]
...     ...                 ...                  ...                 ...
28      28 [9:5]            28 [4:0]             30 [4:0]            14 [4:0]
29      29 [9:5]            29 [4:0]             30 [9:5]            14 [9:5]
30      30 [9:5]            30 [4:0]             31 [4:0]            15 [4:0]
31      31 [9:5]            31 [4:0]             31 [9:5]            15 [9:5]
Table 3.1: The two different transmission modes of the sampa in das mode. In normal
mode, the 10 bit adc values of each channel are provided one-by-one to the output ports.
In split mode, the serial links are split in half: ports 9–5 are used for channels 16–31 and
ports 4–0 for channels 0–15. Each 10 bit adc value is therefore sent in two consecutive
cycles, first the five lsb, then the five msb. The pattern is repeated after 32 cycles in
both modes.
sampling frequency by a factor of two to be able to transmit the raw data uncompressed and
unmodified to the crus and to apply the baseline restoration, which is necessary for further
processing, there. In the following, the three main components of the readout and online
processing chain are described in more detail, with a focus on the aspects which are most
important for the readout of the tpc.
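The bandwidth argument around equation (3.4) can be cross-checked with a few lines of Python. The following is only a minimal sketch: the constant and function names are chosen here for illustration and are not taken from any readout software, while the numbers are those quoted in the text.

```python
# Illustrative cross-check of equation (3.4); all names are hypothetical.
N_CHANNELS = 32        # channels per sampa
WIDE_BUS_BITS = 112    # user bits per gbt frame in wide bus mode
FRAME_RATE_MHZ = 40    # gbt frame rate
LINKS_PER_FEC = 4 * 11 + 12   # four sampas with 11 links, sampa 2 with 12


def required_mbit_per_s(sampling_mhz):
    """Raw data rate of one fec in Mbit/s for a given adc sampling rate."""
    return LINKS_PER_FEC * N_CHANNELS * sampling_mhz


def gbtx_needed(sampling_mhz):
    """Smallest number of gbtx asics (wide bus mode) that carries this rate."""
    per_gbtx = WIDE_BUS_BITS * FRAME_RATE_MHZ   # 4480 Mbit/s per gbtx
    return -(-required_mbit_per_s(sampling_mhz) // per_gbtx)   # ceiling division


for rate in (10, 5):
    print(f"{rate} MHz: {required_mbit_per_s(rate) / 1000:.2f} Gbit/s, "
          f"{gbtx_needed(rate)} gbtx")
```

With these numbers, four gbtx asics per fec are needed at 10 MHz and two at 5 MHz, which is the factor of two in optical components discussed above.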
3.2 The SAMPA Chip
The main motivation for the development of the sampa was the change in the readout
strategy in Run 3 towards a continuous readout. Therefore the presently used tpc fee,
consisting of the 16 channel pasa asic (front-end amplifier and shaper) and the 16 channel
altro chip (10 bit adc and dsp), was developed further. The new sampa asic now integrates
32 channels of the whole processing chain, as indicated in figure 3.8. Each of
the 32 paths consists of a positive/negative polarity csa, a shaper and a 10 bit adc supporting
up to 20 Msamples/s, whose data is then fed into a dsp. This dsp is capable of doing
additional processing on the digitised signals, like applying different baseline correction
filters or compression algorithms. The sampa operation can be adapted with programmable
parameters so that the chip can be used by two different detector systems, the muon
chambers of the spectrometer and the tpc [26].
The sampa can be used either with the integrated dsp enabled (and all its data processing
capabilities) or in das mode. The tpc will use the latter to avoid the problem that the
zs cannot be applied due to the cm effect. In this mode, the raw adc samples of the
words 0–7:    A B A B A B A B
words 8–15:   A A B B A A B B
words 16–23:  A B A B A B A B
words 24–31:  A A B B A A B B
(a) sampa synchronisation pattern in normal mode.
words 0–7:    A A B B A A B B
words 8–15:   A A A A B B B B
words 16–23:  A A B B A A B B
words 24–31:  A A A A B B B B
(b) sampa synchronisation pattern in split mode.
Figure 3.9: The sampa synchronisation pattern in das mode, using A for 0x2B5 and
B for 0x14A, consists of 32 10 bit words. Figure (a) shows the pattern in normal mode
while figure (b) shows it in split mode.
32 channels are multiplexed to the ten output pins and sequentially transmitted. Because
the digital circuitry of the dsp is not needed, it is powered down to reduce the power
consumption. In this operation mode, two transmission modes are available which use the
ten available serial links in different ways, the normal mode and the split mode.
In normal mode, the 10 bit adc values of each channel are provided one-by-one to the
output links, starting with channel 0 in the first cycle, channel 1 in the next cycle and so
on until channel 31 is reached. Then channel 0 follows again, but from the next time bin. In
the split mode, the serial links 4–0 are used for channels 0–15 while the links 9–5 are used
for channels 16–31. This is shown in table 3.1. Since the adc value remains a 10 bit word,
it must be split across two consecutive output cycles, always the five lsb first, then the
five msb, as shown in the last two columns of the table.
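The two transmission modes of table 3.1 can be illustrated with a short Python sketch. It is not the sampa or cru implementation: the function names are hypothetical, and it is assumed here that within each group of five links the lowest link carries the lowest bit of the transmitted half word.

```python
# Sketch of the das transmission modes of table 3.1 (hypothetical names).

def das_word(samples, cycle, mode):
    """Word on serial links [9:0] in the given cycle (0..31).

    samples holds the 32 adc values (10 bit each) of one time bin.
    """
    if mode == "normal":
        return samples[cycle] & 0x3FF
    # split mode: two cycles per channel pair, five lsb first, then five msb
    pair, half = divmod(cycle, 2)
    shift = 5 * half
    low = (samples[pair] >> shift) & 0x1F         # links [4:0]: channels 0-15
    high = (samples[16 + pair] >> shift) & 0x1F   # links [9:5]: channels 16-31
    return (high << 5) | low


def decode_split(words):
    """Rebuild the 32 adc values of one time bin from 32 split mode words."""
    samples = [0] * 32
    for cycle, word in enumerate(words):
        pair, half = divmod(cycle, 2)
        shift = 5 * half
        samples[pair] |= (word & 0x1F) << shift
        samples[16 + pair] |= ((word >> 5) & 0x1F) << shift
    return samples


adc = [(37 * ch + 5) % 1024 for ch in range(32)]   # arbitrary test values
split_words = [das_word(adc, c, "split") for c in range(32)]
print(decode_split(split_words) == adc)
```

In both modes 32 cycles carry exactly one time bin of all 32 channels, so the mode only changes the ordering of the bits, not the data rate.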
In both transmission modes, the eleventh serial link is used to provide the internally
generated adc clock. This makes it possible to observe phase shifts in the adc
clock, caused by single event upsets in the sampa internal clock divider network, and
to act accordingly, e.g. by resetting all sampas and resynchronising all fecs of the
tpc. It has to be noted that there is no fixed relation between the clock on this port and
the data on the other ports. That is why it cannot be used to mark a specific channel;
the channel must be reconstructed in a different way. The sampa sends a continuous
stream of data in the das mode. To be able to identify channel 0 (and with that all
following channels), a synchronisation pattern is used. This synchronisation pattern is
sent once at the very beginning of the readout after the sampa receives a reset signal.
The pattern consists of 32 10 bit words, thus the complete pattern has the same length
as a readout cycle of all 32 channels. Two complementary values (0x2B5 and 0x14A) are
used to build the pattern, which switch in the way shown in figure 3.9. The top part
[Frame layout from bit 119 down to bit 0, with field boundaries at bits 115, 111 and 31:
h | sc | data (G4 G3 G2 G1 G0) | fec (G6 G5)]
Figure 3.10: The gbt protocol, consisting of a 4 bit header (h), a 4 bit Slow Control
(sc) field with 2 bit for the gbtx internal control and 2 bit for the external control of
the gbt-sca, an 80 bit data field and 32 bit for the optional automatic Forward Error
Correction (fec). The different data groups of the wide bus mode are indicated below.
shows the switching in the normal mode, the bottom part in the split mode. Although
the patterns look quite different in the two modes, they are identical after de-interleaving
the pattern of the split mode. The synchronisation pattern is not only used to
identify the very first channel but also to synchronise the data from all sampas of all fecs.
Since they receive the reset signal at the same time (the signal propagation latency is
deterministic in both directions), they also all start sampling and sending the data at the
same time [38].
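How the synchronisation pattern can be used to locate channel 0 in the continuous data stream is sketched below in Python. The patterns are transcribed from figure 3.9. The function names are hypothetical, and it is an assumption of this sketch that the first word after the pattern belongs to channel 0; the actual decoder of the cru is not described in this section.

```python
# Sketch: locating channel 0 with the synchronisation pattern of figure 3.9.
A, B = 0x2B5, 0x14A   # the two complementary 10 bit values

ROWS = {
    "normal": ["ABABABAB", "AABBAABB"],   # words 0-7 and 8-15, then repeated
    "split": ["AABBAABB", "AAAABBBB"],
}


def sync_pattern(mode="normal"):
    """The 32 word pattern as it appears on the ten data links."""
    letters = "".join(ROWS[mode] * 2)
    return [A if letter == "A" else B for letter in letters]


def find_first_data_word(stream, mode="normal"):
    """Index of the first word after the pattern, or None if it is absent."""
    pattern = sync_pattern(mode)
    length = len(pattern)
    for start in range(len(stream) - length + 1):
        if stream[start:start + length] == pattern:
            return start + length
    return None


# Seven idle words, the pattern, then one time bin of dummy adc values.
stream = [0] * 7 + sync_pattern() + list(range(32))
print(find_first_data_word(stream))
```

The check `A ^ B == 0x3FF` confirms that the two values are bitwise complements of each other within 10 bit.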
3.3 The GBT System
The experiments at the lhc and at other, e.g. future, colliders require high data rate links
which can sustain high radiation doses. The gbt project [39] addresses this issue by providing
a radiation hard on-detector asic, implementing a 4.8 Gbit/s bi-directional optical link
between experiment and the cr. The counterpart of the system is located off-detector in
an environment without radiation and consists of an fpga, programmed to be compatible
with the gbt protocol, implementing the interface to further off-detector components. The
on-detector components which are relevant for the tpc readout are the gbtx and the
gbt-Slow Control Adapter asic (gbt-sca). The gbtx is a serialiser/deserialiser chip. Its
task is to provide the interface to the detector fee and to encode and decode the
data into the gbt protocol. The serialised data is then transmitted at 4.8 Gbit/s. The
gbt-sca provides the Slow-Control (sc) interface. The chip is connected to dedicated pins
of the gbtx and implements commonly used control buses like i2c or jtag. It can also be
used to monitor environmental variables like temperatures and voltages [39].
3.3.1 The Transmission Protocol
The gbt protocol is a 120 bit wide frame, subdivided into four fields as shown in figure 3.10.
4 bit are reserved for a header and another 4 bit for the Slow-Control (sc) interface. The 4 bit
sc field is further subdivided into 2 bit for the internal control of the gbtx itself and 2 bit
for the external control of a sc interface, e.g. to connect the gbt-sca via those dedicated
output pins. Then there is an 80 bit wide data field and finally a 32 bit wide field for the
optional automatic Forward Error Correction (fec). On the receiving side, the frame is
updated with a frequency of 40 MHz. Three different frame formats are available for the
data transmission with the gbtx [40]:
GBT frame format: This is the default frame format. The 32 bit fec field is used for an
automatic forward error correction of the remaining 88 bit, based on a Reed-Solomon
encoding. It was chosen to provide a high level of error correction that is also able
to deal with bursts of up to 16 consecutive wrongly received bits. The encoding is
done before serialisation and the decoding after deserialisation. In this way, transmission
errors or single event upsets can be corrected. This reliability in the transmission
is especially important while transmitting control and trigger signals. In this format, only
the 80 bit data field can be used for user data, resulting in a bandwidth of
80 bit × 40 MHz = 3.2 Gbit/s.
The tpc will use this frame format for the down-stream path, to configure the sampas
and send control signals to the fec, in order to benefit from the automatic error
correction for the control path.
Wide bus mode: For applications where the bandwidth is more important than the possibility
of an automatic error correction, the 32 bit fec field can also be used for
additional user data. This increases the available bandwidth by 40 % compared to
the gbt frame format to
(80 bit + 32 bit) × 40 MHz = 4.48 Gbit/s.
The tpc will use this format for the up-stream data path. Here, the bandwidth is
most important and since raw adc values without Huffman encoding are transmitted,
single bit-flips are tolerable.
8B/10B frame mode: For completeness, the third mode is also mentioned, although it is
not foreseen to be used in the tpc readout scheme. On user request, the 8B/10B
frame mode was added for the up-stream transmission. The 120 bit frame is divided
into twelve 8B/10B∗ words of which eleven are available for user data since the first
one is needed for the synchronisation on the receiver side. This leads to a bandwidth of
11 × 8 bit × 40 MHz = 3.52 Gbit/s
which is only marginally higher than that of the gbt frame format, with the disadvantage
of the omitted forward error correction. The advantage of this mode is the reduced
amount of resources needed for the fpga implementation on the receiving side because
the error correction is skipped.
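The user bandwidths of the three frame formats follow directly from the number of user bits per 120 bit frame and the 40 MHz frame rate. The following Python sketch only repeats this arithmetic; the dictionary keys and function name are chosen here for illustration.

```python
# Sketch: user bandwidth of the three gbtx frame formats (hypothetical names).
FRAME_RATE_MHZ = 40

USER_BITS = {
    "gbt": 80,              # data field only, fec field used for correction
    "wide_bus": 80 + 32,    # fec field reused for user data
    "8b10b": 11 * 8,        # eleven of twelve 8B/10B words carry user data
}


def bandwidth_gbit_per_s(frame_format):
    """User bandwidth in Gbit/s for one gbt link."""
    return USER_BITS[frame_format] * FRAME_RATE_MHZ / 1000


for name in USER_BITS:
    gain = USER_BITS[name] / USER_BITS["gbt"] - 1
    print(f"{name}: {bandwidth_gbit_per_s(name):.2f} Gbit/s "
          f"({gain:+.0%} relative to the gbt frame format)")
```

The wide bus mode gains 40 % over the gbt frame format, while the 8B/10B mode gains only 10 %, which is the "marginally higher" bandwidth mentioned above.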
3.3.2 The SAMPA Data within the GBT Frame
The electrical interface between the gbtx and the Front-End Devices (feds) is realised
via so-called eLinks. Each eLink consists of three signal lines: a clock line driven by the
gbtx, a data down-link to transmit data from the gbtx to the device and an up-link to
deliver data from the device to the gbtx. The setup of those lines, especially the data rate,
∗In an 8B/10B coding, an 8 bit word is transmitted as a 10 bit binary string to achieve dc-balancing [41].
group gbt frame bits
0 [47:32]
1 [63:48]
2 [79:64]
3 [95:80]
4 [111:96]
5 [15:0]
6 [31:16]
Table 3.2: The bits of the gbt frame of the individual data groups in the wide bus
mode [40].
is programmable on a per group level and the number of eLinks available in each group
depends on the data rate. Each group consists of four eLinks for a data rate of 160 Mbit/s.
If the data rate is doubled, the number of eLinks is halved and vice versa.
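The group positions of table 3.2 translate into simple bit slicing of the 120 bit frame. The Python sketch below treats a frame as an integer. The positions of the header and sc fields ([119:116] and [115:112]) are an assumption inferred from the field widths and the order in figure 3.10, and the function names are hypothetical.

```python
# Sketch: slicing a wide bus mode gbt frame into its fields (hypothetical names).
GROUP_BITS = {            # group -> (msb, lsb), transcribed from table 3.2
    0: (47, 32), 1: (63, 48), 2: (79, 64), 3: (95, 80),
    4: (111, 96), 5: (15, 0), 6: (31, 16),
}


def bits(frame, msb, lsb):
    """Extract frame[msb:lsb] from a frame stored as a Python integer."""
    return (frame >> lsb) & ((1 << (msb - lsb + 1)) - 1)


def split_wide_bus_frame(frame):
    """Return header, sc field and the seven 16 bit data groups."""
    return {
        "h": bits(frame, 119, 116),    # assumed position of the header
        "sc": bits(frame, 115, 112),   # assumed position of the sc field
        "groups": [bits(frame, *GROUP_BITS[g]) for g in range(7)],
    }


# Build a dummy frame whose group g carries the value 0x1000 + g.
frame = (0x5 << 116) | (0x3 << 112)
for group, (msb, lsb) in GROUP_BITS.items():
    frame |= (0x1000 + group) << lsb
print(split_wide_bus_frame(frame))
```

Each group is 16 bit wide, which at 160 Mbit/s corresponds to four eLinks contributing four bits each per 40 MHz frame.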
The mapping of the input eLinks into the gbt frame is fixed and depends on the used
frame format and the configured data rate. For the wide bus mode, this mapping between
the eLink groups and the gbt frame is shown in figure 3.10. The exact bits of the frame
belonging to the individual groups can be found in table 3.2. The tpc will use a data
rate of 160 Mbit/s between the gbtx and the fed (the sampa chip). Therefore each group
consists of four eLinks. Since the connections between the eLinks and the sampa ports
are hardwired on the fec, the location of the data of the individual output pins of the
sampa within the gbt frame is purely determined by the layout of the fec and therefore
fixed during the design phase of the Printed Circuit Board (pcb). The connections were
made in a way that allows an easy and straightforward decoding. At the time when the
discussion about the pcb layout took place, it was not yet decided to halve the sampling
rate for the tpc from 10 MHz to 5 MHz. However, the raw data readout was already
settled. To cope with the expected data rates, a fec version with 4 gbtx and a data rate
of 320 Mbit/s between the gbtx and the feds was discussed, which is where the layouts in
figure 3.11a and figure 3.11b originate. For better readability, the mapping between the sampa
and the groups is shown instead of the gbt frame bits. The ordering of the first version in
figure 3.11a is straightforward on the pcb connection side. Port 0 of sampa 0 is connected
to group 0 of gbtx 0 and then it is simply counted upwards with the port numbers of the
sampa (and then with the sampa id) on the one side and with the groups of the gbtx
(and then with the gbtx id) on the other side. The only exception is the eleventh port
of sampa 2 which is routed to gbtx 1 as well as gbtx 2. This port transmits the adc
sampling clock which is needed for monitoring purposes. Since the two gbtx groups 0/1
and 2/3 are foreseen to be connected to different crus, this clock needs to be transmitted
via both paths to be able to monitor the quality of the data independently in both crus
receiving the data from sampa 2.
In version two, the ordering is done in a way that the gbt frames from the individual
gbtx chips look very similar. This simplifies the decoding significantly. Since sampa 2 is
a special case anyway because of the eleventh eLink, it was decided to split only the ports
of sampa 2 across different gbtx chips; the data from the ports of all other sampas are
gbtx 0:  sampa 1 [2:0]      | sampa 0 [10:0]
gbtx 1:  sampa 2 [10, 4:0]  | sampa 1 [10:3]
gbtx 2:  sampa 3 [7:0]      | sampa 2 [10, 9:5]
gbtx 3:  sampa 4 [10:0]     | sampa 3 [10:8]
groups (left to right): G6 G5 G4 G3 G2 G1 G0
(a) Mapping of all sampa ports into four gbt frames. The ordering is straightforward on
the connection side, it was started with port 0 of sampa 0 connected to group 0 of gbtx 0
and then just continued counting upwards with the ports of the sampas on the one side
and with the groups of the gbtx on the other side. The only exception is the eleventh port
of sampa 2 which is routed to gbtx 1 as well as gbtx 2.
gbtx 0:  sampa 2 [2:0]      | sampa 0 [10:0]
gbtx 1:  sampa 2 [10, 4:3]  | sampa 1 [10:0]
gbtx 2:  sampa 2 [7:5]      | sampa 3 [10:0]
gbtx 3:  sampa 2 [10, 9:8]  | sampa 4 [10:0]
groups (left to right): G6 G5 G4 G3 G2 G1 G0
(b) Mapping of all sampa ports into four gbt frames. The ordering is done in a way that
each gbt frame looks very similar to the others. Since sampa 2 is a special case anyway
because of the eleventh port, which needs to be routed to the gbtx 0/1 group as well as to
gbtx 2/3, it was decided to split only the ports of this chip across different gbtx. The
ports of all other sampas are sent by only a single gbtx chip.
gbtx 0:  sampa 2 [10, 4:0]  | sampa 1 [10:0]  | sampa 0 [10:0]
gbtx 1:  sampa 2 [10, 9:5]  | sampa 4 [10:0]  | sampa 3 [10:0]
groups (left to right): G6 G5 G4 G3 G2 G1 G0
(c) Mapping of all sampa ports into two gbt frames. After the readout frequency was
reduced to 5 MHz there was enough bandwidth available to transfer the data of 2.5 sampas
via a single gbtx. Still the sampa 2 remains a special case. Otherwise version (b) was
adapted to this solution.
Figure 3.11: Different versions of how to map the data from all five sampas into the
gbt frames. Version (a) and (b) originate from a time where the 10 MHz readout of the
tpc was still the baseline but the raw data readout was already settled. Version (b) was
then adapted to (c) after it was decided to reduce the readout frequency by a factor of
two. The data groups of the gbt frame (G0–G6) which were shown already in figure 3.10
are aligned for a better readability.
kept within a single sampling chip. The layout of the sampa 2 connections is chosen in a
way that the two crus receive a similar pattern. The decision was then made in favour of
the second version for two reasons:
1. Each frame of the different gbtx looks very similar which simplifies the decoding. To
decode the frames of version 1, four different decoders would have had to be written
for the different formats. sampa 2 is still a special case, but also this format is similar
when comparing the two gbtx groups 0/1 and 2/3.
2. This format is more failsafe in case of phase-shifts of the different sampling clocks
in the individual gbtx. These are still four individual gbtx asics, sampling the
data from the sampas independently. In version 1, three out of the five sampas are
split across two gbtx while in version 2 it is only the data of sampa 2, reducing the
danger of data loss.
After the reduction of the sampling frequency for the tpc from 10 MHz to 5 MHz, the
basic principle of the second layout was kept and adapted for a factor two less in data
volume and therefore only two gbtx asics on the fec. The resulting layout is shown in
figure 3.11c. The frame from gbtx 0 looks exactly the same as from gbtx 1, only the
origin of the content is different. Thanks to the split mode of the sampa, which was already
introduced in section 3.2, even the mapping for sampa 2 looks exactly the same in both
frames, making it possible to use the same decoder without any changes to decode the
frames from both gbtx chips of each fec. The reduction in readout frequency by a factor
of two went along with a reduction in the data rate between the gbtx and the fed by the
same factor. As a consequence, each group now contains the data of four eLinks instead of
only two. With that, the data of 2.5 sampas fit into one frame, instead of only the data of
one sampa and a quarter of the ports of sampa 2.
To fill the user data of the gbt frame of 112 bit, the gbtx concatenates multiple time
bins of the input ports into a single frame. For the 320 Mbit/s case, eight time bins of each
eLink would have been put into a single frame. For the 160 Mbit/s case, this is reduced to
four. The exact layout of how the individual time bins are filled in the gbt frame and how
the decoding is done will be discussed in subsection 5.2.1.
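The three mappings of figure 3.11 can be checked for consistency with the eLink budget of a gbtx in wide bus mode. The Python sketch below transcribes the port lists from the figure; the helper names are hypothetical and the check is only an illustration, not part of the fec design procedure.

```python
# Sketch: eLink budget check for the mappings of figure 3.11 (hypothetical names).
from collections import Counter

N_GROUPS = 7   # data groups G0-G6 in wide bus mode


def elinks_per_gbtx(rate_mbit):
    """Four eLinks per group at 160 Mbit/s, two per group at 320 Mbit/s."""
    return N_GROUPS * (4 * 160 // rate_mbit)


def ports(msb, lsb, *extra):
    """Port list for the range [msb:lsb] plus optional single ports."""
    return list(range(lsb, msb + 1)) + list(extra)


# One dict per gbtx: sampa id -> connected ports.
VERSION_A = [{0: ports(10, 0), 1: ports(2, 0)}, {1: ports(10, 3), 2: ports(4, 0, 10)},
             {2: ports(9, 5, 10), 3: ports(7, 0)}, {3: ports(10, 8), 4: ports(10, 0)}]
VERSION_B = [{0: ports(10, 0), 2: ports(2, 0)}, {1: ports(10, 0), 2: ports(4, 3, 10)},
             {3: ports(10, 0), 2: ports(7, 5)}, {4: ports(10, 0), 2: ports(9, 8, 10)}]
VERSION_C = [{0: ports(10, 0), 1: ports(10, 0), 2: ports(4, 0, 10)},
             {3: ports(10, 0), 4: ports(10, 0), 2: ports(9, 5, 10)}]


def layout_is_consistent(layout, rate_mbit):
    """Every gbtx is exactly filled and every port is routed as described."""
    seen = Counter()
    for gbtx in layout:
        if sum(len(p) for p in gbtx.values()) != elinks_per_gbtx(rate_mbit):
            return False
        for sampa, plist in gbtx.items():
            seen.update((sampa, port) for port in plist)
    expected = Counter({(s, p): 1 for s in range(5) for p in range(11)})
    expected[(2, 10)] = 2   # adc clock of sampa 2 goes to both crus
    return seen == expected


def split_sampas(layout):
    """Number of sampas whose ports are spread over more than one gbtx."""
    count = Counter(sampa for gbtx in layout for sampa in gbtx)
    return sum(1 for n in count.values() if n > 1)


for name, layout, rate in (("a", VERSION_A, 320), ("b", VERSION_B, 320),
                           ("c", VERSION_C, 160)):
    print(name, layout_is_consistent(layout, rate), split_sampas(layout))
```

The count of split sampas reproduces the second argument given above: three sampas are split across gbtx asics in version (a), but only sampa 2 in versions (b) and (c).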
3.4 The Common Readout Unit
The data of the tpc fecs is received by the crus. Those units act as the interface
between the on-detector systems, the Central Trigger Processor (ctp), which provides
trigger information and the lhc clock, and the computing farm for the Detector Control
System (dcs), further processing and data storage [42]. The cru was originally
developed by lhcb for their readout upgrade in view of lhc Run 3 under the
name pcie40, but will be used in alice as well. The main component for the processing
logic is an Intel Arria10 fpga, which is one of the most powerful fpgas currently
available on the market, with 1 150k Logic Elements (les) [43, 44]. A complete overview
of the available resources of the fpga is given in table 3.3. The most important resources
for the later development are, besides the huge amount of les, the amount of
memory which can be stored in the m20ks of 54 260 kbit and the number of dsp blocks
Product line                              GX 1150
les                                       1 150k
alms                                      427 200
Register                                  1 708 800
Memory (kbit)     m20ks                   54 260
                  mlabs                   12 984
Variable-precision dsp blocks             1 518
18 × 19 multipliers                       3 036
plls              Fractional Synthesis    32
                  I/O                     16
17.4 Gbit/s transceivers                  96
GPIO                                      768
LVDS pairs                                384
PCIe hard IP blocks                       4
Hard memory controller                    16
Table 3.3: Resources of the Intel Arria10 (10AX115S3F45E2SG) fpga of the cru, taken
from [45].
of 1 518. The memory corresponds to 2 713 individual m20k ram blocks. An image
of the cru is shown in figure 3.12, where the fpga is nicely visible in the centre of
the card.
The cru is, as the name suggests, a common solution for many of the detectors in
alice to interface the data acquisition system. The card provides up to 48 bidirec-
tional optical links of which the tpc will use up to 20 to connect to the fee. The
number of links, which is equivalent to the number of connected fecs, used in the in-
dividual readout regions is given in table B.1 and ranges from 15 to 20. It must be
noted that each cru receives the data of only half a fec, as already shown in figure 3.1,
or phrased the other way around, each fec is read out by two different crus. Since
the fec needs to be controlled by one single master, the relationship between a fec
and its two crus is not symmetric. One is the master which implements the up-link
(for data readout) and the down-link (for control) via the vtrx and gbtx 0, while the
other cru is the slave, implementing only the up-link to receive data via the vttx and
gbtx 1. The communication with the fecs is done using the gbt protocol in wide bus
mode. Therefore, the maximum data input rate to a single cru sums up to a total
amount of
20× 112 bit× 40 MHz = 89.6 Gbit/s (3.5)
which needs to be either compressed or transferred via the pcie bus to the host machine,
the flp. Here, the data will be further processed and sent via a high performance net-
work connection to the Event Processing Nodes (epns) where the data of multiple flps
is combined and a calibration and the tracking can take place. Another optical link is
implemented to receive the trigger information from and send status information to the
ctp. This is the Trigger, Timing and clock distribution System (tts) link, providing the
baseline lhc clock as well as the experiment wide Heart Beat (hb) signal and further
Figure 3.12: Image of the cru version 1. The heat-sink of the fpga was removed
so that the Arria10 is visible. The optical components of the 48 bidirectional links are
located below the still mounted heat-sinks on the back. The vertical pcbs on the right
side are needed for the power supply of the card and were redesigned in future versions of
the cru. This photo was kindly provided by [46].
trigger and timing signals. The hb signal is used to synchronise all electronics of the whole
experiment and is issued every 89.4 µs.
The interface between the host machine and the cru is pcie generation 3 with 16 lanes
providing a practical sustainable bandwidth of 90 Gbit/s [42]. This means that the tpc
will also be able to dump the raw data to the flp, at least for short time periods. This
possibility to examine the raw gbt frames in software will be needed for calibration and
debugging purposes. In normal data taking, however, the output data rate of the cru is
expected to be at least a factor of five lower. This reduction is achieved by a Cluster Finder
(cf) running on the fpga. This cf achieves the compression factor of five by first applying
an intrinsic zs, because only physically relevant information is taken into account, and
second by converting the data into the Cluster Data Format (cdf). Since the average pad
occupancy is expected to be at most 30 % [5], the zs will give a compression of at least
3.3, while the conversion into the cdf gives another factor of 250/160 ≈ 1.56, because the
250 bit of all adc values contributing to the cluster will form a single cluster word with a
size of 160 bit. Details can be found in section 5.3. This rough estimation leaves out the
possibility of overlapping clusters; nevertheless, it gives a first idea of the maximum
expected output data rate of
89.6 Gbit/s × 0.3 × 160/250 = 17.2 Gbit/s = 2.15 GB/s (3.6)
per cru after applying the cf. This compression factor depends on the occupancy in the
tpc. If the occupancy increases, more clusters are found and the data rate increases as
well. The factor is further enhanced by additional postprocessing steps in the flp by a
reformatting of the data and a Huffman encoding.
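The rate estimate of equations (3.5) and (3.6) can be reproduced with the short Python sketch below. The function names and default arguments are chosen here for illustration; the numbers (20 links, 30 % occupancy, 250 bit of adc values per 160 bit cluster word, 90 Gbit/s pcie bandwidth) are those quoted in the text.

```python
# Sketch of the cru rate estimate of equations (3.5) and (3.6); hypothetical names.
GBT_USER_BITS = 112      # wide bus mode
FRAME_RATE_MHZ = 40
PCIE_GBIT = 90.0         # practical pcie bandwidth quoted in the text


def cru_input_gbit(n_links=20):
    """Maximum input rate of one cru in Gbit/s, equation (3.5)."""
    return n_links * GBT_USER_BITS * FRAME_RATE_MHZ / 1000


def cf_output_gbit(input_gbit, occupancy=0.3, adc_bits=250, cluster_bits=160):
    """Output rate after zero suppression and conversion to cluster words."""
    return input_gbit * occupancy * cluster_bits / adc_bits


rate_in = cru_input_gbit()
rate_out = cf_output_gbit(rate_in)
print(f"input  {rate_in:.1f} Gbit/s (raw dump fits pcie: {rate_in <= PCIE_GBIT})")
print(f"output {rate_out:.1f} Gbit/s = {rate_out / 8:.2f} GB/s, "
      f"compression factor {rate_in / rate_out:.1f}")
```

Lowering the `occupancy` argument shows directly how the compression factor improves for less populated events, while the 250/160 term stays fixed in this simplified picture without overlapping clusters.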
In general, different readout modes are foreseen to be implemented in the cru. Each
mode serves a different purpose. The default mode is the clusterised data output, mainly
used in normal physics data taking. Here, the data volume is the lowest while keeping all
physically relevant information. For debugging and calibration tasks, it must also be possible
in the final system to write out the raw gbt frames or data from the intermediate steps of cf
processing to the flp, at the price of an increased data volume [42].
Location of the CRUs
The crus will be located off-detector in the crs. Together with the flps they will be
located in cr 1 [47], which is the uppermost one. Those rooms are placed in the shaft which
is going down to the experiment, as can be seen in figure 3.13. Since they are sitting behind
the shielding, they are outside of the radiation area. This facilitates the development of
the hard- and firmware, as no special attention has to be paid to a radiation-hard design.
Figure 3.13: Layout of the alice underground area. The L3 magnet is shown as well
as the crs, taken from [29].
Chapter 4
The CRU Firmware
Before starting with the Cluster Finder (cf), which is the core of the processing logic of
the tpc, an overview is given of the general layout of the fw design for the cru. Since
many different detectors in the alice experiment with individual requirements will use
the cru in Run 3, the fw design structure must be able to cope with this diversity. Some
of the detectors need the cru just to interface the data acquisition and the Detector
Control System (dcs) — they do not need the processing capabilities of the available
fpga — while others need very specialised functionalities like the cluster finding and
additional preprocessing steps as it is the case of the tpc. So the fw must be structured in
a modular way with clearly defined interfaces. This chapter explains first the basic layout
of those modules and then describes the interfaces.
4.1 A Modular Firmware Concept
In order to cope with the variety of different requirements, the fw is designed in a modular
way. Modules which will be needed by most of the detector teams are developed centrally
and can be used by the individual detectors if they fit their needs. The detector specific
part is then combined in the User Logic (ul) which falls completely into the responsibility
of each individual detector team. A schematic layout of the main building blocks of such a
design layout is given in figure 4.1. As can be seen, the modules besides the ul are mainly
interface modules and, not shown here, some small helper modules for individual tasks like
Clock-Domain Crossings (cdcs) or the configuration of the ul. There are four interfaces
which must be implemented. The first is the one to the Central Trigger Processor (ctp), called the
Trigger, Timing and clock distribution System (tts) interface. This module provides the
trigger information and the baseline clock with a frequency of 240 MHz (called the ttcrx
clock in the following) for the design and delivers status information back to the ctp. The
connection to the ctp is done via ttc-pon links and goes through the Local Trigger Unit
(ltu). The ltu is a hardware interface module, provided by the ctp to have a uniform
interface to all the detectors. The second and third interfaces both go through the
pcie bus to the First Level Processor (flp). One is the dma engine, taking care of
shifting the big data volumes of the detector readout from the cru to the flp memory.
The other is the configuration and dcs path, also realised through pcie. The fourth and
last interface is towards the Front-End Electronics (fee).
Since most of the detectors which use the cru will also use the gbt system and with
that the gbt protocol, there is a wrapper available to encapsulate the individual gbt core
modules of each single link into a bigger element. The gbt core modules are developed by
the gbt group together with the corresponding asics. This wrapper implements a uniform
interface and takes care of the cdc of the individual links into the common ttcrx clock.
The individual gbt cores internally use a clock that is recovered from the data stream to
deserialise and decode the data to be able to provide the frame format of the gbt protocol.
Since the next module in the data path is the ul which combines the information from
several links (even if it is just forwarded to the flp’s memory without any changes) the
data has to be in a common clock domain. Thus it makes sense to implement the crossing
already at the very beginning.

Figure 4.1: [Block diagram with the blocks FEC, GBT wrapper, detector specific User Logic,
DMA and PCIe, FLP, TTS interface, LTU/CTP and DCS, CRU configuration.] The main blocks
of the cru fw. They can be roughly categorised into
the data path (blue and red), which is going from the fecs on the left side via the gbt
wrapper, the detector specific User Logic and the dma engine to the flp on the right side,
the configuration and dcs modules shown in green and trigger and clock distribution
in yellow.

Clock-Domain Crossings
Before continuing, first a few general comments about cdcs are given. Whenever there is
a signal crossing two domains with an asynchronous or unrelated clock, one has to deal
with the effect of metastability. In a metastable state, the output of a register is not
clearly defined. In fpgas, as in all digital devices, the registers have defined timing
requirements. Only if these are met is the input signal captured correctly and the output
produced accordingly. The important timings here are the register setup time (the
minimum time the input must be at a stable state before the clock edge) and the register
hold time (the minimum time the input must still be stable after the clock edge). If those
timings are violated by the signal transition, the signal might be sampled incorrectly and
the output of the register might be metastable. If this output signal is then used further, its
state is not deterministic and can lead to unintended effects in the logic. In a synchronous
design, the signals must always meet the register timings to avoid metastability. Usually the
fitter which compiles the code takes care of achieving all timing requirements and reports
paths that violate the conditions so that the developer can examine the logic again [48].

Figure 4.2: Synchronisation register chain. The input signal is synchronised from the
clock 0 domain into the clock 1 domain with a set of successive registers.

Metastability issues mostly occur for signals crossing domains with different, unrelated or
asynchronous clocks. In such a case it cannot be guaranteed that the timing requirements
are always met by the signals. There are various techniques to minimise the
failures due to metastability issues, depending on the actual use case. The most commonly
used method to transfer asynchronous signals is via a synchronisation register chain. This
is a sequence of registers where all registers in the chain are clocked with the same clock
(or a phase-related one) except for the very first register which is located in the unrelated
clock domain. Each register in the chain, except for the last one, fans out to only a single
other register. This concept is shown in figure 4.2 where the first register is in the domain
of clock 0 and all others in the domain of clock 1. The length of the chain can be extended
for better metastability protection but must be at least two
registers. The first register clocked by the receiving clock will have an unpredictable,
possibly metastable, output which is then recovered by the second one in the chain,
assuming there is enough time for the metastability to settle. Otherwise, more registers
have to be added to improve the synchronisation behaviour, at the cost of additional
latency in the signal propagation. This concept is simple and works well when both clocks
have the same frequency or when the receiving clock domain is faster than
the sending domain. It has to be noted that this kind of synchronisation
should be used only for individual control signals and not for data buses with several
correlated bits. For a rising edge in the data input signal, the three possible outputs of the
first stage of the synchronisation register chain are either low, metastable or already high,
depending on the relation between the arrival time of the signal and the sampling clock
at this very moment. So if one tried to synchronise a set of signals via individual
synchronisation register chains, some of the signals might be sampled earlier than others of
the same set. This corrupts the output although every signal was synchronised
correctly, because the individual latencies and, with that, the states can be different. For
synchronisation applications with several correlated bits one should realise the cdc in
a different way. Two standard approaches are a handshaking
mechanism and the synchronisation via dual-clock fifos.
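Why individual synchronisation chains fail for correlated bits can be illustrated with a small software model. The following Python sketch is not part of the fw. It assumes, as a simplification, that every bit which changes close to the sampling edge is captured independently with either its old or its new value; the function name possible_samples is chosen for illustration only.

```python
from itertools import product


def possible_samples(old, new, width):
    """Return every bus value the receiving domain may capture during old -> new.

    Model assumption: each changing bit resolves independently to its old
    or to its new value, as happens with one synchroniser chain per bit.
    """
    changing = [bit for bit in range(width) if ((old ^ new) >> bit) & 1]
    results = set()
    for choice in product((0, 1), repeat=len(changing)):
        value = old
        for bit, take_new in zip(changing, choice):
            if take_new:
                value ^= 1 << bit  # this bit already shows its new value
        results.add(value)
    return results


# Two bits change at once: values that were never sent become possible.
print(sorted(possible_samples(0b01, 0b10, 2)))
# Only one bit changes: the result is either the old or the new value.
print(sorted(possible_samples(0b01, 0b11, 2)))
```

In this model a 2 bit bus changing from 01 to 10 can be observed as 00 or 11, although neither value was ever present at the input, whereas a transition that changes a single bit yields only the old or the new value.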
In a handshaking mechanism, the correlated data signals at the input are kept constant
until the message arrives that they were correctly synchronised into the target clock domain.
For this purpose an additional control signal is used, informing the receiving side that new
data is available. After the data signals are sampled in the target domain, another control
signal is sent back to the source domain to release the data there. With this approach
only the two control signals have to be synchronised into the respective clock domains, for
example via synchronisation register chains, while ensuring that an arbitrarily wide data
bus is correctly synchronised. Since the control signals have to go back and forth across the
domain boundary, the handshake adds some, possibly not negligible, latency.
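The sequence of the handshake and the latency it introduces can be sketched in software. The Python model below is an illustration under strong simplifications and not the implementation used in the cru: both clock domains are advanced by one common simulation tick, the synchronisation chains are plain shift registers without metastability, and all names (handshake_transfer, req, ack) are hypothetical.

```python
def handshake_transfer(words, sync_stages=2, max_cycles=10_000):
    """Move `words` across a clock-domain boundary with a four-phase handshake.

    Returns the received words and the number of simulation ticks needed.
    """
    pending = list(words)
    received = []
    data_hold = None               # data bus, kept constant by the source
    req, ack = 0, 0                # control signals of source and target
    req_sync = [0] * sync_stages   # chain bringing req into the target domain
    ack_sync = [0] * sync_stages   # chain bringing ack into the source domain
    cycles = 0
    while len(received) < len(words) and cycles < max_cycles:
        cycles += 1
        # Only the two control bits pass through synchronisation chains.
        req_sync = [req] + req_sync[:-1]
        ack_sync = [ack] + ack_sync[:-1]
        # Target domain: sample the stable data bus, then acknowledge.
        if req_sync[-1] and not ack:
            received.append(data_hold)
            ack = 1
        elif not req_sync[-1] and ack:
            ack = 0
        # Source domain: offer new data only after the previous handshake ended.
        if not req and not ack_sync[-1] and pending:
            data_hold = pending.pop(0)
            req = 1
        elif req and ack_sync[-1]:
            req = 0
    return received, cycles


received, cycles = handshake_transfer([0x3FF, 0x001, 0x2AA])
print(received, cycles)
```

Every word needs several ticks because request and acknowledge each have to pass a synchronisation chain twice, which is the latency mentioned above.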
A dual-clock fifo is a fifo where the input ports can be clocked with a different clock
than the output ports. With this, the data can be transferred from one clock domain into
another. The data is then written to and read from a ram while internally the available
read and write addresses are properly synchronised from one clock domain into the other
(for example by using Gray-Code counters∗) and the flags for both interfaces, like full and
empty flags, are generated directly in the correct clock domain. Those dual-clock fifos are
usually provided as soft (synthesisable code, provided in a hardware description language)
or hard (not changeable) ip cores by the vendors, ready to be used.
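The property that makes Gray-Code counters suitable for passing fifo addresses between clock domains can be checked in a few lines. The following Python sketch uses the standard reflected binary Gray code. Whether the vendor ip cores use exactly this encoding is not stated here and is an assumption; the address width of 4 bit is an arbitrary example.

```python
def binary_to_gray(value):
    """Convert a non-negative integer to reflected binary Gray code."""
    return value ^ (value >> 1)


def gray_to_binary(gray):
    """Invert binary_to_gray by XOR-ing all right shifts of the code word."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


def bits_changed(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


ADDRESS_BITS = 4  # example width of a fifo address counter
codes = [binary_to_gray(i) for i in range(1 << ADDRESS_BITS)]
# Compare each code with its successor, including the wrap-around to zero.
steps = [bits_changed(codes[i], codes[(i + 1) % len(codes)])
         for i in range(len(codes))]
print(steps)
```

Since every step, including the wrap-around, changes exactly one bit, a synchronised address can only be the old or the new value and never an unrelated intermediate one.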
4.2 Interfaces to the User Logic
It is very important to clarify the interfaces to the ul before going into the details. Those
interfaces are the connections to the outside world and must be fixed at an early stage
of the development process, since many details of the further implementation depend on
them. Therefore, they are explained in detail in the following. For the data path, the important
interfaces are to the gbt wrapper to receive the gbt frames and to the dma engine to
send the processed results. The interface to the tts module provides a base clock and all
necessary trigger information. In the following, the word input is used for paths going into
the ul, while output describes paths going out of the ul.
4.2.1 GBT Wrapper
-- GBT down-link (CRU -> FEE)
gbt_tx_ready_i : in std_logic_vector (23 downto 0);
gbt_tx_bus_o : out t_cru_gbt_array (23 downto 0);
-- GBT up-link (FEE -> CRU)
gbt_rx_ready_i : in std_logic_vector (23 downto 0);
gbt_rx_bus_i : in t_cru_gbt_array (23 downto 0);
Listing 4.1: Vhdl interface to the gbt wrapper for 24 links.
∗The Gray-Code, named after Frank Gray, is a representation of the binary numeral system in
which successive values differ by only one bit. Because of this, slightly different timings in the
individual bits do not lead to short-lived wrong states; the value merely changes slightly later or earlier.
The clock domain of the interface to the gbt wrapper is the already mentioned ttcrx clock.
The interface itself consists of two almost identical parts, the tx part for the down-link
(cru → fee) and the rx part for the up-link (fee → cru). For both of the paths there is
an input vector providing a status bit for each link. This bit indicates the readiness of the
specified link for either a down-link communication or to receive data. In addition there is
the data bus for both, up- and down-link. These buses consist of an array of records, one
per link, containing all the relevant information. Each record contains two individual bits
to indicate the validity of the current data: one flag for the general status information
and one flag which is high in one out of six Clock Cycles (ccs) to indicate that the data can
be used. This ratio comes from the ratio of the two involved clocks, the 240 MHz in which
the interface is implemented and the basic 40 MHz clock of the gbt protocol. Additionally,
there is the 4 bit Slow-Control (sc) field available and a 112 bit vector combining the data
field and the fec field. Since the tpc is using the gbt in wide bus mode, all the 112 bit
are relevant. It has to be mentioned that the ordering of the data within this vector follows
the group ordering and not the bit ordering of the gbt frame (see figure 3.10). This means
that the normal data field is located in the bits [79:0] and the additional 32 bit of the fec
field can be found in the bits [111:80]. That is a very important detail for the usage of the
data in wide bus mode within the ul [49].
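The bit assignment of the 112 bit vector can be written down in a few lines. The Python sketch below is only a software illustration of the ordering stated above; the function and constant names are hypothetical and not part of the cru fw.

```python
DATA_BITS = 80   # normal GBT data field, bits [79:0]
FEC_BITS = 32    # FEC field, carrying data in wide bus mode, bits [111:80]

# The data-valid flag is high once per GBT frame: 240 MHz / 40 MHz.
CYCLES_PER_FRAME = 240 // 40


def split_wide_bus(word):
    """Split the 112 bit vector into (data field, FEC field)."""
    if not 0 <= word < 1 << (DATA_BITS + FEC_BITS):
        raise ValueError("expected an unsigned 112 bit value")
    data = word & ((1 << DATA_BITS) - 1)
    fec = word >> DATA_BITS
    return data, fec


example = (0xDEADBEEF << DATA_BITS) | 0x1234
data, fec = split_wide_bus(example)
print(hex(data), hex(fec), CYCLES_PER_FRAME)
```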
4.2.2 DMA Engine
-- Endpoint 0
FCLK0 : out std_logic;
FVAL0 : out std_logic;
FSOP0 : out std_logic;
FEOP0 : out std_logic;
FD0 : out std_logic_vector (255 downto 0);
-- Endpoint 1
FCLK1 : out std_logic;
FVAL1 : out std_logic;
FSOP1 : out std_logic;
FEOP1 : out std_logic;
FD1 : out std_logic_vector (255 downto 0);
Listing 4.2: Vhdl interface to the data path wrapper.
The interface to the dma engine is implemented via a data path wrapper in between.
Therefore, the dma engine is not directly accessible from the ul. The data path wrapper
selects the data streams which are transmitted to the flp (could be the output of the
ul or directly the individual gbt links) and does the flow control. Nevertheless, the
actual hardware is still reflected in the interface to the wrapper by providing the two pcie
endpoints of the cru individually.

Figure 4.3: [Timing diagram of the signals FCLK, FVAL, FSOP, FEOP and FD, where FD carries
the word sequence H0, H1, d0, d1, ..., dn.] The transmission protocol towards the data path
wrapper. The data stream
consists of two header words, followed by the actual data. The sop flag must be high for
the very first word of a package, while the eop marks the very last one. The valid flag is
used to specify when the data is to be used, based on [49].

FD bits      FD[31:0]  FD[63:32]  FD[95:64]  FD[127:96]  FD[159:128]  FD[191:160]  FD[223:192]  FD[255:224]
Memory word  0         1          2          3           4            5            6            7
Figure 4.4: Mapping of the data bus into the flp memory. The 256 bit data words are
chopped into pieces of 32 bit of which the least significant one appears first in memory.

For both endpoints the interface is implemented as a dual-clock fifo. In this way the
clock used in the ul, or at least the clock used to write the output data, is independent
of the clock used to transmit the data via the pcie bus. The user has only a basic fifo
interface to write data from the ul to the host machine: a clock FCLK0/1 needs to be
provided which is used to write to the fifo, a valid flag FVAL0/1 to enable the writing and
the data bus FD0/1 containing a 256 bit data word which is written. There are no full or
empty flags provided as they usually can be found in a fifo interface since the data path
wrapper checks internally the level of the fifo and informs the ctp via a dedicated path
in case of an overflow to synchronously drop the data across all crus. Two more bits need
to be set, the Start-Of-Packet (sop), in the interface description marked with FSOP0/1,
and the End-Of-Packet (eop), in the interface description marked with FEOP0/1. They are
needed by the protocol shown in figure 4.3, which must be complied with. As usual, the
valid flag marks when the 256 bit data word is valid and is used as a write enable. So only
those data words with the valid flag high are transferred to the host memory. Therefore, the
words do not have to be contiguous; a valid sequence may also have gaps in between.
The sop marks the beginning of a new packet, starting with the first word of the in total
512 bit wide Raw Data Header (rdh), while the eop marks the end of the packet after
which a new packet must be started, again with the sop. Everything in between belongs
to the same header, regardless of how often the valid signal has toggled [49]. A detailed
description of the data header and its fields is given in appendix C.
The rdh is actually a sequence of four 128 bit words as can be seen in figure C.1. Since
the bus towards the data path wrapper (and via the pcie) has a width of 256 bit the headers
H0 and H1 (the first two data words in figure 4.3) must be composed of two rdh words
each. It must be noted that the data is represented in little endian, so the bus FD[255:0] is
mapped into the flp's memory as a sequence of eight 32 bit words in the order shown in
figure 4.4. Thus, requiring that rdh 0 appears in the memory before rdh 1, the lower bits
of H0 must belong to rdh 0 and the higher bits to rdh 1, which is maybe counterintuitive
when just looking at the protocol. The same is true for H1 with rdh 2 and rdh 3.
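The composition of the two header words and the resulting memory layout can be checked with a short software model. The Python sketch below is only an illustration of the ordering described above; the function names are hypothetical and the rdh contents are placeholder values, not real header fields.

```python
MASK32 = 0xFFFFFFFF


def pack_header_words(rdh):
    """Combine four 128 bit RDH words into the two 256 bit bus words H0, H1."""
    if len(rdh) != 4 or any(not 0 <= word < 1 << 128 for word in rdh):
        raise ValueError("expected four unsigned 128 bit words")
    h0 = rdh[0] | (rdh[1] << 128)   # RDH 0 in the lower, RDH 1 in the upper bits
    h1 = rdh[2] | (rdh[3] << 128)   # RDH 2 in the lower, RDH 3 in the upper bits
    return h0, h1


def memory_order(fd):
    """Eight 32 bit words of FD[255:0] in the order they appear in memory."""
    return [(fd >> (32 * index)) & MASK32 for index in range(8)]


h0, h1 = pack_header_words([0xA0, 0xA1, 0xA2, 0xA3])  # placeholder contents
print([hex(word) for word in memory_order(h0)])
print([hex(word) for word in memory_order(h1)])
```

The four memory words of rdh 0 come out before those of rdh 1, which is the required order in the flp memory.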
It is also possible that a package contains no data at all and consists only of the two
header words. Then, the sop flag is set as usual for H0 and the eop must be set for H1.
This might be useful to finalise a series of packets belonging to the same Heart Beat (hb)
trigger via the stop bit of the corresponding header field. More details about the hb can
be found in subsection 5.3.5.
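The framing rules of the protocol, including the packet which consists of the two header words only, can be summarised in a small software model. The Python sketch below is an illustration and not the fw implementation; the dictionary representation of a bus cycle and the function names are assumptions made for this example, as is the choice to ignore the sop and eop flags whenever the valid flag is low.

```python
IDLE = {"valid": False, "sop": False, "eop": False, "fd": 0}


def frame_packet(h0, h1, payload=()):
    """Return the bus cycles of one packet: two header words plus payload."""
    words = [h0, h1, *payload]
    last = len(words) - 1
    return [{"valid": True, "sop": index == 0, "eop": index == last, "fd": word}
            for index, word in enumerate(words)]


def collect_packets(cycles):
    """Reassemble packets from bus cycles, skipping cycles without valid."""
    packets, current = [], None
    for cycle in cycles:
        if not cycle["valid"]:
            continue                      # gaps inside a packet are allowed
        if cycle["sop"]:
            current = []
        if current is None:
            raise ValueError("valid word outside of a packet")
        current.append(cycle["fd"])
        if cycle["eop"]:
            packets.append(current)
            current = None
    return packets


stream = frame_packet(1, 2, [3, 4])
stream.insert(3, IDLE)                    # a gap between the two data words
stream += frame_packet(5, 6)              # header-only packet: SOP on H0, EOP on H1
print(collect_packets(stream))
```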
4.2.3 TTS Interface
TTC_RXCLK : in std_logic;
TTC_RXRST : in std_logic;
TTC_RXREADY : in std_logic;
TTC_RXVALID : in std_logic;
TTC_RXD : in std_logic_vector (199 downto 0);
Listing 4.3: Vhdl interface to the tts module.
The main signals coming from the Trigger, Timing and clock distribution System (tts) are
the clock and the trigger signals as well as global control commands. This is also reflected
in the interface towards the corresponding module. The baseline clock for the ul, the
240 MHz ttcrx clock, is received from the TTC_RXCLK port. There is also the possibility
to send a global reset, TTC_RXRST, and a ready flag to indicate the overall validity of the
signals, TTC_RXREADY. The data bus TTC_RXD delivers the trigger pulses and bunch crossing
information to synchronise the readout of all the crus from all detectors. TTC_RXVALID
marks when this bus is ready and the content can be used. The meaning of the individual
bits of the data bus is described in [50]. The main signal, currently used by the ul from
the tts, is the hb signal which is located in bit 1 of the TTC_RXD bus [49].
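As a minimal illustration, the extraction of the hb flag from the data bus can be written as follows. The Python sketch only reflects the bit position stated above; gating the flag with both the valid and the ready signal is an assumption of this example, and the function name is hypothetical.

```python
HB_BIT = 1  # position of the heart beat flag within TTC_RXD, as stated above


def heartbeat_seen(rxd, rx_valid, rx_ready=True):
    """Return True if the current bus word carries a heart beat trigger."""
    return bool(rx_ready and rx_valid and (rxd >> HB_BIT) & 1)


print(heartbeat_seen(0b10, rx_valid=True))    # HB bit set, bus valid
print(heartbeat_seen(0b10, rx_valid=False))   # content must not be used
```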

Chapter 5
A 2D Cluster Finder for the TPC
The following chapter covers the main topic of this thesis, the development of the cf for the
tpc. This cf will run on the fpga of the Common Readout Unit (cru) and reconstructs
the charge clusters in real-time during the readout. This is the most challenging and
resource consuming part of the Firmware (fw) design. There are various steps before
the actual charge clusters can be found on a representation of the tpc pad plane in the
cru, like the decoding of the gbt frames, resorting of the individual pads based on the
configuration of the readout card and the Baseline Correction (blc). All those individual
modules are covered in the following sections.
5.1 Overview of the TPC User Logic
The User Logic (ul) is the detector specific part in the cru. For the tpc it implements
mainly a cf. But there is more which needs to be done in order to provide usable adc
values to the cf network and to store the final clusters in the memory of the First Level
Processor (flp). An overview of the main building blocks is given in figure 5.1. The
processing chain can be subdivided into two parts, a data preparation part and a cluster
reconstruction part. They are logically decoupled and placed after each other. There is
one fundamental principle behind the overall design of the ul: keep the data as separated
as possible, do not merge paths unnecessarily. The tpc ul needs to process the data of
up to 1 600 individual pads. Each pad provides a 10 bit adc value, which becomes a 14 bit
Fixed-Point (fp) number after the blc, in the following written as 10.4 fp number with
10 bit for the integer part and 4 bit for the fractional part. The four additional bits are
needed for a sufficient precision of the correction. This sums up to 16 000 signal paths before
and 22 400 after the blc which need to be routed together once they are merged. To reduce the effort for the fitter
during the compilation of the design and to achieve timing closure, the individual parts
which can be kept separated must also be kept separated. This is realised by doing the
preparation up to the sorting for each input link individually. The links are completely
separated and do not influence each other. Since clusters need to be found within single
rows of the pad plane, the processing is done after the sorting for the rows independently.
To reduce the complexity further for the single cf instances, each row is subdivided into
smaller row-segments with a width of only a few pads. Here, each cf instance can find
clusters autonomously in its individual pad plane range. The data is merged only at the
very end where it is absolutely necessary to write all clusters to one of the two output
fifos of the Direct Memory Access (dma) engine. An inevitable crossing occurs in the
sorting module where the data of all links have to be combined to reassemble a pad plane
representation from which the rows are extracted and separated again.

Figure 5.1: [Block diagram with the blocks GBT dec., sorting, BLC and CM calc. in the data
preparation part and CF, CP and merging in the cluster reconstruction part.] A simplified
block diagram of the tpc ul, showing the main building
blocks, subdivided into the data preparation and cluster reconstruction parts. The
data preparation part consists of the decoding of the gbt frames, a sorting algorithm to
reassemble a pad plane representation and the blc for which the cm contribution must be
calculated. To reconstruct the clusters, they must first be found with the cf after which
the cp can calculate the cluster properties. In the end, the data from all cp instances
must be merged to transfer them to the flp.

5.2 Data Preparation
To prepare the data in a way that charge clusters can be found, three basic steps need
to be done. First, the gbt frames must be decoded. On average eight gbt frames are
needed to get all channels of one time bin. Afterwards, the channels must be sorted. The
data arrives on each link from a different Front-End Card (fec) in the order of the sampa
channels. To reassemble a representation of the pad plane, the mapping of the individual
sampa channels to a specific pad position needs to be applied. This process is very costly
because the mapping is different for each fec in each readout region hence for each cru in
a sector. There are two basic approaches to overcome this issue, either leaving the mapping
completely free (or at least as free as necessary so that one of the ten different mappings
can be selected) via the configuration of the cru, or having a fixed mapping for the ten
different regions and compiling ten versions of the fw. The first approach leads to huge
routing matrices, since in the worst case, each of the 1 600 input channels could end on a
different pad in each of the ten regions. With a 14 bit number, these are 224 000 possible
paths which all have to be realised in hardware and made selectable.
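The path counts quoted in this chapter follow directly from the number of pads and the word widths. The short Python calculation below only reproduces this arithmetic; the variable names are chosen for illustration.

```python
PADS = 1600      # maximum number of pads handled by one CRU
ADC_BITS = 10    # width of a raw ADC value
FP_BITS = 14     # width of a 10.4 fixed-point value after the BLC
REGIONS = 10     # number of different channel-to-pad mappings

paths_raw = PADS * ADC_BITS                    # signal paths before the BLC
paths_corrected = PADS * FP_BITS               # signal paths after the BLC
paths_free_mapping = PADS * FP_BITS * REGIONS  # freely configurable mapping

print(paths_raw, paths_corrected, paths_free_mapping)
# -> 16000 22400 224000
```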
On the other hand, having the mapping fixed reduces this number significantly. However,
one has to put considerably more effort into maintaining the whole system. After a single change,
ten different fws have to be compiled. This first introduces computational complications
as the compilation time for a single fw for the Arria10 is around 10 h as soon as the
fpga becomes relatively full. This can be addressed with a parallel compilation on different
machines. However, one has to keep in mind that a rather powerful machine is needed to achieve
those compilation times. Intel recommends on its website [51] to have at least 18–48 GB
of physical ram to compile a fw for the Arria10 and the machine which achieved the
compilation time of 10 h had more than 16 cpu cores, which is the maximum number the
software is currently able to utilise for the overall compilation procedure [52]. Then again,
these are just technical details which could easily be solved if the need exists. The bigger
issue is that in fact ten different fws must be compiled, with the danger that for some of
them it is maybe harder to achieve timing closure because of a more complex logic and
a different routing. This would perhaps require a separate treatment of the designs in
other aspects as well, not just the different mapping. This makes it very hard to achieve
a homogeneous system in the end. Therefore, it was decided to accept a more complex
design with a configurable mapping which fits all regions, with the advantage of having just
one fw. How this is achieved is described in subsection 5.2.2.
After the sorting is done, the baseline must be corrected. The correction includes three
different components, a pedestal subtraction to subtract the offset of the electronics, a
gain correction to compensate for gain variations in the detector and the Common Mode
(cm) correction which was already discussed. The pedestal value is different for each pad
and needs to be subtracted from the adc values. Its origin lies in the electronics, while
the cm is due to the capacitive coupling of the individual pads and results in a global
downwards shift of the baseline and therefore has to be added to the adc value. The gain
correction, however, is a multiplicative adjustment and corrects for non-uniformities of
the amplification in the Gas Electron Multiplier (gem) stack. The blc is done only after
the sorting because it increases the data volume. A 10 bit integer value is expanded to
a 10.4 fp number, which would require the sorting of 40 % more signal paths if applied
beforehand.
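The 10.4 fp representation and the three components of the correction can be illustrated with a small software model. The Python sketch below is not the fw implementation. The order of the operations (pedestal subtraction, addition of the cm shift, multiplication with the gain), the unsigned saturating number format and all names are assumptions made for this example.

```python
INTEGER_BITS = 10
FRACTION_BITS = 4
TOTAL_BITS = INTEGER_BITS + FRACTION_BITS
SCALE = 1 << FRACTION_BITS          # one integer step equals 16 raw counts
MAX_RAW = (1 << TOTAL_BITS) - 1


def to_fp(value):
    """Quantise a real number to an unsigned 10.4 fixed-point integer."""
    return max(0, min(MAX_RAW, round(value * SCALE)))


def from_fp(raw):
    """Convert a 10.4 fixed-point integer back to a real number."""
    return raw / SCALE


def baseline_correct(adc, pedestal, gain, cm_shift):
    """Assumed order: subtract pedestal, add the CM shift, apply the gain."""
    return to_fp((adc - pedestal + cm_shift) * gain)


# Relative increase of the number of signal paths from 10 bit to 14 bit.
extra_paths = (TOTAL_BITS - INTEGER_BITS) / INTEGER_BITS

raw = baseline_correct(adc=100, pedestal=72.25, gain=1.0, cm_shift=1.5)
print(raw, from_fp(raw), extra_paths)
```

With four fractional bits the corrected value is resolved in steps of 1/16 adc count, and the wider word explains the 40 % additional signal paths.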
In parallel to the sorting, the cm calculation is done. This module calculates the average
adc value of all pads without a signal peak. This does not require a sorting in advance
because all pads are summed up, independent of the location. The peak exclusion is done
as a function of the sampa channel instead of the pad. In this way, the peak is detected
only in time direction but it was seen in the studies about the cm correction that this
is sufficient [33, 53]. Before the summation can be done, the pedestal value needs to be
subtracted as well. This can also be done as a function of the sampa channel, but it has
to be ensured that the same value is used in this part of the design as in the actual blc
module. The gain should not be corrected beforehand, since the cm originates from the signals
after the amplification in the gem stack, so the gain variation is part of the cm signal.
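The averaging described above can be sketched in software as follows. The Python model is only an illustration: the peak flags are taken as given input because the peak detection itself is not modelled here, the function name is hypothetical and the example values are invented.

```python
def common_mode(adc_values, pedestals, has_peak):
    """Average pedestal-subtracted ADC value of all channels without a peak.

    All three sequences are indexed by SAMPA channel, so no sorting is needed.
    """
    selected = [adc - pedestal
                for adc, pedestal, peak in zip(adc_values, pedestals, has_peak)
                if not peak]
    if not selected:
        return 0.0   # assumption of this sketch: no estimate, no correction
    return sum(selected) / len(selected)


adc_values = [70, 68, 300, 71]          # invented example, one time bin
pedestals = [72, 70, 71, 73]
has_peak = [False, False, True, False]  # channel 2 carries a signal peak

shift = common_mode(adc_values, pedestals, has_peak)
print(shift)   # a negative value corresponds to a downward baseline shift
```

In this example the baseline of the channels without signal lies two adc counts below the pedestal, so two counts would have to be added back in the blc.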
The processing is done in a pipelined way. The data comes in sequentially, channel
after channel (pad after pad subsequent to the sorting), and is processed in this order.
There is no buffering needed in between. Due to this design approach, the data flow
is completely deterministic and the time information need only be assigned later, during
the peak finding, when it is needed. As a reminder, there is no time information sent
together with the raw data from the fecs; it needs to be added within the cru. If the
time had to be added at the very beginning, one would need to propagate this
information properly throughout the whole design together with the data. However, since
the latency is deterministic, there is only a constant offset which can be added also later in
software. The cm calculation is an exception in this sense because here at least two time
bins are needed for the peak finding and the exclusion. Then a buffering is needed and
done locally. Since this module is not part of the actual processing chain (it receives
a copy of the adc values, does the cm calculation independently of the other logic and
provides a single value to the blc module where it is applied), this is not an issue.
The only condition is that the result of this module, the average adc value of all pads
without a signal peak for each individual time bin, arrives at the right time in the blc
module.

Figure 5.2: [Bit map of the frame content, bits 111 to 0, with one row each for the bit
assignment, the sampa eLink number and the sampa chip id.] Detailed mapping of the sampa
output into the gbt frame (without header
and sc). The first line shows the individual bits of the output, with the time ordering
highlighted (yellow first, blue last). The same scheme applies also for the three clock
fields marked in green. The second row shows the sampa eLink output number and the
sampa chip id is given in the third row. The sampa information is given for even
region numbers.

5.2.1 Decoding of the GBT Frames
The very first module of the tpc ul is the gbt decoder. This module is instantiated for each
input link separately; it thus receives the data from one half of a fec, or from 2.5 sampas.
In section 3.2 it was already discussed that the tpc will use all sampas in the same
configuration, the Direct ADC Serialisation (das) mode with the split mode transmission,
to be able to decode the data from all sources in the same manner. Therefore, the adc
sampling clock of the three source sampas as well as the five Half-Word (hw) sequences
are contained in each gbt frame. Figure 5.2 combines the information of figure 3.10 and
figure 3.11c and indicates the purpose of the individual bits of the frame. The figure shows
the mapping for the even regions. The odd regions are similar, but sampa 0 is replaced
with 3, sampa 1 with 4 and the eLinks 0 to 4 of sampa 2 are replaced with the eLinks
5 to 9. The ordering of the eLinks within the frame is purely determined by the fec
layout. It depends only on how the sampas are connected to the input groups of the gbtx
asics. It can be seen that each eLink contributes 4 bit to the frame. These are temporally
consecutive output cycles of the sampa where the msb (marked in yellow) is the first one
and the lsb (marked in blue) the last one. The same applies for the clock fields which
are marked in green to guide the eye. All bits of a sampa with the same colour belong
together and form in total five hws. The rearrangement to combine the same coloured bits
to form the individual hws is done in the HW assembling block of figure 5.3. Afterwards,
each of the five hw sequences can be analysed independently.
The fields containing the 5 MHz sampling clock of the sampas are extracted separately
and monitored by a dedicated module, again independently for each of the three sources.
The input ports of the gbtx asic are sampled with a 160 MHz clock. If a 5 MHz clock
is sampled with 160 MHz, the result will be a periodic pattern with a length of 32 bit
where the signal is expected to be low for 16 bit and afterwards high for another 16 bit. If
this pattern is now chopped into pieces with a length of 4 bit to transmit it via the gbt
frame, the resulting (correct) sequence can only be one of the four possibilities shown
in figure 5.4. To monitor the adc clock, it is sufficient to look for the pattern with the
rising edge (the second field in each sequence) and check that the following sequence is
as expected. This can easily be implemented in a simple Finite State Machine (fsm). The
fsm remains in the ground state until one of the first transition patterns is seen, locks
to the corresponding sequence number and just passes through the following eight states.
In each state the expected pattern is checked and if the input deviates, the fsm returns
to the ground state. If all states are passed through once successfully, the adc clock was
correctly recognised. After that, the fsm continues to look for the next pattern of the same
sequence and any deviation is reported as an error. Higher-level Detector Control
System (dcs) monitoring modules can then act accordingly, e.g. by sending a synchronous
reset signal to all the fecs and restarting the readout after accumulating some critical
number of failures.
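The four valid sequences and the behaviour of the monitoring fsm can be reproduced with a small software model. The Python sketch below is an illustration under assumptions and not the fw implementation: the function names, the counting of the states and the decision to return to the ground state after a reported error are choices made for this example.

```python
# Groups which contain the rising edge, mapped to their sequence number.
EDGE_PATTERNS = {"1111": 0, "0111": 1, "0011": 2, "0001": 3}


def clock_groups(sequence, n_groups):
    """Sample a 5 MHz clock at 160 MHz and chop the result into 4 bit groups.

    One clock period covers 32 samples. The first group contains the rising edge.
    """
    bits = ["1" if (i - sequence) % 32 < 16 else "0" for i in range(4 * n_groups)]
    return ["".join(bits[i:i + 4]) for i in range(0, len(bits), 4)]


def monitor(groups):
    """Return (locked, errors) after checking a stream of 4 bit groups."""
    expected, step = None, 0
    locked, errors = False, 0
    for group in groups:
        if expected is not None:
            if group == expected[step]:
                step += 1
                if step == 8:              # one full period was seen correctly
                    locked, step = True, 0
                continue
            if locked:
                errors += 1                # deviation after a successful lock
            expected, locked = None, False # back to the ground state
        if group in EDGE_PATTERNS:         # ground state: wait for a rising edge
            expected = clock_groups(EDGE_PATTERNS[group], 8)
            step = 1
    return locked, errors


for sequence in range(4):
    print(sequence, " ".join(clock_groups(sequence, 8)))

corrupted = clock_groups(1, 24)
corrupted[12] = "1010"                     # inject a single wrong group
print(monitor(clock_groups(1, 24)), monitor(corrupted))
```

The model locks to each of the four sequences, and after a single corrupted group it reports one error and locks again at the next rising edge.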
Decoding the other hws to reassemble the adc values is a bit more tricky. As already
mentioned in section 3.2, the sampas in das mode send a continuous stream of data
without any header information. The only way to tell which channel is currently received
is the synchronisation pattern (figure 3.9) at the very beginning. So this pattern must
be recognised and afterwards the channel can be determined by counting the number of
received values. The channel number follows the ordering given in table 3.1 for the two hw
sequences of the split mode. The detection of the synchronisation pattern is realised in the
same way as the monitoring of the adc clock: an fsm goes through all its eight states and
checks if the expected pattern is received. The fsm needs only eight states instead of 32,
which is the length of the synchronisation pattern, because four hws are received at once
(the four different colours of figure 5.2), analogously to the four temporally consecutive
bits of the adc clock. This has the same consequence that the synchronisation pattern can
start with one out of the four simultaneously received hws. The simplest case is when the
pattern starts at position zero, which is shown in figure 5.5 as sequence 0. Then the four
5 bit hws contained in one gbt frame form two 10 bit adc values. In all other cases the
content of two gbt frames is needed to reassemble the two adc values. The sequences
0 and 2 look very similar in that sense, since here, also four hws are contained within
one gbt frame which belong to two complete adc values (hwn−1 and hwn). The issue in
this case is the very first frame after a reset where the adc value n − 1 is the last part
of the synchronisation pattern. In this case, the decoder could provide only one value
instead of two. This would shift all following outputs by one channel. Since the output
of the decoder module should always be the same, independent of the location of the
synchronisation pattern in the gbt frame, to simplify the further processing, the content
of two gbt frames need to be merged to form a reliable output sequence. This decoding of
the channels, meaning that the detection of the synchronisation pattern and merging of
the hws to the correct adc values is done for each of the five streams individually, as can
be seen in figure 5.3. Each of the channel decoder will therefore provide two adc values
for each incoming gbt frame, together with an id for the channel number. Additionally,
more status information is provided about the detection of the synchronisation pattern,
Chapter 5 – A 2D Cluster Finder for the TPC
[Block diagram with the blocks HW assembling, channel dec. 0 to 4, clk mon. 0 to 2 and
output FSM; input: GBT frame; outputs: adco[1:0] and ido[3:0].]
Figure 5.3: Block diagram of the gbt decoder. The HW assembling rearranges the
individual bits of the gbt frame to form the individual hws, which are then analysed by
the channel decoders. Each channel decoder looks for the synchronisation pattern,
combines two hws into an adc value and assigns a channel number. The fsm at the output
takes care of a serialised data stream. In parallel, the adc sampling clocks of the sampas
are monitored by the corresponding modules.
Sequence 0: ... 0000 1111 1111 1111 1111 0000 0000 0000 0000 1111 ...
Sequence 1: ... 0000 0111 1111 1111 1111 1000 0000 0000 0000 0111 ...
Sequence 2: ... 0000 0011 1111 1111 1111 1100 0000 0000 0000 0011 ...
Sequence 3: ... 0000 0001 1111 1111 1111 1110 0000 0000 0000 0001 ...
Figure 5.4: The four valid adc clock sequences. Each sequence consists of eight groups
of 4 bit each and is repeated periodically. A valid sequence must contain sixteen 1s
followed by sixteen 0s. The phase can therefore be one of the four cases shown.
             gbt frame i                                             | gbt frame i+1
Sequence 0:  hw_{n,lsb}    hw_{n,msb}    hw_{n+1,lsb}  hw_{n+1,msb}  | hw_{n+2,lsb}  hw_{n+2,msb}  hw_{n+3,lsb}  hw_{n+3,msb}
Sequence 1:  hw_{n-1,msb}  hw_{n,lsb}    hw_{n,msb}    hw_{n+1,lsb}  | hw_{n+1,msb}  hw_{n+2,lsb}  hw_{n+2,msb}  hw_{n+3,lsb}
Sequence 2:  hw_{n-1,lsb}  hw_{n-1,msb}  hw_{n,lsb}    hw_{n,msb}    | hw_{n+1,lsb}  hw_{n+1,msb}  hw_{n+2,lsb}  hw_{n+2,msb}
Sequence 3:  hw_{n-2,msb}  hw_{n-1,lsb}  hw_{n-1,msb}  hw_{n,lsb}    | hw_{n,msb}    hw_{n+1,lsb}  hw_{n+1,msb}  hw_{n+2,lsb}
Figure 5.5: The four possible hw sequences. A gbt frame always contains four hws,
so only the four cases shown are possible. Everything else would just be a shift by one
complete frame.
5.2 – Data Preparation
[Timing diagram: clk; input signals gbt data valid, gbt data and gbt extra data; output
signals adc valid, id o (idn, idn+1, idn+2) and adc o (repeating sequence d0 to d4).]
Figure 5.6: Data interface of the gbt decoder. The input is the payload of the gbt
frame, the 80 bit of gbt data and the 32 bit extra data from the fec field. The data
output (there are also other ports, e.g. for monitoring and configuration) consists of a
data port, delivering two adc values at once and an id port, flagging the data according
to table 5.1.
in particular the start position of the pattern to determine which of the four sequences
was present.
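The merging of two consecutive gbt frames described above can be sketched in a few lines
of Python. This is only an illustration of the principle and not the fw implementation:
the function names, the list-based frame representation and the assumption that the hw
received first sits at index 0 are hypothetical, while the lsb-before-msb order follows
figure 5.5.

```python
def split_adc(value):
    """Split a 10 bit adc value into its two 5 bit hws (lsb part, msb part)."""
    return value & 0x1F, (value >> 5) & 0x1F


def merge_frames(prev_frame, cur_frame, offset):
    """Form two 10 bit adc values from two consecutive frames of one hw stream.

    prev_frame, cur_frame: the four hws of frame i and frame i+1.
    offset: position (0 to 3) at which the synchronisation pattern started,
            i.e. the sequence number of figure 5.5 (assumed to be known here).
    """
    window = list(prev_frame) + list(cur_frame)  # eight hws in arrival order
    values = []
    for k in (0, 2):
        lsb = window[offset + k]
        msb = window[offset + k + 1]
        values.append((msb << 5) | lsb)
    return tuple(values)


def decode_stream(frames, offset):
    """Yield one pair of adc values per frame pair, for any of the four offsets."""
    for prev_frame, cur_frame in zip(frames, frames[1:]):
        yield merge_frames(prev_frame, cur_frame, offset)
```

Because the window always spans two frames, the output has the same form for all four
sequences, which is the property required from the decoder.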
Each of the five hw streams contains only 16 of the 32 channels of a sampa. Since
two adc values are always decoded at once, a 3 bit id ranging from 0 to 7 is sufficient to
mark those channels unambiguously. Together with the hw stream number and the decoder
number, which corresponds to the link number and therefore to a specific card, the exact
channel can be identified. The mapping between this 3 bit id and all the channels of a fec
is shown in table 5.1. Each of the two subtables presents the mapping for one of the two
output ports, table 5.1a for port 0 and table 5.1b for port 1. The first column shows the
id. The other five columns show the respective sampa–channel combination for the five hw
streams, marked with d0 to d4. The tables show this combination for both the odd and the
even regions. The even regions receive the data from sampa 0 to 2, shown in orange, while
the odd regions get the data from sampa 2 to 4, shown in blue. For the first four columns,
only the sampa number changes between the two region types, while the channel number
stays the same. This was achieved by connecting sampa 3 and 4 to gbtx 1 in the same way
as sampa 0 and 1 are connected to gbtx 0 (see the discussion about the fec layout in
subsection 3.3.2). The last column contains the mapping for sampa 2. The data of this chip
is split between the two involved regions: the even regions receive the channels 0 to 15,
while the odd regions get the channels 16 to 31. By knowing the id together with the
number of the hw stream (and the region from which the cru receives its data), the
sampa–channel combination can be obtained from this table. With the channel and the
stream number plus the region, the pad from which the data originates can be identified,
which is important for the sorting in the next subsection.
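The content of table 5.1 follows a regular pattern and can be written as a small function.
The following Python sketch is only meant to illustrate the mapping; the function name and
its arguments are hypothetical and not taken from the fw.

```python
def sampa_channel(stream, id_, port, odd_region):
    """Return (sampa, channel) for one decoder output value, following table 5.1.

    stream:     hw stream, i.e. data cycle d0 to d4 (0 to 4)
    id_:        3 bit id delivered by the decoder (0 to 7)
    port:       0 for adco[0] (table 5.1a), 1 for adco[1] (table 5.1b)
    odd_region: True if the cru reads an odd region
    """
    base = 2 * id_ + port                 # channel within a block of 16 channels
    if stream < 4:
        # d0 to d3: same channel for both region types, sampa shifted by 3
        sampa = stream // 2 + (3 if odd_region else 0)
        channel = base + 16 * (stream % 2)
    else:
        # d4: always sampa 2, channel offset of 16 for the odd regions
        sampa = 2
        channel = base + (16 if odd_region else 0)
    return sampa, channel
```

For example, `sampa_channel(3, 7, 1, True)` gives sampa 4, channel 31, which is the last
entry of the fourth column in table 5.1b.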
Some additional comments are in order about the final interface of the gbt decoder, shown
in figure 5.6, especially about the output signals. The input signals are defined by the
        d_i = 0       d_i = 1       d_i = 2       d_i = 3       d_i = 4
id_i    sampa  ch     sampa  ch     sampa  ch     sampa  ch     sampa  ch
0x0     0/3    0      0/3    16     1/4    0      1/4    16     2      0/16
0x1     0/3    2      0/3    18     1/4    2      1/4    18     2      2/18
0x2     0/3    4      0/3    20     1/4    4      1/4    20     2      4/20
0x3     0/3    6      0/3    22     1/4    6      1/4    22     2      6/22
0x4     0/3    8      0/3    24     1/4    8      1/4    24     2      8/24
0x5     0/3    10     0/3    26     1/4    10     1/4    26     2      10/26
0x6     0/3    12     0/3    28     1/4    12     1/4    28     2      12/28
0x7     0/3    14     0/3    30     1/4    14     1/4    30     2      14/30
(a) Content of adco[0] as a function of the id and the data cycle. The crus of the even
regions receive the orange sampa channels, while the crus of the odd regions receive the
blue ones.

        d_i = 0       d_i = 1       d_i = 2       d_i = 3       d_i = 4
id_i    sampa  ch     sampa  ch     sampa  ch     sampa  ch     sampa  ch
0x0     0/3    1      0/3    17     1/4    1      1/4    17     2      1/17
0x1     0/3    3      0/3    19     1/4    3      1/4    19     2      3/19
0x2     0/3    5      0/3    21     1/4    5      1/4    21     2      5/21
0x3     0/3    7      0/3    23     1/4    7      1/4    23     2      7/23
0x4     0/3    9      0/3    25     1/4    9      1/4    25     2      9/25
0x5     0/3    11     0/3    27     1/4    11     1/4    27     2      11/27
0x6     0/3    13     0/3    29     1/4    13     1/4    29     2      13/29
0x7     0/3    15     0/3    31     1/4    15     1/4    31     2      15/31
(b) Content of adco[1] as a function of the id and the data cycle. The crus of the even
regions receive the orange sampa channels, while the crus of the odd regions receive the
blue ones.
Table 5.1: Content of the gbt decoder data output. The output has two fields, each
providing one adc value per cc. The content of port 0 is shown in (a) and of port 1 in (b).
Since each cru receives the data of only one half of a fec, the content is different for the
even (orange) and odd (blue) regions; entries of the form x/y give the value for the even
and the odd regions, respectively. Please note that for d0 to d3, the channel number
is the same for all regions, but the sampa changes from 0 to 3 and 1 to 4, while the data
for d4 always comes from sampa 2 but the channel numbers have an offset of 16 for the
odd regions.
underlying gbt protocol. There is the 80 bit wide gbt data bus and in addition the 32 bit
wide gbt extra data bus containing the data from the fec field. They are updated with
a frequency of 40 MHz, so in a 240 MHz clock domain only one out of six Clock Cycles
(ccs) is utilised. Simply writing the decoded adc values to the output ports as soon as
they are ready would be a waste of resources: the output bus would have a width of
10 × 10 bit plus additional ids, all of which would be utilised only one sixth of the time.
To overcome this issue, a small fsm is placed before the output of the decoder (see last
block in figure 5.3) to implement a more serialised output. Because the input is a
continuous stream, the output must be completed by the time the next data arrives, so at
most six ccs are available to provide the ten channels which are contained in one gbt
frame on average. A good compromise between a uniform output and saving resources is to
provide only two adc values at once, but for five consecutive ccs. This is then the final
output interface of the decoder. The overall decoding introduces a latency of three ccs,
after which the decoded values are provided for five ccs in a deterministic order, labelled
d0 to d4 in figure 5.6 and table 5.1. The five hw streams serve as a natural basis for this
division, meaning d0 contains the lower channel numbers from sampa 0 (or sampa 3 for the
odd regions), d1 the higher channel numbers of sampa 0, and so on. The arrows in
figure 5.6 indicate that the content of one gbt frame is found in the following output
sequence. This is only true for the special case when the synchronisation pattern is
found at position zero for all the hw streams. Otherwise, the frame before will also be
used to form the output.
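The serialised output schedule can be illustrated with a short Python sketch. It only
shows the bookkeeping of the six available ccs; the names are hypothetical, the latency of
three ccs is not modelled, and placing the idle cycle at the end is an assumption.

```python
CCS_PER_FRAME = 240 // 40   # six clock cycles per gbt frame
STREAMS = 5                 # hw streams d0 to d4, two adc values each


def serialise(frame_values):
    """Spread the ten adc values of one frame over the available clock cycles.

    frame_values: five (adc0, adc1, id) tuples, one per hw stream.
    Returns one entry per clock cycle; None marks the unused cycle.
    """
    assert len(frame_values) == STREAMS
    cycles = [(d, adc0, adc1, id_)
              for d, (adc0, adc1, id_) in enumerate(frame_values)]
    cycles += [None] * (CCS_PER_FRAME - STREAMS)
    return cycles
```

The data bus is then only 2 × 10 bit wide instead of 10 × 10 bit, while five of the six
ccs carry data.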
5.2.2 A Two-Stage Sorting
After the gbt frames have been decoded, the input channels need to be sorted. The
order in which the data arrives is based on the channel ordering of the sampas, whereas
the clusters must be found at a later stage on neighbouring pads. This means that a
mapping from the individual channels of the sampas on each fec to a specific pad location
must be implemented and applied to reorder the values before they reach the subsequent
modules.
The pad location is given by a row number within a region and a pad number within
a row. Those two quantities vary strongly between the ten regions, the number of rows
between 12 for the outermost region and 18 for region number four, and the number of pads
between 66 for the first row of the innermost region and 138 for the last rows of region nine.
More about the details of the pad plane layout can be found in appendix B. As an example
for the mapping of a sampa channel to a pad location an excerpt of the Inner Readout
Chamber (iroc) pad plane is shown in figure 5.7. The small red rectangles, in which the
corresponding sampa channel is written, represent the single pads. The vertical blue lines,
which are not necessarily straight, mark the border of a fec. All pads within one vertical
slice arrive through the same optical link and are decoded by the same gbt decoder. The
regions are separated by the horizontal straight blue line which is just visible at the upper
edge of the figure. The general layout is the same as in the figures in appendix B where
the sampa id is given instead of the sampa channel number. Within the sorting module
the transition must be made from the up to 20 input links, which correspond to vertical
slices in the pad plane, to the 18 rows, which are horizontal. It can already be seen by
eye in the excerpt that the assignment of a certain channel to a row or to a pad within a row
Figure 5.7: Excerpt of the iroc pad plane with the corresponding sampa channels
written on the individual pads, taken from [34].
is different for the individual fecs, and even more diverse if one compares the mapping
between different regions. Unfortunately, it does not follow a pattern, but was chosen
instead to fulfil other criteria, e.g. to have similar trace lengths between the pad and the
connector. Therefore, the mapping must be configured individually for each input channel
of which there are up to 1 600 in one cru. There are three requirements for the sorting
module:
1. It must finish in time. The readout is continuous and the data cannot be
buffered. With a clock frequency of 240 MHz, only 48 ccs are available to sort all
the 1 600 channels before a new readout cycle starts.
2. The resource consumption must not be unreasonably high, since the cf, which
follows later in the fw, is expected to be the largest consumer of resources.
3. The sorting must be configurable during runtime in such a way that the mapping for
all the ten different regions can be achieved with the same fw.
Different approaches have been examined to determine whether they meet these conditions.
The simplest method was to use the id delivered by the decoder together with the source
link to simply look up the target row and the target pad. This information is then used to
store the adc value in a 2-dimensional array (18 rows × 138 pads/row = 2 484 pads) from
which the value can be taken for further processing. This approach is in principle very fast
because only the look-up latency has to be taken into account. The configurability is
also given because the mapping depends only on the content of the Lookup-Tables (luts),
which can be filled with arbitrary values. However, the resource consumption is far too
high, simply due to the combinatorics. Each of the 20 × 2 output ports of all the decoders
would have to be able to write into all of the 2 484 possible pads. This gives in total close to
[Sorting-network diagram: inputs C0 to C6 on the left, outputs C’0 to C’6 on the right.]
Figure 5.8: Illustration of a sorting network for seven elements. The vertical
interconnections represent a comparison of the two involved numbers with a possible
exchange afterwards, if necessary. Independent of the order of the input values Cx, the
outputs C’0 to C’6 are always sorted.
10^6 paths which need to be realised and multiplexed for 10 bit adc values. This number
could be reduced by additional boundary conditions like limiting the possible pad range
within a row for the individual fecs. Combining the mapping from all regions, it can be
found that e.g. link 0 transmits only the pads 0 to 6, link 1 the pads 4 to 13, and so
on. The possible target row can also be restricted by knowing that e.g. sampa 0 and 3
are always connected to the lower row numbers. However, besides the effort to code all
those limitations, the combinatorics remain very high and the module would still consume
a majority of the resources, which rules out this approach.
Another method considered was to use a real sorting algorithm to achieve the correct
mapping. A set of sorting networks [54] can sort the channels according to the row and
pad numbers which are looked up for each channel. The working principle of a sorting
network is indicated in figure 5.8. Each input number, in our case the pad number, is
represented as a horizontal wire. The inputs are compared with each other in a predefined
order and swapped if the two compared values are not ordered. In this way, an arbitrary
sequence of pad numbers is sorted. However, a network for all the 138 possible pads of a
row (assuming each row is sorted independently) would be very large, since the complexity
increases with O(n · log2 n) [54], and the final design would need 18 of those. To reduce
the complexity of the individual networks, it is possible to split up the big sorting network
into smaller ones which pre-sort the channels of only one link. With this, the number of
inputs of one network is decreased dramatically to only seven, which is the highest number
of neighbouring pads delivered by one link for all possible fec locations in a sector. An
optimal network, i.e. one with the smallest possible number of comparisons, needs only 16
comparisons for seven inputs. The network shown in figure 5.8 is such an optimal network
for seven inputs, generated with the Bose-Nelson algorithm [55]. The resulting pre-sorted
row segments then still have to be merged in a second stage to complete the full pad row.
However, the disadvantage of this approach is that all the channels need to be provided in
parallel, instead of sequentially as it was implemented for the gbt decoder in order to save
resources. In addition, not only the 10 bit word of the adc value is needed, but also a
6 bit number to identify a row between 0 and 17 and an 8 bit number for a pad between 0
and 137. Even if the row number is omitted because a predefined pipeline is used for each
row, the number of bits per channel is almost doubled, which leads in the end to big
routing matrices and with that to a very high resource consumption.
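To make the working principle of such a network concrete, the following Python sketch
generates a comparator list with the recursive Bose-Nelson construction and applies it to
a list of values. It is a software illustration under stated assumptions: the function
names are hypothetical, the construction has been checked here only for seven inputs, and
the order of the comparators need not coincide with the drawing in figure 5.8.

```python
def bose_nelson(n):
    """Return the comparator list [(i, j), ...] of a Bose-Nelson network for n inputs."""
    comparators = []

    def merge(i, x, j, y):
        # Merge the sorted run of length x starting at i with the run of length y at j.
        if x == 1 and y == 1:
            comparators.append((i, j))
        elif x == 1 and y == 2:
            comparators.extend([(i, j + 1), (i, j)])
        elif x == 2 and y == 1:
            comparators.extend([(i, j), (i + 1, j)])
        else:
            a = x // 2
            b = y // 2 if x % 2 else (y + 1) // 2
            merge(i, a, j, b)
            merge(i + a, x - a, j + b, y - b)
            merge(i + a, x - a, j, b)

    def sort(i, m):
        # Sort the m wires starting at wire i: sort both halves, then merge them.
        if m > 1:
            a = m // 2
            sort(i, a)
            sort(i + a, m - a)
            merge(i, a, i + a, m - a)

    sort(0, n)
    return comparators


def apply_network(comparators, values):
    """Run the compare-and-exchange steps of the network on a copy of values."""
    values = list(values)
    for i, j in comparators:
        if values[i] > values[j]:
            values[i], values[j] = values[j], values[i]
    return values
```

For seven inputs, `len(bose_nelson(7))` evaluates to 16, in agreement with the number of
comparisons quoted above. In hardware every comparator would be a separate
compare-and-exchange unit, which is why all channels must be available in parallel.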
The final solution keeps this two-stage approach with a pre-sorting on the fec level and
a subsequent merging of the row segments, but replaces the sorting network with a ram
cell. The sorting can also be achieved by writing the adc values to pre-defined addresses of
the ram and reading them back in a configurable order. This allows the use of the ram
cells of the fpga, which already provide the routing matrices without additional resources.
It also turns the serialised output of the gbt decoder into an advantage. The Arria10 has
two types of embedded memory: the 20 kbit Memory Blocks (m20ks), which are dedicated
blocks of 20 480 bit memory resources, and the Memory Logic Array Blocks (mlabs), which
provide 640 bit each and are made up of ten Adaptive Logic Modules (alms), i.e. of logic
resources [45]. The mlabs can be configured as simple dual-port rams, giving one write
and one read port, whereas the m20ks can also be configured as true dual-port rams,
giving two independent write and read ports. Due to the continuous data stream to the
sorting module of two adc values at once for five out of six ccs, the true dual-port mode is
needed to be able to deal with both values at once. Therefore, the m20k was chosen for this
purpose even though the amount of stored data of 80 channels × 10 bit/channel = 0.8 kbit
is far smaller than the available space.
However, having one ram in true dual-port mode is not sufficient in this case. If the
two available ports are continuously used for the writing process, there is no time to read
the data back. So in principle a quad-port ram would be needed, with two ports for the
writing and two ports for the reading process. To have four ports available in total, two
ram cells can be used. The data must always be stored via the two ports in the same ram
and cannot be split across two independent rams because of the reading sequence. Having
the data split would require that, during the read sequence, one value is read from each
of the involved rams in each cycle. This strongly limits the freedom of possible output
sequences, whereas having all the data in the same ram allows for an arbitrary reading
sequence. Therefore, those two dual-port rams must be implemented in a ping-pong ram
configuration. While one ram is being written to, the other ram is being read. After a
completed cycle the roles are swapped: the first ram is read and the second one is written.
Such a ping-pong ram is the core of the sorting module, shown in figure 5.9. Each of the
two ports is dedicated to one of the gbt decoder output ports: ram port A receives the
data of adco[0] and port B that of adco[1]. The two respective write addresses are
calculated in a deterministic way from the delivered id and the data cycle:
\[ \mathrm{add}_A = (d_i \cdot 16) + (\mathrm{id}_i \cdot 2) \tag{5.1} \]
\[ \mathrm{add}_B = (d_i \cdot 16) + (\mathrm{id}_i \cdot 2) + 1, \tag{5.2} \]
where di ranges from 0 to 4 and idi from 0 to 7, as can be seen in table 5.1. In this way it
is ensured that the data from the same channel of the same sampa is always stored at the
same address, for all the fec positions in all regions. To apply the sorting, those addresses
have to be applied to the two read address ports in the correct order. With this, pad after
pad can be read from the ram; when the two values read at once are combined, the data
from port A always comes first. This sequence of address pairs is always the same for a
specific fec location and can therefore be stored in a lut during the configuration of the
cru at runtime.
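The interplay of the write addresses from equations (5.1) and (5.2) with the configurable
read sequence can be illustrated by a small software model. This is a behavioural sketch
only; the class and method names are hypothetical, and the timing of the real dual-port
rams is not modelled.

```python
def write_addresses(d, id_):
    """Write addresses of ram port A and port B, equations (5.1) and (5.2)."""
    add_a = d * 16 + id_ * 2
    return add_a, add_a + 1


class PingPongSorter:
    """Two rams of 80 words each: one is written while the other one is read."""

    def __init__(self, read_lut):
        self.read_lut = read_lut              # list of (address, address) pairs
        self.rams = [[0] * 80, [0] * 80]
        self.write_sel = 0                    # index of the ram currently written

    def write(self, d, id_, adc0, adc1):
        add_a, add_b = write_addresses(d, id_)
        ram = self.rams[self.write_sel]
        ram[add_a] = adc0                     # port A stores adco[0]
        ram[add_b] = adc1                     # port B stores adco[1]

    def swap(self):
        """Exchange the roles of the two rams after a completed readout cycle."""
        self.write_sel ^= 1

    def read_sorted(self):
        ram = self.rams[self.write_sel ^ 1]   # the ram that is not being written
        return [(ram[a], ram[b]) for a, b in self.read_lut]
```

The sorting is contained entirely in `read_lut`; exchanging its content changes the
output order without touching the write side.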
So far, the output of the ping-pong ram would be just a sequence of channels, like the
input but in a different order. To complete the sorting in two dimensions, row-breaks are
[Block diagram with the blocks address generator, RAM 0, RAM 1, LUT and segmenter;
inputs: ido and adco[1:0].]
Figure 5.9: A configurable pre-sorting module based on a ping-pong ram. The write
address is generated from the id marking the sampa channel while the read addresses are
stored in the correct ordering in a lut. The values of the read sequence are then divided
into short row segments, indicated as 80 green filled pads in the reduced pad plane.
needed at the correct location. For this, a two bit command is appended to the set of two
read addresses which is used by the segmenter, schematically shown in figure 5.9, to split
up the sequence of channels into short row segments. The length of the segments varies
between four and seven pads, depending on the row and fec location. The two bits are
needed because three cases are possible: the row must be changed after the first of the
two simultaneously read pads, after the second pad, or not at all in this read cycle. To
have the behaviour of the segmenter controlled solely by this command, the fourth
available bit combination is used to mark the very last read cycle, after which this
module is reset and starts again with the first pad in the first row. In this way, the sorting
is completely flexible and depends only on the content of the lut. Since a maximum of
7 adjacent pads and 18 rows are read by one fec, the port width of the module can be
limited to 18 times 7 pads. The second output bus of the pre-sorter is a bit-mask with a
width of 18, which first marks the validity of the segment output bus and second
identifies the row to which the current segment belongs. Here, too, the segment output is
sequential, one segment after the other via the same port, in order to avoid spending
resources on parallel outputs which are rarely used.
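The behaviour of the segmenter can be sketched as follows. The numerical encoding of the
two bit command, the function name and the simple row counter are assumptions made for
this illustration; only the four cases themselves are taken from the description above.

```python
# Hypothetical encoding of the two bit command
NO_BREAK, BREAK_AFTER_FIRST, BREAK_AFTER_SECOND, LAST = range(4)


def segment(read_cycles):
    """Split a sequence of read cycles into row segments.

    read_cycles: iterable of (pad0, pad1, command), one entry per read cycle.
    Returns a list of (row, [adc values]) in the order the segments complete.
    """
    segments, current, row = [], [], 0
    for pad0, pad1, command in read_cycles:
        current.append(pad0)
        if command == BREAK_AFTER_FIRST:
            segments.append((row, current))
            current, row = [], row + 1
        current.append(pad1)
        if command in (BREAK_AFTER_SECOND, LAST):
            segments.append((row, current))
            current, row = [], row + 1
        if command == LAST:
            row = 0                           # restart with the first row
    return segments
```

Since the commands are stored next to the read addresses, the segment lengths are defined
by the lut content alone.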
As a next step, those variable-length segments that belong to the same row but come
from different links need to be merged to reassemble the complete pad rows. It turned
out that there is no advantage in having the full pad rows with a length of 138 pads
each. On the contrary, keeping the pad row segmented allows for a more efficient
implementation of the baseline correction module and the cluster finder. However, the
segmentation needs to be changed from a variable to a fixed length. The optimal size was
found to be six pads, for two reasons. First, the baseline correction is mainly done with
Digital Signal Processors (dsps) which can be configured to process two streams in
parallel (more in subsection 5.2.4). With three dsps, all six pads of the segment can be
processed in parallel without leaving half of a dsp unused. The second reason is that each
cluster finder instance has six unique and four shared pads (more details are discussed in
section 5.3). Having the segmentation set to six has the advantage that each instance
needs the data of only two consecutive
segments. So the task of the row merger module is to realise the transition from the
processing of each input link individually to a joint processing of all links combined, by
merging the segments belonging to the same row from all links. One way to realise this is
to implement all sequential mappings and make them selectable via two parameters: the
region number, which is configurable during runtime, and the row number within the
region, which is a generic parameter of the module. Having the latter as a generic allows
the same module, containing all the mappings, to be instantiated once for each of the 18
rows per region, letting the compiler remove the respective irrelevant parts. This means
that this module is not completely configurable at runtime, but the correct mapping for
the relevant region can be selected during runtime, fulfilling all requirements. The
complete mapping for all regions consists of more than 4 000 assignment statements. Since
it is almost impossible to write this manually without errors, it was generated
automatically using the alice O2 framework (section 6.3). There, the original source files
which were also used to build the real pad plane provide the correct mapping for
simulation and reconstruction purposes. During the assignment, two empty pads (filled
with an adc value of 0) are added to the left side of the row, below pad zero, and four
empty pads are added to the right side of the row. Two pads on each side are necessary to
be able to find peaks also at the borders (more in section 5.3), and the additional two on
the right side complete the segmentation of six (2 + 138 + 4 = 144, which is divisible
by six). Since only the outermost rows of region nine have 138 pads in the real system,
there are some unused pads with their signal set to zero by default. The output ordering
is somewhat unintuitive: it starts with the segment containing the highest pad numbers and
goes down to the segment with the lowest pad numbers. This is a remnant from the
time when the cf was not yet completely worked out and was introduced to unravel the
processing. Since this order is not really important for further processing (it only has to
be implemented correctly in the following modules), it was never changed.
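The padding and the fixed segmentation can be summarised in a few lines of Python. The
sketch below is illustrative only; the names are hypothetical, and placing the unused pads
of shorter rows at the upper end of the row is an assumption.

```python
LEFT_PAD, RIGHT_PAD, SEGMENT = 2, 4, 6
MAX_PADS = 138


def fixed_segments(row_values):
    """Pad one row to 144 entries and cut it into 24 segments of six pads.

    row_values: adc values of one pad row (at most 138 entries).
    The segments are returned from the highest pad numbers down to the lowest.
    """
    unused = [0] * (MAX_PADS - len(row_values))   # assumed position of unused pads
    padded = [0] * LEFT_PAD + list(row_values) + unused + [0] * RIGHT_PAD
    segments = [padded[i:i + SEGMENT] for i in range(0, len(padded), SEGMENT)]
    return segments[::-1]
```

With this layout, each row always yields 144 / 6 = 24 segments, independent of the actual
number of pads in the row.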
5.2.3 Common Mode Calculation
The cm effect is the main reason why the readout scheme had to be changed with respect to
the original Technical Design Report (tdr). The correction algorithm to be implemented
has already been introduced in subsection 3.1.1. In summary: the mean adc value of all
pads, with the signals excluded, must be subtracted from every pad for each time bin. This
algorithm can be separated into two steps: first the calculation of the mean value,
which is done in this module, and second the subtraction of this value from the individual
pads, which is done together with the other corrections in the blc module. How the cm
calculation is realised is shown in figure 5.10. The inputs to this module are the outputs
of the 20 gbt decoders and the output is one value for every 5 MHz readout cycle. As a
first step, the peak exclusion takes place. For this, two consecutive time bins must be
buffered to be able to exclude the whole peak for the best performance. It must be noted
that the peak detection and exclusion is only necessary in time direction since it is done
for all pads anyhow. Hence, no sorting is needed beforehand to group together neighbouring
pads. That is why this calculation can take place in parallel to the sorting algorithm
which was described in the previous subsection. To find a peak, the difference between two
consecutive time bins is calculated and if this exceeds a configurable threshold, a rising
[Block diagram: the inputs dec. 0 to dec. 19 are each followed by a peak exclusion block
and an accumulator stage 1; a common accumulator stage 2 and a divider produce the
output CM.]
Figure 5.10: Block diagram of the cm calculation module. A peak exclusion module and
a first accumulator stage are implemented for each of the input links. A second accumulator
stage then combines the results and a divider normalises the value to the number of
contributing pads.
edge in the signal is detected. If this rising edge is followed by a falling edge, again
detected via the difference of two consecutive time bins, the peak is found and can be
excluded. Since three time bins are needed to be able to decide that a peak is present,
two time bins need to be buffered.
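The peak detection in time direction can be sketched for a single pad as follows. This is
a simplified illustration: the function name is hypothetical, the falling edge is required
to follow the rising edge immediately, and marking all three involved time bins as
excluded is an assumption, since the exact extent of the exclusion is not specified here.

```python
def peak_mask(samples, threshold):
    """Mark the time bins of one pad which belong to a detected peak.

    samples:   adc values of one pad for consecutive time bins.
    threshold: configurable minimum difference for a rising or falling edge.
    Returns a list of booleans, True for excluded time bins.
    """
    excluded = [False] * len(samples)
    for t in range(2, len(samples)):
        rising = samples[t - 1] - samples[t - 2] > threshold
        falling = samples[t - 1] - samples[t] > threshold
        if rising and falling:
            # The two buffered time bins and the current one are excluded.
            excluded[t - 2] = excluded[t - 1] = excluded[t] = True
    return excluded
```

Only differences of the same pad enter the decision, so a constant pedestal does not
influence the detection.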
After the exclusion, the first accumulator stage takes place. Here, the charge of all
surviving pads of one time bin is summed up and the number of contributors (number
of surviving pads) is counted. Before summing them up, the pad-wise pedestal value
needs to be subtracted. This is not important for the peak detection, since this additive
contribution cancels out in the subtraction of two time bins of the same pad, but it would
bias the accumulated charge in a non-deterministic way because of the excluded pads. If
no pad were excluded, the sum of all pedestal values could simply be subtracted from
the final accumulated charge. Such an accumulator could also be implemented in logic by
hand, but in order to save resources the provided dsp blocks should be used for this kind
of task. The utilised Arria10 fpga contains in total 1 518 dsps, of which
some can be used at this place. The Arria10 native fp dsp ip core, which is able to carry
out arithmetic operations like adding and multiplying of fp numbers, is used. A
detailed description of the ip core and all its functionalities is given in [56]. Important for
the application is that it can be configured for different operating modes. The so-called
“18 × 18 Sum of 2” mode, of which the block diagram is shown in figure 5.11a, fits the
requirements of this module perfectly. There are two parallel input groups (the gbt decoder
output is two adc values at once), each having first a Pre-adder available which can
be used for the pedestal subtraction. In this case, the Multiplier which comes next for
each input can be set to one and therefore be ignored. Next, the Adder calculates the sum
of both pedestal-subtracted input values, and the Chainadder at the end can be used to
accumulate the charge of a complete readout cycle. This single dsp core takes care of all
the arithmetic operations needed for the first accumulator stage, without any additional
resources, by applying the equation
\[ \mathrm{resulta} = \sum \bigl( (\mathrm{ay} - \mathrm{az}) + (\mathrm{by} - \mathrm{bz}) \bigr) \tag{5.3} \]
to the input signals. The names of the variables were taken from the block diagram so
that they can be mapped directly to the individual ports. Also, the bit width of the dsp
[Three block diagrams reproduced from the Intel Arria 10 Native Fixed Point DSP IP Core
User Guide.]
(a) The “18 × 18 Sum of 2” mode.
(b) The “18 × 18 full” mode.
(c) The “27 × 27” mode.
Figure 5.11: Block diagram of the Arria10 native fp dsp ip core in three different
configuration modes, all three taken from [56].
is sufficient because at most 80 channels are summed up for the individual links, each a
10 bit adc value. This means that a 17 bit number can hold the maximum possible value
of 80 · 1 023 = 81 840. Even after adding four additional bits for the fp decimals, the result
is significantly below the maximum output width of the dsp of 64 bit.
In the second stage, the results from the individual links need to be further accumulated.
A different dsp configuration was chosen, the “27 × 27” mode shown in figure 5.11c, to
accommodate the increased bit width. This configuration provides only one input, which
can therefore be up to 27 bit wide, sufficient to accumulate the 21 bit numbers of the
previous stage. In this case, the Pre-adder can be used to add up two values, which are
then further accumulated by the Chainadder. A small fsm first applies the results of two
links at a time to the inputs ay and az, and afterwards reuses this dsp to also
sum up the number of contributors of each of the links. There is enough time to accomplish
these two tasks sequentially because each accumulation needs only 10 ccs (20 links divided
by two simultaneous operations). Here as well, 48 ccs are available to process the complete
readout cycle.
The last step in the calculation of the cm value is the division of the accumulated charge
by the number of pads which contributed to the value. A soft ip core provided by Intel is
used for this. The ip cores delivered by the vendor usually synthesise into more efficient
logic than a self-written vhdl module. All modules provided for dedicated
tasks, one of which is a divider, are described in [57]. They are used by instantiating the
corresponding module and configuring it through several generics to adapt it to the design
requirements. Since there is plenty of time at this stage (again up to 48 ccs), the divider
is configured to use many pipeline stages for the division of a 26 bit number by an 11 bit
number to reduce the resource consumption.
The implementation of the module is a shared task. The development of the general
concept is part of this thesis, while the actual implementation in vhdl is carried out by
colleagues from the Nagasaki Institute of Applied Science in Japan and is still work in
progress. Once the implementation is complete, the total latency needs to be determined.
Should it be larger than the latency of the sorting and merging modules, the data needs to
be delayed so that the cm value of the correct time bin is used in the blc module. This
can be done in the ping-pong ram of the pre-sorting. Currently only 80 out of the 2 048
available addresses are used for the sorting. By having an offset between the write and
read addresses, up to ⌊2048/80⌋ = 25∗ time bins could be buffered here without additional
resources.
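The bit widths, cycle counts and buffer depth quoted for the cm calculation can be cross-checked with a few lines of Python. This is only a plausibility sketch: all constant and variable names are chosen here for illustration and do not come from the firmware, and only the numbers are taken from the text.

```python
# Plausibility check of the numbers quoted for the cm calculation.
# All names are illustrative assumptions; only the values come from the text.
CHANNELS_PER_LINK = 80        # at most 80 channels are summed per link
ADC_MAX = 1023                # largest 10 bit adc value
FRACTION_BITS = 4             # additional bits for the fp decimals
LINKS = 20
CYCLES_PER_TIME_BIN = 48
RAM_ADDRESSES = 2048          # depth of the ping-pong ram
ADDRESSES_PER_TIME_BIN = 80   # addresses currently used for the sorting

# First stage: sum over one link.
link_sum_max = CHANNELS_PER_LINK * ADC_MAX
link_sum_bits = link_sum_max.bit_length()
stage1_bits = link_sum_bits + FRACTION_BITS

# Second stage: sum over all links, still carrying the fp decimals.
total_sum_max = LINKS * (link_sum_max << FRACTION_BITS)
total_sum_bits = total_sum_max.bit_length()

# Divider operands: accumulated charge and number of contributing pads.
contributors_max = LINKS * CHANNELS_PER_LINK
contributors_bits = contributors_max.bit_length()

# Two links are added per clock cycle, and the dsp is used twice in sequence.
accumulation_cycles = LINKS // 2
both_tasks_cycles = 2 * accumulation_cycles

# Time bins that fit into the ram when it is used as a delay buffer.
buffered_time_bins = RAM_ADDRESSES // ADDRESSES_PER_TIME_BIN

print(link_sum_max, link_sum_bits, stage1_bits)
print(total_sum_bits, contributors_bits)
print(both_tasks_cycles, "of", CYCLES_PER_TIME_BIN, "ccs used")
print(buffered_time_bins, "time bins can be buffered")
```

Under these assumptions the accumulated charge needs 25 bit and the number of contributors 11 bit, so both fit into the 26 bit by 11 bit divider mentioned above.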
5.2.4 Baseline Correction
The last module of the data preparation part is the blc module. Three contributions
need to be corrected for: first the gain non-uniformity, second the much discussed moving
baseline (the cm effect) and third the pedestal value of the adc of the electronics. While
the latter two corrections are additive, the first one is multiplicative. Since these distortions
are applied to the real signals in this order, they have to be corrected in reverse order:
first the pedestal value, then the cm and last the gain. This can also be summarised in the
∗These brackets denote the floor function, e.g. ⌊3.3⌋ = floor(3.3) = 3.
[Figure 5.12 diagram: six inputs ADCraw,0 to ADCraw,5; three dsp blocks supplied with the cm value and with gain and pedestal values from a lut; three min/max blocks; six outputs ADCcor,0 to ADCcor,5.]
Figure 5.12: Block diagram of the blc module. The six parallel input values are
processed by three dsp blocks. The gain correction factors and pedestal values are stored
in a lut. Before using the pedestal value, the cm contribution is added. After the
correction is done, a module ensures that 0x0000 ≤ adccor ≤ 0x3FFF is always true.
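equation
$$\mathrm{ADC}_{\mathrm{cor}} = \left(\mathrm{ADC}_{\mathrm{raw}} - c_{\mathrm{ped}} + c_{\mathrm{cm}}\right) \cdot c_{\mathrm{gain}}, \qquad (5.4)$$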
where ADCcor is the corrected adc value, ADCraw the raw value from the Front-End
Electronics (fee), cped the pedestal value, ccm the cm value and cgain the gain correction
factor. It would be beneficial in terms of resource consumption if this arithmetic
operation, which needs to be applied to each individual pad, could also be carried out by a
dsp. Indeed, there is another configuration of the dsps available which implements almost
exactly this equation, and does so on two independent inputs. The configuration is called
the “18 × 18 full” mode, of which a block diagram is shown in figure 5.11b. Only the two
additive terms, cped and ccm, have to be combined beforehand. Besides that, there is
again the Pre-adder, which is used for the additive correction, and this time the Multiplier
is used for the gain correction. The dsp also fulfils the requirements in terms of port widths. The
input is the 10 bit adc value, converted beforehand to a 10.4 fp number, which is also the
expected output precision.
To be able to process the six pads of the row segment from the sorting and merging
module simultaneously, three dsp cores are needed. This is also shown in the block diagram
of the blc module in figure 5.12. Each of the three dsps processes two adc values in
parallel. The required gain correction factors and pedestal values are stored in a lut during
configuration, from which they are read depending on the input segment number. Although
the number of words in the lut is 24 and thus quite low (one per segment), each word
is rather wide because in total six sets of gain factors and pedestal values need to be read
at once. The precision of those factors is chosen to be the same as that of the resulting corrected
adc value; they are therefore 10.4 fp numbers as well. The module was designed in such a
way that each correction can be enabled or disabled individually. With the corresponding
configuration it can be decided at runtime which correction is turned on or off.
A further process is connected downstream of the calculation to check the results. Since
the cf is designed to work with positive 10.4 fp numbers, it must be ensured that the
output values of the blc module are exactly this. If the subtracted pedestal value is too large,
the result of the calculation can turn negative. This is rather easy
to filter out because negative numbers are naturally represented as two’s complement†
in fp (or integer) arithmetic, so only the msb of the result needs to be checked. If
this bit is one, the complete value is set to zero. On the other hand, adding a large cm
value or multiplying with a gain correction factor above 1 can cause an overflow,
meaning the result is larger than the maximum possible value in 10.4 fp representation of
1 023 + 15/16 = 0x3FFF. For this, all bits between the msb of the result and the msb of the
relevant 14 bit of the fp number need to be checked. If any of these bits is one, an overflow
has occurred and the result needs to be set to the highest possible value. This module
corrects only the baseline, meaning that the adc value is corrected and transformed from
a 10 bit integer into a 10.4 fp number. The structure of the interface, the segmentation
into 24 times 6 pads and the valid signals are kept the same as from the merger module.
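The correction of equation 5.4 together with the subsequent range check can be illustrated by a small software model. This is a sketch under stated assumptions, not the firmware: the function and constant names are hypothetical, 10.4 fp numbers are modelled as plain integers scaled by 16, and the range check is written as ordinary comparisons instead of the bit tests on the msb and the upper bits described above.

```python
# Software model of the baseline correction in 10.4 fixed point.
# Names are hypothetical; a 10.4 fp number is stored as value * 16.
FRACTION_BITS = 4
ONE = 1 << FRACTION_BITS
MAX_10_4 = 0x3FFF  # 1023 + 15/16 in 10.4 representation


def to_fp(value):
    """Convert a real number to its 10.4 fixed-point integer representation."""
    return int(round(value * ONE))


def baseline_correct(adc_raw, c_ped, c_cm, c_gain):
    """Apply (adc_raw - c_ped + c_cm) * c_gain and clamp to the 10.4 range.

    adc_raw is the 10 bit integer from the fee; c_ped, c_cm and c_gain are
    10.4 fixed-point integers (use to_fp to create them).
    """
    raw_fp = adc_raw << FRACTION_BITS           # 10 bit integer -> 10.4 fp
    summed = raw_fp - c_ped + c_cm              # additive corrections
    product = (summed * c_gain) >> FRACTION_BITS  # gain, rescaled to 10.4
    if product < 0:                             # hardware: msb is set
        return 0
    if product > MAX_10_4:                      # hardware: any upper bit set
        return MAX_10_4
    return product


print(baseline_correct(100, to_fp(50), to_fp(2.5), to_fp(1.0)) / ONE)
```

With a raw value of 100, a pedestal of 50, a cm value of 2.5 and unit gain, the model returns 840, i.e. 52.5 in 10.4 representation.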
5.3 Cluster Reconstruction
After the data is prepared, the reconstruction of the charge clusters can finally be done.
This process is subdivided into three parts: the cf, which implements a true 2-dimensional
peak finding, the Cluster Processor (cp), which calculates the cluster properties and generates
the Cluster Data Format (cdf), and, as a last step, a merging network which combines
the output of the individual cps into four final fifos. From here, the cluster data are then
written to the dma fifos to transfer them to the host machine for further processing and
forwarding to the Event Processing Nodes (epns) for the online tracking and storage.
As already mentioned, the clusters will be searched for in each pad row individually.
In the ideal case, this results in one cluster per row for a track traversing the whole tpc,
giving the corresponding number of space points for the later tracking algorithm. But
before starting to implement the cf algorithm, one has to clarify what actually needs to
be found and processed. In particular, the size of the clusters in pad and time direction needs
to be investigated. By fixing those parameters at an early stage, the whole development
and implementation process can be simplified.
5.3.1 Determining the Cluster Size
It was already stated in the original tpc tdr that, for a gem based readout
system, the broadening of the electron cloud is dominated by the diffusion during
the drift time [29]. The spread caused by the intrinsic properties of the gems is therefore
negligible and only the diffusion contribution needs to be taken into account in the calculation of the
expected cluster size. The spread of the electron cloud in longitudinal and transversal
direction is then the product of the diffusion coefficient DL/T and the square root of the drift length ldrift:
$$\delta_{L/T}(l_{\mathrm{drift}}) = D_{L/T} \cdot \sqrt{l_{\mathrm{drift}}}, \qquad (5.5)$$
†To calculate the two’s complement of a number, one has to invert the individual bits and then add 1.
| gas mixture | drift velocity vd (cm/µs) | diffusion coefficient DL (√cm) | diffusion coefficient DT (√cm) |
|---|---|---|---|
| Ne-CO2-N2 (90-10-5)‡ | 2.58 | 0.0221 | 0.0209 |
| Ne-CO2 (90-10) | 2.73 | 0.0231 | 0.0208 |
| Ar-CO2 (90-10) | 3.31 | 0.0262 | 0.0221 |
Table 5.2: Diffusion coefficients and drift velocities for electrons in different gas mixtures,
evaluated at an electric field of 400 V/cm, based on [5].
taken from [29]. These diffusion coefficients for electrons in both directions were determined
for different gas mixtures and are summarised in table 5.2, together with the drift velocities.
The coefficients for the transverse and longitudinal directions are very similar for each of the
shown gas mixtures. They are all CO2-based mixtures, of which the one with added nitrogen,
Ne-CO2-N2 (90-10-5)‡, is the baseline gas mixture for the tpc in Run 3. The width
of a cluster in pad direction can then be estimated for this gas composition by calculating
$$\sigma_T = \delta_T = 0.0209\,\sqrt{\mathrm{cm}} \cdot \sqrt{250\,\mathrm{cm}} \approx 0.3305\,\mathrm{cm} \qquad (5.6)$$
for the maximum drift length of the tpc of 250 cm, i.e. if the cluster was generated near the
central electrode. Taking a 3σ interval into account, about 1 cm on either side of the peak
must be covered to include most of the charge. With the smallest size of the individual pads of 4.16 mm
in the iroc and the largest size of 6.08 mm in the oroc 2 [34], the maximum cluster width is
then spread across ±2.4 pads around the peak in the iroc and ±1.6 pads in the oroc. So the
size of a cluster can safely be fixed to 5 pads, containing in the worst case for the iroc 99.84 %
of the charge. For the orocs, a cluster width of 4 pads would also be sufficient, but this introduces
two complications. First, one would need to implement two different cluster sizes,
which makes the whole implementation more complicated. Second, if the number is odd,
the centre is uniquely determined and the cluster extends two pads in both directions. With
an even number of pads, one would need to compare the pads next to the centre and add the
fourth to the side with the larger neighbour, which is in this case an unnecessary
complication. Therefore, the size in transversal direction is fixed to 5 pads for all rocs.
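The numbers of the transverse estimate can be reproduced with the short sketch below. It assumes a Gaussian charge distribution centred on the middle of the central pad, so that a 5 pad cluster covers ±2.5 pad widths; the function names are illustrative and not part of any existing software.

```python
import math

# Values quoted in the text; names are illustrative assumptions.
D_T = 0.0209            # transverse diffusion coefficient in sqrt(cm)
MAX_DRIFT_CM = 250.0    # maximum drift length of the tpc
CLUSTER_PADS = 5        # chosen cluster width in pad direction

# Equation 5.6: transverse width at the maximum drift length.
sigma_t = D_T * math.sqrt(MAX_DRIFT_CM)


def gaussian_fraction(n_sigma):
    """Fraction of a Gaussian contained within +/- n_sigma."""
    return math.erf(n_sigma / math.sqrt(2.0))


def pads_for_3_sigma(pad_width_cm):
    """Number of pads on each side of the peak covered by 3 sigma."""
    return 3.0 * sigma_t / pad_width_cm


def contained_fraction(pad_width_cm, n_pads=CLUSTER_PADS):
    """Charge fraction inside n_pads centred on the peak."""
    half_width = 0.5 * n_pads * pad_width_cm
    return gaussian_fraction(half_width / sigma_t)


print(f"sigma_T = {sigma_t:.4f} cm")
print(f"iroc:   +/-{pads_for_3_sigma(0.416):.1f} pads, "
      f"{100 * contained_fraction(0.416):.2f} % in 5 pads")
print(f"oroc 2: +/-{pads_for_3_sigma(0.608):.1f} pads")
```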
For the extension in the time direction, two more contributions need to be taken into
account in addition to the diffusion: the track inclination angle and the shaping time of
the electronics. Tracks with a higher inclination angle will deposit charge more along the
drift path of the electrons. This extends the cluster size in time direction. The mean width
of the clusters in longitudinal direction can be calculated with the equation:
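$$\sigma_L^2(r, l_{\mathrm{drift}}) = \frac{1}{v_d^2}\left(\delta_L^2(l_{\mathrm{drift}}) + \frac{\tan^2\lambda(r, l_{\mathrm{drift}}) \cdot L_{\mathrm{pad}}^2(r)}{12}\right) + \sigma_{\mathrm{ele}}^2 \qquad (5.7)$$
$$= \sigma_{\mathrm{det}}^2(r, l_{\mathrm{drift}}) + \sigma_{\mathrm{ele}}^2, \qquad (5.8)$$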
‡ Must be normalised to 100 %. This statement refers to a mixing ratio between Ne and CO2 of 9:1 to
which 5 % of N2 is added. The notation is given in this way for easier comparison with other gas mixtures
which have no or a different N2 admixture.
[Figure 5.13 plot: cluster width (ns) from 0 to 250 versus drift length (cm) from 0 to 250; curves: total IROC, total OROC1, total OROC2, total OROC3, diffusion only.]
Figure 5.13: Extension of the clusters in time direction as a function of the drift length,
calculated using equation 5.7 without the contribution of σele. The equation was evaluated
at all the roc boundaries and the pad lengths given in table 5.3 for the baseline gas
mixture. The acceptance shown is limited to the relevant region of |η| < 0.9, based on [5].
taken from [5], where δL(ldrift) is the spread due to the diffusion during the drift, λ(r, ldrift)
the inclination angle of the track with respect to the pad plane, Lpad(r) the length of
the pads in the different rocs and σele the approximate sigma of the semi-Gaussian signal
output of the electronics. The dependency of the inclination angle on the drift length and
the radius is a purely geometrical effect. Clusters with a long drift length were generated
close to the central electrode and therefore originate from tracks with a smaller inclination
angle. Tracks which point towards the iroc, i.e. to a smaller radius, have a larger
pseudorapidity and therefore also a greater inclination angle compared to those which
point towards the orocs. The contribution of the electronics depends on the shaping
time of tFWHM = 190 ns and can be approximated by dividing by a constant factor of 2.4:
σele ≈ tFWHM/2.4 ≈ 80 ns [5]. The detector component σdet of the equation is plotted in
figure 5.13 for the baseline gas mixture as a function of the drift length. The dashed line is
the contribution of the diffusion term, which increases with increasing drift length. The sum
of this diffusion term and the angular contribution is plotted for all the rocs individually
in different colours. With a decreasing radius (going from oroc 3 to iroc), the inclination
angle increases and with that also its contribution to the total cluster size. The inclination
angle is particularly large for clusters that are created close to the readout planes and
therefore have a short drift distance. Only the relevant pseudorapidity range of |η| < 0.9
was taken into account for the plot, which is why the drift distances for e.g. the iroc
(shown in red) range from 115 cm to 164 cm. For the orocs the same η range corresponds to
a larger drift length. The different radii and pad lengths which were used in the calculation
are given in table 5.3. Requiring that a 3σ range of the signal is contained within 5 time bins
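| roc | region | inner radius (cm) | outer radius (cm) | pad length (mm) | pad width (mm) |
|---|---|---|---|---|---|
| iroc | 0 | 84.85 | | 7.5 | 4.16 |
| | 1 | | | 7.5 | 4.20 |
| | 2 | | | 7.5 | 4.20 |
| | 3 | | 132.1 | 7.5 | 4.36 |
| oroc 1 | 4 | 134.7 | | 10.0 | 6.00 |
| | 5 | | 168.7 | 10.0 | 6.00 |
| oroc 2 | 6 | 170.8 | | 12.0 | 6.08 |
| | 7 | | 206.8 | 12.0 | 5.88 |
| oroc 3 | 8 | 208.9 | | 15.0 | 6.04 |
| | 9 | | 246.4 | 15.0 | 6.07 |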
Table 5.3: Geometric parameters of the pad plane of a tpc sector. The given radii
correspond to the bottom of the lowermost and the top of the uppermost pad row in the
individual rocs [34].
of the 5 MHz sampling clock, the condition 3σL ≤ 2.5 · 200 ns must be fulfilled, or
$$\sigma_{\mathrm{det}} \le \sqrt{\left(\frac{2.5}{3} \cdot 200\,\mathrm{ns}\right)^2 - (80\,\mathrm{ns})^2} \approx 146.2\,\mathrm{ns}, \qquad (5.9)$$
which is the case for iroc and oroc 1 as can be seen in the figure. Even for the maximum
cluster length of 180 ns in oroc 3, the 5 time bins would still contain
$$\frac{2.5 \cdot 200\,\mathrm{ns}}{\sqrt{(180\,\mathrm{ns})^2 + (80\,\mathrm{ns})^2}} \approx 2.54\,\sigma_L \qquad (5.10)$$
of the signal, or 98.89 % of the charge. It is therefore also valid in time direction to limit
the cluster to 5 time bins. To summarise, the 5 × 5 matrix around a peak
in pad and time direction contains even in the worst case more than 98.73 % of the charge.
Since the intrinsic energy resolution of the gem system is ∼12 % [5], the cluster size can
be fixed to these widths because the loss of charge of 1.27 % due to this cut is negligible.
Increasing the acceptance to |η| < 1.4, which is also used sometimes for the tpc, leads
to a maximum width of the cluster of 210 ns in time direction. With the cluster limited to
5 × 5 bins in this case, already 2.8 % of the charge cannot be taken into account. This
is, however, still small compared to the 12 % energy resolution of the gem system.
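The containment figures for the time direction follow from the same Gaussian argument and can be reproduced as sketched below. The names are illustrative assumptions; the pad-direction containment of 99.84 % is taken as a constant from the text rather than recomputed.

```python
import math

# Values quoted in the text; names are illustrative assumptions.
TIME_BIN_NS = 200.0      # 5 MHz sampling clock
SIGMA_ELE_NS = 80.0      # contribution of the electronics
N_TIME_BINS = 5          # chosen cluster length in time direction
PAD_FRACTION = 0.9984    # worst-case containment in pad direction

HALF_WINDOW_NS = 0.5 * N_TIME_BINS * TIME_BIN_NS

# Equation 5.9: largest sigma_det for which 3 sigma fit into 5 time bins.
sigma_det_limit = math.sqrt((HALF_WINDOW_NS / 3.0) ** 2 - SIGMA_ELE_NS ** 2)


def n_sigma(sigma_det_ns):
    """Equation 5.10: half window in units of the total sigma_L."""
    return HALF_WINDOW_NS / math.hypot(sigma_det_ns, SIGMA_ELE_NS)


def gaussian_fraction(n):
    """Fraction of a Gaussian contained within +/- n sigma."""
    return math.erf(n / math.sqrt(2.0))


time_fraction_180 = gaussian_fraction(n_sigma(180.0))
matrix_fraction_180 = PAD_FRACTION * time_fraction_180
matrix_loss_210 = 1.0 - PAD_FRACTION * gaussian_fraction(n_sigma(210.0))

print(f"sigma_det limit: {sigma_det_limit:.1f} ns")
print(f"180 ns: {n_sigma(180.0):.2f} sigma, "
      f"{100 * time_fraction_180:.2f} % in time, "
      f"{100 * matrix_fraction_180:.2f} % in the 5 x 5 matrix")
print(f"210 ns: {100 * matrix_loss_210:.1f} % lost in the 5 x 5 matrix")
```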
5.3.2 The Concept of the Cluster Finder
The basic approach of the cluster reconstruction is to find a 2-dimensional peak in pad and
time direction in each pad row individually. With this approach, a 3-dimensional cluster
finding is reduced to a 2-dimensional problem, which simplifies the processing. The
charges of the bins surrounding a peak are then used to compute the cluster properties
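[Figure 5.14 diagram: two staggered rows of cf instances, each buffering the time bins tn, tn+1, tn+2, ... of ten neighbouring pads. Pad ranges in the top row: 0 to 9, 12 to 21, 24 to 33, 36 to 45; in the bottom row: 6 to 15, 18 to 27, 30 to 39, 42 to 51.]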
Figure 5.14: General concept of the cluster finding approach. Each of the cf instances
gets only a short segment of the pad row (here ten neighbouring pads) in which they look
for peaks. There are some double pads shown in the top and the bottom row (e.g. pads 6
to 9 for the first cfs in both rows) to compensate for the additional borders between the
individual instances. The orange dashed rectangle marks the allowed positions for a peak
to ensure that a cluster is found only once.
like the total charge, the positions in pad and time direction as well as the widths in
those directions, assuming a Gaussian charge distribution in both pad and time direction.
It was seen before that one key element in the ability to handle these data rates is to
keep the individual data streams separated as much as possible. The same is true for
the clusterisation. So the general idea is to have many small, independent cf instances,
each looking for peaks only in a small segment of the pad row. This basic concept is
shown in figure 5.14. The pad row is subdivided into small pieces. Each instance receives
time bin after time bin of only those pads. They are buffered locally and the individual
instances look independently for peaks in their own region. Since 5 time bins need to be
taken into account for the calculation of the cluster properties, at least this amount of data
needs to be buffered. The Arria10 offers plenty of small memory blocks, the mlabs, which
are excellently suited for this purpose. Each mlab is formed from ten alms, which are
configured as ten 32 × 2 bit memory blocks, giving one 32 × 20 bit simple dual-port sram
block per mlab. Assuming that ten pads are always processed by one cf instance, a
local storage of at least 5 × 10 × 14 bit = 700 bit is needed, which fits into two mlabs.
The subdivision of the pad row into small pieces introduces additional artificial borders
between the individual cfs, which need to be compensated for. One way to deal with that
is to have overlapping pads, i.e. pads which are present in more than one cf. For example,
pads 6 to 9 are present in the top left as well as in the bottom left cf instance
in figure 5.14. However, it must be ensured that a peak is found only once, even if the
charge is present in two cfs. This is achieved by restricting the allowed position of the
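[Figure 5.15 diagram: 3 × 3 grid of bins around the centre (pn, tn). The centre is compared with > to (pn−1, tn) and to the three bins (pn−1, tn−1), (pn, tn−1), (pn+1, tn−1), and with ≥ to (pn+1, tn) and to the three bins (pn−1, tn+1), (pn, tn+1), (pn+1, tn+1).]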
Figure 5.15: The definition of a peak. The value of the centre bin must be larger than
the four bins on the bottom left part, but only larger or equal to the four on the top right
part.
peak within the instances. This region for the peaks is indicated by an orange dashed
rectangle in the figure. The regions do not overlap, so that peaks are not found twice.
The region ends in the first cf (top left) with pad 7 and starts in the second cf (bottom
left) with pad 8 and so on. The number of double pads per cf is given by the cluster size.
As described in the previous subsection, the width of the clusters is fixed to five
pads. With this, it is clear that two additional pads always have to be added on both
sides of the short segment to have the complete charge of a cluster contained in one cf.
So the total width in pad direction needs to be optimised in order not to waste
too many resources on double pads while still having the time to find all clusters. Also
the number of buffered time bins can be used for this optimisation. Since the
processing time is an important restriction, the actual peak finding algorithm needs to be
fixed before starting to optimise the size of the cf instances.
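The segmentation with overlapping pads can be sketched as follows. The function name and its parameters are hypothetical; the width of ten pads and the two additional pads per side are the example values of figure 5.14, and the treatment of the outermost pads of a row is deliberately left out of this sketch.

```python
def cf_segments(n_pads_in_row, width=10, halo=2):
    """Split a pad row into overlapping cf segments.

    Each segment covers `width` pads. Peaks are only accepted in the inner
    region, which keeps a distance of `halo` pads to both segment borders,
    so that neighbouring segments overlap by 2 * halo pads while their
    peak regions do not overlap.
    """
    stride = width - 2 * halo
    segments = []
    first = 0
    while first + width <= n_pads_in_row:
        last = first + width - 1
        segments.append({
            "pads": (first, last),
            "peak_region": (first + halo, last - halo),
        })
        first += stride
    return segments


for segment in cf_segments(52):
    print(segment)
```

For a hypothetical row of 52 pads this gives eight segments, starting with pads 0 to 9 (peaks allowed on pads 2 to 7) and pads 6 to 15 (peaks allowed on pads 8 to 13).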
5.3.3 The Peak Finding
To find a peak, one first has to define what needs to be found. The most basic definition
of a peak would be that the adc value is greater than the values of all eight neighbouring
bins. This definition has an obvious problem: if the values of two adjacent bins are the same
and higher than those of the other surrounding bins, then the peak is not
found because neither of the two maximum values is higher than all the others. This can
be avoided by introducing an asymmetry. In one dimension, a peak would then be defined
as greater than the left pad and greater than or equal to the pad on the right side. If this logic
is applied to two dimensions, the pattern shown in figure 5.15 results. So a peak is found
at the central pad if the value is greater than that of the pad on the left side and greater
than the values of all three pads of interest in the previous time bin, and greater than or equal to
the value of the pad on the right side and the values of all three pads
of interest in the following time bin. How exactly the pattern is arranged does not really
matter, but it must be symmetric to achieve a constant bias in one direction, which could
be corrected later if needed. It would also be possible to mirror the pattern for
every second row so that the bias is automatically removed by combining the information
of multiple rows. At a later stage, these comparison operations could also be adapted in a
way that not the adc values themselves are compared, but rather their differences must exceed some
threshold to suppress noise clusters.
To actually find such peaks, several algorithms were under consideration. The basic
principle was always the same: the data previously stored in the local ram is
read back value by value and searched for peaks, but the ordering differs between the
algorithms. One option was an adaptation of the widely known divide-and-conquer
approach [58]. Actually, the whole concept as shown in figure 5.14 is already a kind of
divide-and-conquer approach, just at a higher level. The basic idea of the algorithm is to
divide the problem into smaller instances of the same problem. These are then conquered
by solving the subproblems recursively, subdividing further if necessary, until the problem
is a very basic one and can easily be solved. The results of the subproblems need to be
combined for the solution of the original problem. So this approach seems to fit very well
to the problem of finding all clusters in an arbitrarily sized 2-dimensional plane. The plane
is simply divided further into smaller areas until one reaches the smallest needed size of
3 × 3 to determine whether a peak is present or not. It turned out that applying this type
of algorithm does not give any advantage in terms of processing time compared to simply
going through the storage and checking pad after pad, at least not if the latter is done
in an optimised way by buffering some pads temporarily. Since all clusters that could be
in memory have to be found, there is no other way than to really look at all the pads,
independent of the algorithm.
It turned out that the simplest solution is the best. The data is written pad
after pad into the storage anyway because the mlab is only a simple dual-port ram. So by adding
a shift register of appropriate size in front of the ram, peaks can be found already before
storing the data into the ram. Since the pads always arrive in the same order, always the
same elements of the shift register need to be compared with each other to achieve the
pattern shown in figure 5.15. Furthermore, there are two more reasons in favour
of doing the peak finding in this way, already before storing the values. First, storing a
peak flag together with the data, with which the peak can be identified after reading it back,
does not increase the resource usage. The mlabs are 32 × 20 bit rams in which
14 bit values have to be stored. So there are 6 bit available for each value anyway to store
additional information along with the adc value, of which one bit can be used to identify
a peak. Second, by doing the needed comparisons sequentially, the number of needed
comparators can be reduced from eight to four while still comparing each pad with all its
eight neighbouring bins. This is the case because
a ≥ b ≡ !(b > a), (5.11)
which can easily be proven by setting up a truth table for all possible relations between
a and b. Coming back to the peak definition shown in figure 5.15: instead of evaluating
e.g. the relation adc(pn, tn) ≥ adc(pn−1, tn+1), one can wait until the bin
(pn−1, tn+1) has been shifted by enough positions to become the new centre and use the inverse
of the relation adc(pn, tn) > adc(pn+1, tn−1), which is evaluated anyway, as the result
for the relation between those two bins. With this, all the ≥-operators can be replaced by
the corresponding >-operators combined with an inverter. So the complete peak finding
algorithm is reduced to a shift register and four comparators. This reduces the complexity
| bit | content |
|---|---|
| [13:0] | adc value |
| 14 | peak: maximum in all directions and value above peak threshold |
| 15 | adc value above contribution threshold |
| 16 | minimum in pad direction |
| 17 | minimum in time direction |
| 18 | minimum in diagonal (pn−1, tn+1) ⇔ (pn+1, tn−1) |
| 19 | minimum in diagonal (pn−1, tn−1) ⇔ (pn+1, tn+1) |
Table 5.4: Content of the data word stored in the cf memory. The 14 lsb are the
original adc value, the 6 msb contain flags for the bin characterisations.
of the system significantly because all values are just shifted through the registers in
the order they arrive and no intelligent ordering of read addresses needs to be developed to
minimise the overall processing time while still ensuring that all clusters are found.
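The reduction from eight to four comparators can be illustrated with a small software model. It is a sketch only: the shift register is not modelled, the names are hypothetical, and the adc values are assumed to be stored as adc[time bin][pad]. The first function evaluates all eight relations of figure 5.15 directly, the second one evaluates only the four >-relations per bin and obtains the ≥-relations by inverting the results of the neighbouring bins according to equation 5.11.

```python
# Offsets (dpad, dtime) of the four neighbours that must be strictly smaller:
# the left pad and the three pads of the previous time bin.
STRICT = [(-1, 0), (-1, -1), (0, -1), (1, -1)]


def is_peak_direct(adc, t, p):
    """Evaluate all eight relations of the peak definition directly."""
    centre = adc[t][p]
    strictly_greater = all(centre > adc[t + dt][p + dp] for dp, dt in STRICT)
    # The four remaining neighbours sit at the mirrored offsets (-dp, -dt).
    greater_or_equal = all(centre >= adc[t - dt][p - dp] for dp, dt in STRICT)
    return strictly_greater and greater_or_equal


def find_peaks_four_comparators(adc):
    """Find all peaks using only four '>' comparisons per bin."""
    n_t, n_p = len(adc), len(adc[0])
    # gt[(t, p, k)] is True if bin (t, p) is greater than its k-th
    # STRICT neighbour. These are the only comparisons ever evaluated.
    gt = {}
    for t in range(n_t):
        for p in range(n_p):
            for k, (dp, dt) in enumerate(STRICT):
                tt, pp = t + dt, p + dp
                if 0 <= tt < n_t and 0 <= pp < n_p:
                    gt[(t, p, k)] = adc[t][p] > adc[tt][pp]
    peaks = []
    for t in range(1, n_t - 1):
        for p in range(1, n_p - 1):
            strict = all(gt[(t, p, k)] for k in range(4))
            # a >= b is the inverse of b > a, which the mirrored neighbour
            # evaluates for its own k-th STRICT relation.
            loose = all(not gt[(t - dt, p - dp, k)]
                        for k, (dp, dt) in enumerate(STRICT))
            if strict and loose:
                peaks.append((t, p))
    return peaks


# Two adjacent bins with the same maximum value: exactly one peak is found.
GRID = [
    [0, 0, 0, 0, 0],
    [0, 5, 5, 0, 0],
    [0, 0, 0, 0, 0],
]
print(find_peaks_four_comparators(GRID))
```

In the example grid, the left one of the two equal bins is reported as the peak, which is the constant bias in one direction mentioned in the text.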
In addition, this approach also opens up the possibility of directly recording all minima
together with the peak finding without additional latency. For a maximum, the only
important information is whether the current centre bin is a peak or not. Thus, the
information whether all relations of the peak definition are true and whether the adc value
is above a configurable threshold can be combined into a single bit. The information
about a minimum must be propagated in a more differentiated way. First, the detection
is analogous to the detection of a peak in each direction, but instead of requiring that the
relation is true both times, it must be false both times. In pad direction, for example, the adc
value of the centre bin must be less than or equal to the value of the left pad and less than the value
on the right side. Again, this evaluation is done for the peak finding anyway; the result is
just reused in a different way. The difference compared to the peak detection is that
the results for the individual directions cannot be combined because of the later use case:
the minimum information can be used at a later stage to split nearby clusters. If two clusters are close to each
other, their charge distributions will overlap and one of the pads will be a minimum, but
not in all directions. With the definition in figure 5.15, four directions can be distinguished:
pad, time and the two diagonals. The minimum flags for all four directions need to
be propagated along with the adc value to be able to divide the charge correctly between
the two clusters if wanted. In total, five of the six available bits are used for the peak and
minimum flags. The last one is used to indicate whether the charge of the pad is
above another configurable threshold, the contribution threshold. The charge distribution
of a cluster is to first order Gaussian in pad and time direction [5]. Assuming such
a distribution, it is clear that if the charge of the bin next to the central bin is below some
small threshold, then the charge of the bin even further away should be ignored: the
charge contribution from the Gaussian distribution will be negligible and only noise would
be added. Storing this flag avoids another comparison operation later.
The total content of the 20 bit word, which is written to the local memory of the cf, is
given in table 5.4. With that, it can later be checked with a single bit comparison whether
the adc value currently read is a peak centre, a minimum in some direction or should be
excluded from the calculation of the cluster properties.
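The layout of table 5.4 can be written down as a small packing routine. The bit positions are those of the table; the function and constant names are hypothetical and only serve as an illustration of how a single bit test identifies a peak after reading the word back.

```python
# Bit layout of the 20 bit word from table 5.4; names are illustrative.
ADC_MASK = 0x3FFF         # bits [13:0], 10.4 fp adc value
BIT_PEAK = 14
BIT_CONTRIBUTION = 15
BIT_MIN_PAD = 16
BIT_MIN_TIME = 17
BIT_MIN_DIAG_A = 18       # diagonal (pn-1, tn+1) <-> (pn+1, tn-1)
BIT_MIN_DIAG_B = 19       # diagonal (pn-1, tn-1) <-> (pn+1, tn+1)


def pack_word(adc, peak=False, above_contribution=False, min_pad=False,
              min_time=False, min_diag_a=False, min_diag_b=False):
    """Combine the 14 bit adc value and the six flags into one 20 bit word."""
    if not 0 <= adc <= ADC_MASK:
        raise ValueError("adc value does not fit into 14 bit")
    word = adc
    flags = (
        (BIT_PEAK, peak),
        (BIT_CONTRIBUTION, above_contribution),
        (BIT_MIN_PAD, min_pad),
        (BIT_MIN_TIME, min_time),
        (BIT_MIN_DIAG_A, min_diag_a),
        (BIT_MIN_DIAG_B, min_diag_b),
    )
    for bit, flag in flags:
        word |= int(bool(flag)) << bit
    return word


def is_peak(word):
    """Single bit test on the stored word."""
    return bool((word >> BIT_PEAK) & 1)


def adc_value(word):
    """Extract the 14 bit adc value from the stored word."""
    return word & ADC_MASK


word = pack_word(0x1234, peak=True, min_time=True)
print(hex(word), is_peak(word), hex(adc_value(word)))
```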
[Figure 5.16 panels, each showing a grid of pads p0, p1, p2, ... versus time bins tn, tn+1, tn+2, ...: (a) With 5 time bins. (b) With 6 time bins. (c) With 7 time bins. (d) With 8 time bins.]
Figure 5.16: Maximum number of peaks within a cf instance for different number of
time bins and always ten pads.
5.3.4 Width of the Cluster Finder Instances
It was already mentioned that the optimal size of a single cf instance still needs to be
defined. The number of pads processed by one instance and the number of buffered time
bins need to be chosen such that not too many resources are used (the cf grid will be one of the
biggest components of the cru fw) and the input data can be processed continuously.
Due to the continuous readout, it is not possible to implement something like a busy
mechanism and interrupt the data taking for a short time to complete the processing of
the current chunk of data. The time bins, arriving at a frequency of 5 MHz, need to be
processed in time. With a baseline clock of 240 MHz, 240 MHz / 5 MHz = 48 ccs are available to
complete each time bin.
Since the peak finding itself is done before buffering the data, the most time consuming
part is to read all the 25 bins of the 5 × 5 cluster matrix after a peak is found. Simply reading
all the values of one cluster from the ram takes 25 out of the 48 ccs, more than half of the
available time. If this takes that much time, then the first question which arises is: how
many clusters can occur in the region of interest? To answer this question, some examples
with different cf sizes are shown in figure 5.16. The number of pads is kept constant at
ten, for now without a particular reason, but the number of time bins increases from five to
eight. The inner region in which a peak is allowed to be detected is again shown by an
orange dashed rectangle and always has a distance of two bins to the borders. First, the
focus is on the example with five time bins in figure 5.16a. Five is the number of time
Chapter 5 – A 2D Cluster Finder for the TPC
bins needed to complete a cluster of this size, so at least this amount needs to be buffered.
Finding a peak means that the adc value of one pad is higher than the value of all the
surrounding ones (according to the definition in figure 5.15). Therefore, there must be at
least one pad with a lower adc value in-between two peaks. This means that in the shown
case of five time bins and ten pads, at most three peaks can be found in the inner region
with a length of six pads. Example positions are shown by red rectangles in the figure. To
read those clusters, 3 · 25 ccs = 75 ccs would be needed in a simple implementation where
always all corresponding pads are read from the ram. This is far beyond the available
48 ccs, so this combination of ten pads and five time bins does not work.
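The cycle budget of this example can be reproduced with a short calculation. The following Python sketch is only an illustration and not part of the firmware: the function names are hypothetical, and the peak limit assumes that two peaks never touch, not even diagonally, which reproduces the three peaks quoted above for ten pads and five time bins.

```python
import math

CLOCK_MHZ = 240             # baseline clock (from the text)
SAMPLING_MHZ = 5            # rate at which time bins arrive (from the text)
BORDER = 2                  # bins between the inner region and the CF border
READ_CCS_PER_CLUSTER = 25   # one cc per bin of the 5x5 cluster matrix


def ccs_per_time_bin():
    """Clock cycles available until the next time bin arrives."""
    return CLOCK_MHZ // SAMPLING_MHZ


def max_peaks(pads, time_bins):
    """Upper limit of peaks in the inner region.

    Assumption: peaks never touch, so at most every second bin in each
    direction can hold a peak.
    """
    inner_pads = pads - 2 * BORDER
    inner_time = time_bins - 2 * BORDER
    if inner_pads < 1 or inner_time < 1:
        return 0
    return math.ceil(inner_pads / 2) * math.ceil(inner_time / 2)


# Ten pads and five time bins, as in figure 5.16a
needed = max_peaks(10, 5) * READ_CCS_PER_CLUSTER
print(ccs_per_time_bin(), max_peaks(10, 5), needed)
```

The number of ccs needed exceeds the budget of one time bin, which is the reason why this size is discarded.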
Adding one additional time bin, as it is done in figure 5.16b, does not change the number
of possible peaks within the inner region, but the available processing time is doubled.
Instead of 48 ccs there are 96 ccs available because always two time bins are processed at
once, and the array shifted afterwards by two time bins. So by adding another time bin, the
processing of this amount of pads becomes possible all at once. Only after adding a seventh
time bin (figure 5.16c), the number of possible peaks within the region is increased by a
factor of two. Adding an eighth time bin (figure 5.16d) does again not change the maximum
number of peaks. This can be generalised: it is preferable to have an even number of time
bins processed (and stored) because then the number of peaks (and with that the biggest
contribution to the overall processing time) stays the same compared to one time bin less,
but the available number of ccs is increased. Already from this simple consideration it can
be concluded that the configuration with six time bins could be an optimal case. It requires
the smallest amount of buffered data, and with that the smallest resource consumption, while
having the advantage of an increased processing time available. The same considerations
can also be made in pad direction with the conclusion that an even number of pads
is preferable there as well. A complete analysis of all possible size combinations is shown in figure 5.17.
For each combination, the number of ccs needed to read out the maximum number of clusters
was divided by the available processing time. The first one is just the number of possible
peaks multiplied by 25 ccs, 1 cc to read each bin, and the second one is the number of time
bins within the central region multiplied by 48 ccs. Any combination that would be
realisable (with a ratio of ≤ 1) was filled in the histogram. The steps of two which were
discussed before can clearly be seen, both in time and pad direction. Three different
combinations could be optimal (having the same number of pads but just more time bins is
definitely worse): 5 time bins with 6 pads, 6 time bins with 10 pads and 7 time bins with
8 pads. Further increasing the number of time bins does not seem to improve the situation.
The cf cannot be made wider in pad direction, only more resources are needed to buffer
more time bins. With this argument, the last combination can also be ruled out as the
optimal one. Compared to the combination with 6 time bins, the cf is shorter in pad
direction. This is the reason why more cf instances are needed to cover all 138 pads of a
row (⌈138/(8 − 4)⌉ = 35§ compared to ⌈138/(10 − 4)⌉ = 23) and in addition one more time
bin needs to be buffered. Having the cf shorter in pad direction not only increases the
number of cf instances (the resources needed per instance are at this point still to be seen),
but also the number of doubled pads. For the last combination, 7 time bins of
35 × 8 pads = 280 pads need to be stored, whereas for the second combination only 6 time
bins of 23 × 10 pads = 230 pads are needed, which is a clear advantage. Doing the same
estimation for the first possible combination yields 138/(6 − 4) = 69 cf instances needed
to cover a full row and 5 time bins of 69 × 6 pads = 414 pads which need to be buffered.
Multiplying these numbers with the width of the adc value plus the needed flags of 20 bit,
one finds that 5 × 414 × 20 bit = 41.4 kbit need to be buffered for the first combination in
total and 6 × 230 × 20 bit = 27.6 kbit for the second. To conclude, the second combination
with 6 time bins and 10 pads is better in both the number of cf instances and the size of
the needed local storage, making this the optimal size.
§Those brackets symbolise the ceil-function, ceil(3.3) = ⌈3.3⌉ = 4.
[Figure 5.17: two-dimensional histogram; horizontal axis: time bins per CF, vertical axis: pads per CF, colour scale: used fraction of available CCs.]
Figure 5.17: The colour code shows the used fraction of available ccs as a function of
pads and time bins per cf. Only the possible combinations with fractions of ≤ 1 are filled.
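The comparison of the three candidate sizes can be retraced numerically. The sketch below recomputes the used fraction of ccs, the number of cf instances per row and the buffer size. The helper names are hypothetical, and the peak limit assumes that peaks never touch each other (not even diagonally), which matches the peak counts quoted in the text.

```python
import math

BORDER = 2                  # bins between the inner region and the CF border
CCS_PER_TIME_BIN = 48       # 240 MHz / 5 MHz
READ_CCS_PER_CLUSTER = 25   # one cc per bin of the 5x5 cluster matrix
PADS_PER_ROW = 138          # longest pad row
WORD_BITS = 20              # ADC value plus flags


def used_fraction(pads, time_bins):
    """ccs needed for the maximum number of clusters over ccs available."""
    inner_pads = pads - 2 * BORDER
    inner_time = time_bins - 2 * BORDER
    peaks = math.ceil(inner_pads / 2) * math.ceil(inner_time / 2)
    return peaks * READ_CCS_PER_CLUSTER / (inner_time * CCS_PER_TIME_BIN)


def instances_per_row(pads):
    """CF instances needed so that the inner regions cover a full row."""
    return math.ceil(PADS_PER_ROW / (pads - 2 * BORDER))


def buffer_bits(pads, time_bins):
    """Storage for all CF instances of one row, in bit."""
    return time_bins * instances_per_row(pads) * pads * WORD_BITS


for time_bins, pads in [(5, 6), (6, 10), (7, 8)]:
    print(time_bins, pads,
          round(used_fraction(pads, time_bins), 2),
          instances_per_row(pads),
          buffer_bits(pads, time_bins))
```

With these assumptions, ten pads are only realisable with an even number of buffered time bins in the range shown here, since `used_fraction(10, 5)` and `used_fraction(10, 7)` both exceed 1.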
5.3.5 The Cluster Finder Module
Now that it is clear how the cf must be dimensioned and by which method the peak
finding must be done, the actual implementation can be discussed. The central
element of the cf is the ram to store the data over 6 time bins. In front of it, the
peak finder is located, consisting of a shift register and the necessary comparators, which adds
the additional flags given in table 5.4 to the data word. This arrangement is schematically
shown in the block diagram in figure 5.18. A write controller takes care of the correct
shifting of the individual pads and calculates the write addresses for the ram. Since ten
pads need to be stored, it fits very well to use the decimal system for the read addresses.
Then, the ones digit gives the pad position within the short row segment and the tens digit
counts the time. To avoid unnecessary copy operations at the start of a new time bin, the
ram is used as a ring-buffer where the data is continuously written to, independently of
the read process. To be able to decouple the write from the read processes, there must be
enough space for two more time bins in the memory to always ensure a valid content in the
memory. The two additional time bins are simply needed because the read controller has
to wait until two complete time bins were written to the ram before it can start to process
those. Since the continuous input stream does not stop while the reading is performed, the
storage for two additional time bins is needed so that they can be written somewhere. The
total amount of needed addresses in the storage therefore sums up to 80 which fits into
three mlabs, each providing 32 addresses, instead of the previously mentioned two.
[Figure 5.18: block diagram with the blocks write controller, peak finder, RAM and read controller. Labelled signals: ADC [5:0][13:0], valid [1:0], write address, read address, valid and seq. cluster data [13:0]; bus widths of 14 and 20 are marked.]
Figure 5.18: Block diagram of the cf module. A write controller takes care of the
correct shifting in the peak finder and calculates the write addresses for the ram. The read
controller checks for peaks and reads all corresponding bins of a cluster from the storage.
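The decimal addressing of the ring-buffer can be illustrated with a small software model. This is a sketch under the assumptions stated in the comments, not the firmware implementation; the class and function names are hypothetical.

```python
PADS = 10              # pads per CF instance
PROCESSED_BINS = 6     # time bins needed for the processing
EXTRA_BINS = 2         # time bins written while the reader is busy
SLOTS = PROCESSED_BINS + EXTRA_BINS
DEPTH = SLOTS * PADS   # 80 addresses in total
MLAB_DEPTH = 32        # addresses per MLAB (from the text)


def address(time_slot, pad):
    """Tens digit counts the time slot, ones digit the pad."""
    return (time_slot % SLOTS) * PADS + pad


def mlabs_needed(depth=DEPTH):
    """Number of MLABs required for the given depth (ceiling division)."""
    return -(-depth // MLAB_DEPTH)


class RingBuffer:
    """Software model of the storage; the write side never stops."""

    def __init__(self):
        self.mem = [0] * DEPTH
        self.write_slot = 0

    def write_time_bin(self, words):
        """Store one time bin (ten words) and advance to the next slot."""
        if len(words) != PADS:
            raise ValueError("one word per pad expected")
        for pad, word in enumerate(words):
            self.mem[address(self.write_slot, pad)] = word
        self.write_slot = (self.write_slot + 1) % SLOTS
```

After eight time bins the write pointer wraps around and overwrites the oldest slot, so no data has to be copied at the start of a new time bin.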
The read controller, on the other hand, takes care of the correct reading procedure. It is
implemented as an fsm and ensures that all peaks are found within the inner region of the
cf and that all bins belonging to the cluster are read in a predefined order. For this, the
addresses of the pads within the inner region are applied to the read address port of the
ram in the order shown in figure 5.19a. This zigzag pattern, starting from the bottom left
and going to the top right, was chosen for optimisation reasons. The calculation of the
address is not more complex than for other patterns because an adder needs to be involved
anyhow. In case a peak is found, two to three pads can be skipped for the search process
of the next one (depending on the time bin in which the peak is found), simply because
the definition of a peak implies that two peaks can never be at two adjacent pads. If the pattern
were simpler, like always reading one complete time bin after the other, only one pad
could be skipped or an additional logic group would be needed to keep track of the already
found peaks to omit the pads in the next time bin which were already part of a peak.
After a peak was found, the remaining 24 pads need to be read from the memory and
forwarded to the next module, the cp, to compute the cluster properties there. This order
needs to be defined and must always be the same so that the cp knows which pad arrives
when, relative to the centre. Since the output of the ram is sequential, this is also kept for
the output port of the cf, one 14 bit word after the other. The additional 6 bit of the peak
and minimum flags are not needed in the cp, since the corresponding modifications of the
adc values, like setting the bin to zero if it should be ignored, are implemented in the read
controller. This simplifies the implementation of the cp later on. The order in which the
pads are read from the memory is illustrated in figure 5.19b. This order is based on two
considerations:
1. Reading from the inside out. The inner pad (yellow) is always read first. Thus, if
this pad does not have the flag above contribution threshold set, the outer pad/s (green)
is/are set to zero instead of the actual adc value. Two cases must be distinguished
for this operation: the vertical and horizontal cross, in which the inner pad
has just one corresponding outer pad, and the four corners, in which the inner pad
has three corresponding outer pads.
2. Have as few time bin crossings as possible while reading the four corners. It must be
ensured for each time bin crossing that the new address is still valid, i.e. between 0 and 79.
If adding or subtracting ten for the time bin crossing leads to an invalid address, it
must be mapped back into this range. Since the memory is used as a ring-buffer, such
cases will occur. This is not an issue for the previously discussed zigzag pattern. The
start address is set to an even time bin address at the very beginning. Afterwards,
there are only four possible cases for the start address at the bottom left: 2, 22, 42
or 62. In each of these options, the second time bin (address +10) is always within
the valid address range.
[Figure 5.19, panels (a) and (b): grids of pads p0 to p9 against time bins tn, tn+1, tn+2, ...; panel (b) labels the 25 bins of a cluster with their reading order 0 to 24.]
(a) The central peaks are found in this order. Going from the bottom left in a zigzag
pattern to the top right.
(b) The data of a peak is read (and forwarded to the cp) in this order to exclude
the outer (green) bins if the inner one (yellow) is below the contribution threshold.
Figure 5.19: Reading order of the cf. The left panel shows the order in which
peaks are searched for, while the right panel shows the order in which the individual bins of a
cluster are read and forwarded.
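The two considerations can be sketched in a few lines of Python. The wrap-around of the read address and the zeroing of the outer bins are modelled here with hypothetical helper functions; the exact reading order of figure 5.19b is not reproduced.

```python
PADS = 10    # pads per time bin, i.e. the address step of a time bin crossing
DEPTH = 80   # eight time slots of ten pads each


def cross_time_bin(addr, steps):
    """Move `steps` time bins up (+) or down (-) and map back into 0..79."""
    return (addr + steps * PADS) % DEPTH


def start_addresses():
    """Bottom-left start of the peak search (pad 2) in every even time slot."""
    return [slot * PADS + 2 for slot in range(0, DEPTH // PADS, 2)]


def gate_outer(inner_above_threshold, outer_values):
    """Outer bins are forwarded only if the inner bin passed the threshold.

    The list holds one value for the cross and three values for a corner.
    """
    if inner_above_threshold:
        return list(outer_values)
    return [0] * len(outer_values)


print(start_addresses())
print(cross_time_bin(75, +1), cross_time_bin(3, -1))
```

For all four start addresses the second time bin is reached by adding ten without leaving the valid range, which is why the zigzag search itself needs no wrap-around logic.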
To have all information of a cluster combined, four more items need to be attached to
the sequential cluster data stream: the pad and time position of the central pad, the
row number of this cf and some additional flags. The flags are not used at the
moment but can be utilised at a later time, for example to flag clusters where the charge was
split between two nearby ones. The pad position of the peak is an 8 bit number to cover the
full range of 138 pads and is the sum of an externally given pad offset of the cf instance and
the pad within the memory. The time is counted internally in the cf whenever a new time
bin is added and needs to be a 9 bit number. This time is measured within the so-called
Heart Beat (hb) frame which has a length of 89.4 µs. With the sampling rate of 5 MHz,
this hb frame consists of 447 time bins. The externally set row number needs to cover the
range from 0 to 17 and is therefore a 5 bit number. Finally, the flags are set to 8 bit due
to the cdf which is described in subsection 5.3.7. This sums up to 30 additional bits, or
three additional 14 bit data words which are inserted in the output stream. The sequence
is given in table 5.5. The row number is sent as the second item after the adc value of
the peak because it is always available (it is supposed to be a constant after the initial
configuration of the cru) and can therefore be used to utilise this one cc. After the peak
is detected, the correct read address for the pad marked with a 1 in figure 5.19b needs to
be applied to the ram. Until the value is returned, one cc of latency has passed which
is not wasted in this way. The pad position of the peak needs to be calculated first from
the read address and is therefore available only at a later stage. Also, the flags are filled
only while reading the content of the cluster. Those are appended together with the time
information at the very end.
item  content
0     adc value 0
1     H0 ("000000000" & row[4:0])
2     adc value 1
3     adc value 2
...   ...
24    adc value 23
25    adc value 24
26    H1 (flags[7:4] & "00" & pad peak[7:0])
27    H2 (flags[3:0] & "0" & time peak[8:0])
Table 5.5: Sequence of the cf output data stream. In addition to the adc values of the
individual pads, the row number, the pad and time position of the central peak
and additional flags are inserted as header words H0 to H2.
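The layout of the three header words can be made explicit with a small packing routine. The following Python sketch follows the bit concatenations of table 5.5; the function names and the default of zero for the unused flags are assumptions made for the illustration.

```python
def header_words(row, pad_peak, time_peak, flags):
    """Return H0, H1 and H2 as 14 bit integers following table 5.5."""
    if not (0 <= row < 2**5 and 0 <= pad_peak < 2**8
            and 0 <= time_peak < 2**9 and 0 <= flags < 2**8):
        raise ValueError("field does not fit its bit width")
    h0 = row                                        # "000000000" & row[4:0]
    h1 = (((flags >> 4) & 0xF) << 10) | pad_peak    # flags[7:4] & "00" & pad_peak[7:0]
    h2 = ((flags & 0xF) << 10) | time_peak          # flags[3:0] & "0" & time_peak[8:0]
    return h0, h1, h2


def cf_output_stream(adc_values, row, pad_peak, time_peak, flags=0):
    """Build the 28 word sequence: peak, H0, remaining 24 bins, H1, H2."""
    if len(adc_values) != 25:
        raise ValueError("a cluster consists of 25 ADC values")
    h0, h1, h2 = header_words(row, pad_peak, time_peak, flags)
    return [adc_values[0], h0, *adc_values[1:], h1, h2]


stream = cf_output_stream(list(range(100, 125)), row=17, pad_peak=137, time_peak=446)
print(len(stream), stream[:3], stream[-2:])
```

The 30 header bits occupy three 14 bit words, so twelve bits remain as zero padding.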
5.3.6 Optimising the Cluster Finder Grid
Before continuing with the processing of the cluster data, there is another optimisation to
be done. It was seen that 23 cf instances are needed to cover all 138 pads of a row with
maximum length and the individual regions of a sector have up to 18 rows. Multiplying
those two numbers would indicate that 414 cf instances are needed to cover the maximum
size area. However, it is possible to improve the numbers since not every region has
those 18 rows and not every row — not even in the outermost region — has really the
maximum number of 138 pads. All the cf instances which are actually needed are sketched
in figure 5.20. The 18 rows are shown vertically and the 23 cf instances horizontally as
grey boxes with a width of six pads corresponding to the number of pads within the inner
region of each instance. The maximum pad number which is shown was found by going
through all the ten regions and taking the maximum number of pads for each individual
row. Since clearly only those pads need to be covered, it would be a waste of resources
to also instantiate cf instances for the remaining 38 white boxes, leaving only the 376
coloured cfs. However, there is still room for improvement. Those cfs are not needed all
at the same time. Having a look at the pad planes of the regions individually (figure B.1 to
figure B.4), one finds that there are some with many rows and few pads per row and others
with a smaller number of rows but in turn many pads per row. For example, region 4 has
all the 18 rows but only a maximum of 84 pads in those rows, in contrast to region 9 with
only 12 rows but up to 138 pads. Those coloured boxes are not all needed in a
single region. Therefore, one can try to instantiate a cf only once in the fw and use
it in differently configured regions for different areas of the pad plane. Requiring that one
instance is used for at most two different positions (to keep the combinatorics as small
as possible), the pattern shown in figure 5.20 can be achieved. The 101 green marked
instances can safely be reused in the yellow positions without influencing the functionality.
A green position and its yellow counterpart are never both used within a single region,
while the blue ones are needed in most of the regions. The exact mapping between the
green and the yellow boxes, depending on the individual region, is not shown for better
visibility.
[Figure 5.20: grid of cf instances; vertical axis: row (0 to 17), horizontal axis: CF (0 to 22). The pad labels in the figure give the maximum number of pads per row, for rows 0 to 17: 128, 130, 130, 132, 132, 132, 134, 134, 136, 136, 138, 138, 128, 118, 104, 106, 84, 84.]
Figure 5.20: The cf grid covering all required pads. Each rectangle represents the
inner region of a single cf instance. The 38 white instances are not needed since there is no
region in which the corresponding row has this number of pads. In those regions where the
green marked fields are needed, the yellow ones are not and vice versa. So the 101 green
marked cfs can be reused at the positions which are marked in yellow.
With those two optimisation steps, the number of instances needed can be reduced by
more than 33 %, from 414 to only 275 while requiring that a cf has at most two different
data sources. This number of 275 is only 3 cfs more than the absolute minimum of 272 cfs
which are needed in regions 6 and 7 to cover all pads available there.
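The instance counts of the two optimisation steps can be cross-checked with a short calculation. In the sketch below, the per-row maximum pad counts are read off the pad labels of figure 5.20 and the number of reused instances is taken from the text; the variable names are hypothetical.

```python
import math

INNER_PADS = 6   # pads covered by the inner region of one CF instance
LONGEST_ROW = 138

# Maximum number of pads per row over all regions, rows 0 to 17
# (read off the pad labels of figure 5.20)
MAX_PADS_PER_ROW = [128, 130, 130, 132, 132, 132, 134, 134, 136,
                    136, 138, 138, 128, 118, 104, 106, 84, 84]

REUSED = 101     # green instances which also serve a yellow position

# Step 0: full rectangular grid of 18 rows times 23 instances
full_grid = len(MAX_PADS_PER_ROW) * math.ceil(LONGEST_ROW / INNER_PADS)

# Step 1: only cover the pads which exist in at least one region
needed = sum(math.ceil(pads / INNER_PADS) for pads in MAX_PADS_PER_ROW)

# Step 2: share one instance between two positions never used together
instantiated = needed - REUSED

saving = 1 - instantiated / full_grid
print(full_grid, needed, instantiated, round(100 * saving, 1))
```

The difference between the full grid and the first step corresponds to the white boxes of the figure.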
5.3.7 Calculating the Cluster Properties
Five quantities need to be calculated for each cluster. First, the total charge
\begin{equation}
q_\mathrm{tot} = \sum_{i=0}^{24} q_i \tag{5.12}
\end{equation}
which is the sum over all adc values of the 5 × 5 cluster matrix. Since the total charge
depends on the energy deposit of the original particle, which is described by the Bethe
formula [27], it can be used for Particle Identification (pid). In the following, all summations
over the index i are done from i = 0 to i = 24. The four further quantities describe the
position of the cluster. The actual position is given by the so-called centre
of gravity in pad and time direction which is the weighted mean. It is calculated in the
following way:
\begin{equation}
\mu_x = \frac{\sum_i q_i \cdot x_i}{q_\mathrm{tot}}, \tag{5.13}
\end{equation}
where xi stands for either pi or ti, the individual pad numbers and time bins of the charges
qi. The resolution of the position, given in units of pads and time bins, can be expressed by
the standard deviation of the weighted mean:
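\begin{align}
\sigma_x^2 &= \frac{\sum_i q_i \cdot (x_i - \mu_x)^2}{q_\mathrm{tot}} \tag{5.14} \\
           &= \frac{\sum_i q_i \cdot x_i^2}{q_\mathrm{tot}} - \mu_x^2. \tag{5.15}
\end{align}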
The second form is easily derived by expanding the square in the first equation or by
directly using the identity Var(X) = E[(X − E(X))²] = E(X²) − [E(X)]² [59]. Equation
5.15 is better suited for an implementation in an fpga because the mean value needs to
be known only at the very end and not in each summand. In general, those sums are
very well suited for an implementation in an fpga because the accumulation can be split
across several ccs. Although the summations are well suited, the multiplications, squares
and divisions of equation 5.13 and 5.15 are not. Since each arithmetic operation must be
reduced to basic bit operations, more complex calculations require a lot of resources for
the implementation and time (in form of ccs) for the execution, especially when it comes
to numbers with many bits. By expressing the position relative to the central peak, the
calculation of the mean value can also be done in the following way:
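\begin{align}
\mu_x &= \frac{\sum_i q_i \cdot x_i}{q_\mathrm{tot}} \tag{5.16} \\
      &= \frac{\sum_i q_i \cdot (x + \delta x_i)}{q_\mathrm{tot}} \tag{5.17} \\
      &= x + \frac{\sum_i q_i \cdot \delta x_i}{q_\mathrm{tot}}, \tag{5.18}
\end{align}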
where x is the pad or time position of the central bin and δxi ∈ {±2, ±1, 0} is the distance
between the peak and the individual charges. This reduces the multiplication to a bit-shift
operation by one in case of a multiplication by two. The sign can be implemented either
by calculating the two’s complement of the possibly shifted charge or by using an adder
which can also be configured as a subtractor for the calculation. An additional benefit
is that the number of summands also decreases since five of them are multiplied
by zero after expressing the position relative to the peak. The same can also be
applied to the calculation of the resolution:
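\begin{align}
\sigma_x^2 &= \frac{\sum_i q_i \cdot (x + \delta x_i)^2}{q_\mathrm{tot}} - \mu_x^2 \tag{5.19} \\
           &= \frac{\sum_i q_i \cdot \left( x^2 + 2x \cdot \delta x_i + (\delta x_i)^2 \right)}{q_\mathrm{tot}} - \mu_x^2 \tag{5.20} \\
           &= x^2 + 2x \cdot \frac{\sum_i q_i \cdot \delta x_i}{q_\mathrm{tot}} + \frac{\sum_i q_i \cdot (\delta x_i)^2}{q_\mathrm{tot}} - \mu_x^2. \tag{5.21}
\end{align}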
By using equation 5.18, the calculation can further be reduced to:
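\begin{align}
\sigma_x^2 &= x^2 + 2x \cdot (\mu_x - x) + \frac{\sum_i q_i \cdot (\delta x_i)^2}{q_\mathrm{tot}} - \mu_x^2 \tag{5.22} \\
           &= x^2 - 2x^2 + 2x\mu_x - \mu_x^2 + \frac{\sum_i q_i \cdot (\delta x_i)^2}{q_\mathrm{tot}} \tag{5.23} \\
           &= -(x - \mu_x)^2 + \frac{\sum_i q_i \cdot (\delta x_i)^2}{q_\mathrm{tot}} \tag{5.24}
\end{align}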
and again using equation 5.18 leads to
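\begin{equation}
\sigma_x^2 = \frac{\sum_i q_i \cdot (\delta x_i)^2}{q_\mathrm{tot}} - \left( \frac{\sum_i q_i \cdot \delta x_i}{q_\mathrm{tot}} \right)^2. \tag{5.25}
\end{equation}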
It can be seen that the dependency on the central pad drops out completely and the
resolution depends only on δxi and (δxi)² ∈ {4, 1, 0}, apart from the charges qi and qtot.
The sum of the second part was already calculated for µx and can be reused.
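That the relative formulation gives exactly the same result as the direct one can be verified numerically. The following Python sketch compares equations 5.13 and 5.14 with equations 5.18 and 5.25 for one direction of a randomly filled cluster, using exact fractions to avoid rounding effects. The function names, the random charges and the chosen centre position are assumptions made for the test.

```python
import random
from fractions import Fraction


def direct(charges, positions):
    """Weighted mean and variance following equations 5.13 and 5.14."""
    q_tot = sum(charges)
    mean = Fraction(sum(q * x for q, x in zip(charges, positions)), q_tot)
    var = sum(q * (x - mean) ** 2 for q, x in zip(charges, positions)) / q_tot
    return mean, var


def relative(charges, offsets, centre):
    """Same quantities from the offsets to the peak, equations 5.18 and 5.25."""
    q_tot = sum(charges)
    first = Fraction(sum(q * d for q, d in zip(charges, offsets)), q_tot)
    second = Fraction(sum(q * d * d for q, d in zip(charges, offsets)), q_tot)
    return centre + first, second - first ** 2


random.seed(1)
# Pad offsets of the 25 bins of a 5x5 cluster (five rows of -2..2)
offsets = [d for _ in range(5) for d in (-2, -1, 0, 1, 2)]
charges = [random.randint(0, 1023) for _ in range(25)]
charges[12] = 1023          # central bin, guarantees a non-zero total charge
centre = 57                 # arbitrary pad position of the peak
positions = [centre + d for d in offsets]

print(direct(charges, positions) == relative(charges, offsets, centre))
```

The variance returned by `relative` does not use `centre` at all, in line with the statement that the dependency on the central pad drops out.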
Since all the clusters need to be accessed in the flp anyway, for a reordering, aggregation
and a compression before sending them over the network, it is possible to outsource parts
of the calculation to the cpu of the flp. In particular, the divisions by qtot and the square
root operation to get σx can be done in software more easily. Thus it was agreed that the
summations, given in equation 5.26, are done as precalculations in the cru:
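\begin{equation}
\begin{aligned}
q_\mathrm{tot} &= \sum_i q_i \\
\mu_{p,\mathrm{pre}} &= \sum_i q_i \cdot \delta p_i \\
\mu_{t,\mathrm{pre}} &= \sum_i q_i \cdot \delta t_i \\
\sigma_{p,\mathrm{pre}} &= \sum_i q_i \cdot (\delta p_i)^2 \\
\sigma_{t,\mathrm{pre}} &= \sum_i q_i \cdot (\delta t_i)^2.
\end{aligned}
\tag{5.26}
\end{equation}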
The values are then shipped to the flp where the final calculations, given in equation 5.27,
are done during the data handling in the cpu:
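\begin{equation}
\begin{aligned}
\mu_p &= p + \frac{\mu_{p,\mathrm{pre}}}{q_\mathrm{tot}} \\
\mu_t &= t + \frac{\mu_{t,\mathrm{pre}}}{q_\mathrm{tot}} \\
\sigma_p &= \sqrt{\frac{\sigma_{p,\mathrm{pre}}}{q_\mathrm{tot}} - \left( \frac{\mu_{p,\mathrm{pre}}}{q_\mathrm{tot}} \right)^2} \\
\sigma_t &= \sqrt{\frac{\sigma_{t,\mathrm{pre}}}{q_\mathrm{tot}} - \left( \frac{\mu_{t,\mathrm{pre}}}{q_\mathrm{tot}} \right)^2}.
\end{aligned}
\tag{5.27}
\end{equation}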
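The split of the work between cru and flp can be sketched in software. In the following Python example, the first function models the integer accumulations of equation 5.26, with shifts instead of multiplications, and the second one the floating-point part of equation 5.27. The function names and the row-major ordering of the 25 charges are assumptions made for this illustration; the firmware receives the bins in the order of figure 5.19b.

```python
import math


def times_offset(q, d):
    """q * d for d in {-2, -1, 0, 1, 2}, using a shift instead of a multiplier."""
    if d == 0:
        return 0
    shifted = q << (abs(d) - 1)
    return shifted if d > 0 else -shifted


def times_offset_squared(q, d):
    """q * d**2 for the same offsets, i.e. a factor of 0, 1 or 4."""
    return 0 if d == 0 else q << (2 * (abs(d) - 1))


def cru_presums(charges):
    """Five sums of equation 5.26 for a 5x5 cluster.

    Assumption: charges[5 * r + c] has time offset r - 2 and pad offset c - 2.
    """
    q_tot = mu_p = mu_t = sig_p = sig_t = 0
    for index, q in enumerate(charges):
        dt, dp = index // 5 - 2, index % 5 - 2
        q_tot += q
        mu_p += times_offset(q, dp)
        mu_t += times_offset(q, dt)
        sig_p += times_offset_squared(q, dp)
        sig_t += times_offset_squared(q, dt)
    return q_tot, mu_p, mu_t, sig_p, sig_t


def flp_final(presums, pad_peak, time_peak):
    """Divisions and square roots of equation 5.27, done in software."""
    q_tot, mu_p, mu_t, sig_p, sig_t = presums
    mean_p = pad_peak + mu_p / q_tot
    mean_t = time_peak + mu_t / q_tot
    # max() guards against tiny negative values caused by rounding
    sigma_p = math.sqrt(max(0.0, sig_p / q_tot - (mu_p / q_tot) ** 2))
    sigma_t = math.sqrt(max(0.0, sig_t / q_tot - (mu_t / q_tot) ** 2))
    return mean_p, mean_t, sigma_p, sigma_t


cluster = [0] * 25
cluster[12] = 100   # peak
cluster[13] = 100   # right neighbour in pad direction
print(flp_final(cru_presums(cluster), pad_peak=57, time_peak=200))
```

For this two-bin cluster the centre of gravity lies half a pad to the right of the peak and the width in time direction is zero.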
In addition to those five quantities, also the position of the peak in pad and time direction
needs to be included in the cdf, as well as the row number, the already mentioned flags
and the charge of the peak, the so called qmax. The qmax depends also on the original
energy loss of the particle and can therefore be used as a complementary method to identify
the particle. The bit width needed for this value is in the order of the original 10 bit
adc value but one additional bit is kept for a better precision due to the already applied
baseline correction. The widths of the flags, the row and the peak positions were already
discussed. For the other five quantities, it was decided to keep the maximum precision
in order not to bias the computations in the flp. This means that for the total charge
qtot, to which up to twenty-five 14 bit values can contribute, a width of 19 bit is needed.
For the precalculations of µx,pre, this looks different. Going through the 5× 5 matrix of
the cluster, one finds that ten values will contribute with a factor of |δxi| = 1 and ten
values with a factor of |δxi| = 2. The remaining five bins are not taken into account
because of a factor of |δxi| = 0. The maximum possible value fits therefore in a 19 bit
number. One additional bit is needed to encode the sign, since the value could also be
negative, giving in the end a 20 bit number. The σx,pre, on the other hand, has a factor
of (δxi)² = 4 involved, which is why the maximum possible result is also a 20 bit number,
however, this time without a sign, since all multiplications are done with positive numbers.
Taking all these widths into account one finds that the cdf needs to have at least 140 bit.
Since it is very convenient to work in software with 32 bit words, the size of a cdf was
fixed to five words, giving a space of 160 bit for the cluster with 20 spare bits for future
developments. How the individual numbers are distributed within these words is shown
in figure 5.21.
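The quoted bit widths follow from the largest values the sums can reach. The sketch below recomputes them with Python integers, assuming that every one of the 25 bins carries the maximum 14 bit value; the dictionary keys are hypothetical names for the fields of the cdf.

```python
import math

ADC_MAX = 2**14 - 1   # largest 14 bit input value
WORD_BITS = 32        # word size used in software

widths = {
    # 25 bins contribute with a factor of one
    "q_tot": (25 * ADC_MAX).bit_length(),
    # ten bins with |offset| = 1, ten with |offset| = 2, plus one sign bit
    "mu_p_pre": ((10 * 1 + 10 * 2) * ADC_MAX).bit_length() + 1,
    "mu_t_pre": ((10 * 1 + 10 * 2) * ADC_MAX).bit_length() + 1,
    # ten bins with offset**2 = 1, ten with offset**2 = 4, no sign needed
    "sigma_p_pre": ((10 * 1 + 10 * 4) * ADC_MAX).bit_length(),
    "sigma_t_pre": ((10 * 1 + 10 * 4) * ADC_MAX).bit_length(),
    # widths stated in the text
    "q_max": 11,
    "pad": 8,
    "time": 9,
    "row": 5,
    "flags": 8,
}

total_bits = sum(widths.values())
words = math.ceil(total_bits / WORD_BITS)
spare_bits = words * WORD_BITS - total_bits
print(widths, total_bits, words, spare_bits)
```

This only confirms the total size; how the fields are arranged inside the five words is a separate design decision.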
[Figure 5.21: bit layout (bits 0 to 31) of the five 32 bit words. Word 0: p and µp,pre; word 1: t and µt,pre; word 2: qmax and σp,pre; word 3: row and σt,pre; word 4: flags and qtot.]
Figure 5.21: The Cluster Data Format (cdf), consisting of five 32 bit words, contains
all relevant information about a cluster. The grey marked bits are currently unused and
must be set to 0.
Since the multiplications are trivial now, again an accumulation problem remains. This
kind of problem was solved before efficiently by utilising dsp blocks in the cm calculation.
Since the individual sums of equation 5.26 are decoupled from each other, it makes sense
to use one dsp block for each. In this case, again the “18 × 18 Sum of 2” mode of the
available dsps, shown already in figure 5.11a, has exactly the form which is needed. It
has many inputs, the pre-adder can be used either as adder or as subtractor and it can
accumulate internally a lot of data. The input port widths of 18 bit and 19 bit are also sufficient,
even when shifting the 14 bit input adc value by two bits for the multiplication by four, and
the output port width of up to 64 bit is also sufficient for each of the results of the sums.
The question arises how to utilise all of the four input ports (two for both pre-adders) in
order to use the dsps efficiently. The solution is to put an appropriate fifo in-between
the cf and the cp. There are a few reasons in favour of using an m20k for this.
First, it is rather big which reduces the danger of lost clusters due to a buffer overflow.
Second, and more important, those ram blocks can be configured with different widths of
the write and read ports. Having the read port four times as wide as the write port allows
reading four adc values in one cc. This factor of four fits perfectly the number of input
ports of the dsps. Even the number of words that belong to a cluster is 28, which can be
divided without remainder by four. So such a configuration of having a mixed-width fifo
in the beginning to always read four values at once and then a dsp which can accumulate
four values at once, fits perfectly. With this, the cp can also process the data four times
as fast as the data is transmitted by the cf. Keeping this in mind, each cp instance can
be used to process and collect the output of four cf instances. Since the data of all the
275 cfs need to be merged anyhow at some point to end up with two pcie endpoints, this
is a perfect first step of doing so. Then, additionally it reduces the resource consumption,
especially the number of needed dsps, if not every cf needs its individual cp. Having
multiple inputs to each cp requires then again a fifo for each input which are processed
round-robin.
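The mixed-width read-out described above can be illustrated with a small behavioural model. This is only a sketch in Python, not the firmware: the class name `MixedWidthFifo`, the helper `unpack` and the choice to place the first written value in the lowest bits of the wide word are assumptions made for illustration.

```python
from collections import deque

ADC_BITS = 14           # width of one value written by the cf
VALUES_PER_READ = 4     # read port is four times as wide as the write port
WORDS_PER_CLUSTER = 28  # number of 14 bit words belonging to one cluster


class MixedWidthFifo:
    """Toy model: one 14 bit value is written per cycle, 56 bit are read per cycle."""

    def __init__(self):
        self._values = deque()

    def write(self, value):
        self._values.append(value & ((1 << ADC_BITS) - 1))

    def can_read(self):
        return len(self._values) >= VALUES_PER_READ

    def read(self):
        # Pack four consecutive values into one wide word.
        # Assumption: the value written first ends up in the lowest bits.
        word = 0
        for i in range(VALUES_PER_READ):
            word |= self._values.popleft() << (i * ADC_BITS)
        return word


def unpack(word):
    """Split a 56 bit word into its four 14 bit values."""
    mask = (1 << ADC_BITS) - 1
    return [(word >> (i * ADC_BITS)) & mask for i in range(VALUES_PER_READ)]


fifo = MixedWidthFifo()
for value in range(WORDS_PER_CLUSTER):  # one cluster, written value by value
    fifo.write(value)

reads = []
while fifo.can_read():                  # read side: four values per cycle
    reads.append(unpack(fifo.read()))
```

In this model, the 28 written values of one cluster are emptied in seven reads, which is the factor of four between write and read side that the text relies on.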
With this, the cp can be designed as shown in figure 5.22. Four input fifos are shown,
each filled by a different cf. The fifos are emptied by a controller which ensures that
always a fifo with some content is read. In addition, those which are almost full are
preferred for the next reading cycle. The controller also feeds the individual dsps, each
responsible for accumulating the charges for a different sum, and writes the formed cdf to
an output fifo.
Chapter 5 – A 2D Cluster Finder for the TPC
[Block diagram: CF 0 to CF 3 → four input FIFOs → controller → five DSPs (µp,pre, qtot, µt,pre, σp,pre, σt,pre) → output FIFO → cluster]
Figure 5.22: Block diagram of the cp. Each cf writes its data to a fifo. A single cp
instance reads from four of those round-robin, calculates the cluster properties with the
help of dsp blocks and forms the cdf which is written to an output fifo.
The controller itself is composed of two small fsms, one for the reading process and one
for the calculation and write process.
The read-fsm must ensure that always 28/4 = 7 words are read from the same fifo;
otherwise the charges of different clusters would get mixed up. In addition, it must
always find the next fifo with content to ensure efficient processing, while a fifo which
is almost full must be preferred. The fsm also masks an input fifo once a so-called
hb-pattern was seen in it, until this pattern was seen in all input fifos. This is necessary to
provide synchronicity by keeping all clusters belonging to the same hb-frame together. Each
cf works independently of the others. They all receive the hb signal, which resets
the internal time counter and triggers the generation of this special reset-pattern. It is a
56 bit wide pattern (4 × 14 bit) which is inserted into the normal data stream. On the one
hand, this makes it necessary to detect the pattern; on the other hand, it simplifies
the data handling significantly because no additional path needs to be kept synchronous
with the individual data streams. The controller is designed such that it could in
principle handle an arbitrary number of input paths; one just needs to ensure that the data
can be processed fast enough in order not to generate back-pressure to the cfs. The output
of the cp is again a fifo, which makes it easy to put the cp into an individual clock
domain, faster than the baseline clock of 240 MHz. The Clock-Domain Crossing (cdc)
is then done with dual-clock fifos on both sides. By increasing the frequency by 1/4 to
300 MHz, one could add a fifth input path which would reduce the number of needed cps
from ⌈275/4⌉ = 69 to ⌈275/5⌉ = 55 and with that the needed resources. However, it was
found that there are enough resources available for the current design and that therefore
this optimisation is not needed at the moment. Keeping a lower frequency reduces the
(δxi)²   δxi  |
  +4     −2   |   11   12    6   20   19
  +1     −1   |   10    9    5   17   18
   0      0   |    2    1    0    3    4
  +1     +1   |   14   13    7   21   22
  +4     +2   |   15   16    8   24   23
              +--------------------------
        δxi       −2   −1    0   +1   +2
     (δxi)²       +4   +1    0   +1   +4
(rows: time direction t, columns: pad direction p)
(a) The cluster matrix with the δxi and (δxi)² next to the corresponding bins.

word 0:    0   H0    1    2
word 1:    3    4    5    6
word 2:    7    8    9   10
word 3:   11   12   13   14
word 4:   15   16   17   18
word 5:   19   20   21   22
word 6:   23   24   H1   H2
(each word spans bits 0 to 55)
(b) The seven data words from the cp input fifo, containing 4 adc values each.
Figure 5.23: Mapping of the cluster data to the cp input fifo.
effort needed by the fit and routing algorithms during the compilation procedure to achieve
timing closure. It should be kept in mind that the module is written in this configurable
way and the number of inputs can be set by a simple generic parameter.
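The trade-off between the number of inputs per cp and the required clock frequency can be written down in a few lines. The following Python sketch only restates the arithmetic of the text; the function name `cp_config` is hypothetical, and the linear scaling of the clock with the number of inputs is an assumption taken from the statement that a fifth input requires an increase by 1/4.

```python
import math

N_CFS = 275           # number of cf instances
BASE_CLOCK_MHZ = 240  # baseline clock of the ul
BASE_INPUTS = 4       # at the baseline clock one cp keeps up with four cfs


def cp_config(n_inputs):
    """Return (number of cps, required cp clock in MHz) for n_inputs cfs per cp."""
    n_cps = math.ceil(N_CFS / n_inputs)
    # Assumption: the processing rate has to scale linearly with the number of inputs.
    clock_mhz = BASE_CLOCK_MHZ * n_inputs / BASE_INPUTS
    return n_cps, clock_mhz


baseline = cp_config(4)   # current design
five_way = cp_config(5)   # optimisation discussed in the text
```

For four inputs this gives 69 cps at 240 MHz, for five inputs 55 cps at 300 MHz.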
The calculation-fsm, on the other hand, must ensure that the input data to all four ports
of the five dsps is always valid. Four adc values are always read at once from the input
fifos, so they must fit the requirements of the dsps. For the qtot summation, this is
straightforward. The corresponding dsp is configured such that all four input ports are
summed up (both Pre-adders are configured for an adding operation) and accumulated over seven
ccs. Only the header data, containing the additional cluster information, in word 0 and
word 6 of figure 5.23b must be excluded by setting the input to 0 in those cases. This figure
shows how the individual cluster bins are grouped into the seven words of the mixed-width
fifo output. Each word contains four adc values. The readout order of the cluster charges
is shown again in figure 5.23a. Starting from the centre, first the horizontal and vertical
axes are read, afterwards the four corners in the shown order. The matching pre-factors
for the mean and sigma summations (δxi and (δxi)²) are written below for the calculation
in pad direction and on the left side for the calculation in time direction. They become
important now for the computation of the weighted mean. For the sake of simplicity, we
concentrate on the pad direction. It works analogously in time direction. To calculate the
sum, each of the pads in the columns above |δxi| = 2 needs to be multiplied by two and
then either subtracted from or added to the sum. To subtract a value, one could either
calculate the two’s complement which is then added, or use the available Pre-adders of
the dsp in subtraction mode. Using the second option gives a dsp with two input ports
which are always added to the sum and two ports which are always subtracted from the
sum. In principle this appears to be perfectly fine, since half of the bins need to be
added and the other half subtracted. However, to make the whole process work,
there must be at most two values for the addition and two values for the subtraction in each
word, because the setting of the Pre-adder must be known at compile time and cannot be
 δxi  (δxi)²   word 0   word 1   word 2   word 3   word 4   word 5   word 6
 −2     +4        2                10     11, 14     15
 −1     +1        1                 9     12, 13     16
 +1     +1                 3                         17     20, 21     24
 +2     +4                 4                         18     19, 22     23
(a) Advancing individual bins to fit the requirements of the cp processing in pad direction.
The encircled numbers are shifted one word ahead (or the current word is delayed by one).

 δxi  (δxi)²   word 0   word 1   word 2   word 3   word 4   word 5   word 6
 −2     +4                 6              11, 12            19, 20
 −1     +1                 5      9, 10             17, 18
 +1     +1                          7     13, 14            21, 22
 +2     +4                          8               15, 16            24, 23
(b) Advancing individual bins to fit the requirements of the cp processing in time direction.
The encircled numbers are shifted one word ahead (or the current word is delayed by one).

Table 5.6: Mapping of the cluster data to the dsp ports.
changed during run time. Unfortunately, that is not the case, as can be seen in table 5.6a
for the pad direction (and table 5.6b for the time direction). Going through the individual
words from 0 to 6, the relevant bins (those not multiplied by zero) are sorted into the
row with the correct pre-factors. The problematic ones are words 3 and 5 in pad direction,
each of which contains four values that all need to be subtracted or all need to be added.
To overcome this issue, a shift register with only two elements can be built for the words.
With this, it is possible to select from two consecutive words for the dsp input, which
effectively shifts some values to the previous word. By applying those shifts, which are
indicated by arrows, to the encircled bins, it can be achieved that each word indeed
contains at most two values which need to be subtracted and two which need to be
added to the total sum. Further, it can also be achieved that only a single value needs
to be multiplied by +2 and a single value by −2 in each word. This allows the bit-shifts
for the multiplications to be hardcoded and assigned to the individual ports
of the dsp. To summarise, with the trick of delaying the words by one cc, a dsp can be
utilised where one port is used for the contributions with δxi = −2, one port for δxi = −1,
one port for δxi = +1 and one port for δxi = +2, and thus compute µp,pre.
The same method can also be used for the σp,pre dsp with the only difference that the
Pre-adders are not used for a subtraction and that the multiplication-bit-shift is by two
instead of one. The calculation in time direction is analogous, only other bins (shown in
table 5.6b) need to be assigned to the respective dsp ports. With this, all the needed
cluster properties can be calculated in an efficient and resource-saving way.
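The port assignment in pad direction can be checked with a short behavioural model. The Python sketch below is not the firmware and does not reproduce the exact shifts of table 5.6: the greedy rule in `schedule` (take the earliest free slot among the previous and the current word) is an assumption chosen for illustration, as are all function names. It takes the readout order from figure 5.23a, assumes that the weighted sums are Σ δxi·qi and Σ (δxi)²·qi, and verifies that one value per port and slot is sufficient.

```python
# Readout order of the 5x5 cluster matrix (rows: time, columns: pad), figure 5.23a.
ORDER = [
    [11, 12, 6, 20, 19],
    [10, 9, 5, 17, 18],
    [2, 1, 0, 3, 4],
    [14, 13, 7, 21, 22],
    [15, 16, 8, 24, 23],
]
DELTAS = [-2, -1, 0, 1, 2]
# Pad offset of every bin, indexed by its readout number.
DELTA_PAD = {ORDER[r][c]: DELTAS[c] for r in range(5) for c in range(5)}


def word_of(bin_no):
    """fifo word holding a bin; word 0 additionally holds one header value."""
    return (bin_no + 1) // 4


def schedule(delta):
    """Map processing slot -> bin for all bins with the given pad offset.

    A bin from word w may be handled in slot w - 1 (advanced) or in slot w.
    Illustrative greedy choice: take the earliest of the two that is free.
    """
    slots = {}
    for b in sorted(b for b, d in DELTA_PAD.items() if d == delta):
        w = word_of(b)
        slot = w - 1 if w >= 1 and (w - 1) not in slots else w
        assert slot not in slots, "two values would collide on one port"
        slots[slot] = b
    return slots


PORTS = {d: schedule(d) for d in (-2, -1, 1, 2)}  # one dsp port per pad offset


def port_value(q, delta, slot, shift):
    """Value on one port in one slot: the bit-shifted charge, or 0 if unused."""
    bin_no = PORTS[delta].get(slot)
    return 0 if bin_no is None else q[bin_no] << shift


def mu_p_pre(q):
    """Two ports are added, two are subtracted; |delta| = 2 is a shift by one bit."""
    acc = 0
    for slot in range(7):
        added = port_value(q, 1, slot, 0) + port_value(q, 2, slot, 1)
        subtracted = port_value(q, -1, slot, 0) + port_value(q, -2, slot, 1)
        acc += added - subtracted
    return acc


def sigma_p_pre(q):
    """All ports are added; |delta| = 2 is a shift by two bits (factor four)."""
    acc = 0
    for slot in range(7):
        acc += sum(port_value(q, d, slot, 2 if abs(d) == 2 else 0) for d in PORTS)
    return acc
```

Here `q` is a list of the 25 charges indexed by readout number. Both functions use only hardcoded shifts and fixed add or subtract roles per port, which is the property the delayed-word trick is meant to provide.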
The cf is required to insert a special pattern into the data stream upon receiving the hb
trigger. After the cdf is formed, such an approach is not needed anymore. From here
on, a single not yet used bit in the flags-field is used to identify a so called reset-cluster
in the following logic. After all input fifos present the reset-pattern, such a reset-cluster
[Diagram: stage 1 with 16 mergers (input FIFOs → controller → output FIFO) feeding stage 2 with 4 mergers (input FIFOs → controller → output FIFO)]
Figure 5.24: The fifo merging network. The sixteen mergers of stage one work with
four or five input fifos, while the four mergers of stage two all have four input fifos.
is generated and written to the output fifo to mark the transition from one hb to the
next one.
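The selection and masking behaviour of the read-fsm described in this section can be summarised in a behavioural sketch. The Python code below is an assumed reading of the text, not the actual fsm: the function names, the representation of the status flags as lists of booleans and the exact tie-breaking between several almost-full fifos are hypothetical.

```python
def next_fifo(last, empty, almost_full, masked):
    """Pick the next input fifo to read seven words from, or None if none is ready.

    empty, almost_full and masked hold one boolean per input fifo. Assumed
    priority: almost-full fifos first, otherwise round-robin over all fifos
    with content, starting after the one read last. Masked fifos are skipped.
    """
    n = len(empty)
    order = [(last + 1 + i) % n for i in range(n)]
    candidates = [i for i in order if not empty[i] and not masked[i]]
    for i in candidates:
        if almost_full[i]:
            return i
    return candidates[0] if candidates else None


def update_mask(masked, fifo, saw_hb_pattern):
    """Mask a fifo once its hb-pattern was read; release all when every fifo is masked.

    Returns (new mask, emit_reset_cluster).
    """
    masked = list(masked)
    if saw_hb_pattern:
        masked[fifo] = True
    if all(masked):
        # The pattern was seen in all inputs: write one reset-cluster and continue.
        return [False] * len(masked), True
    return masked, False
```

Because the functions take the number of inputs from the length of the flag lists, the same sketch also covers the configurable number of input paths mentioned in the text.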
5.3.8 Cluster Merging Network
The last component of the ul, before the data can finally be transmitted by the dma
engine, is the merging network. As described before, there is a huge cf grid consisting of
275 independent instances. A first step of merging the individual output data streams is
already done with the cps. Each of those modules handles the data from four cfs and
merges the corresponding streams to one. But there are still 69 cps left and only two pcie
endpoints. Those two pcie endpoints have a bus width of 256 bit each, while the data to be
transferred are clusters with a width of 160 bit. To utilise the available bandwidth to the
host machine most efficiently, two clusters must therefore be accessible at the same time
to fill the available 256 bit. Since the two endpoints need to be independent from the
point of view of the software in the flp, a cluster cannot be split with one part of the
data written to one endpoint and the other part to the other. Therefore, access to two
clusters is needed for each of the endpoints, that is, four final output streams in total.
This means that the 69 streams of the cps need to be merged to four final streams.
This does not need to be done in one step but can be implemented in multiple stages.
By reusing parts of the logic of the cp, small merger instances can be built which read
the clusters from 4 or 5 fifos and write them into a single one. With those sizes, the
merging is achieved in two stages (69 ⇒ 16 ⇒ 4), as indicated in figure 5.24. The
first stage of the merging network needs in total sixteen mergers, eleven instances working
with four inputs and the remaining five with five different input fifos. The second stage
consists of four final mergers, each combining the data of four streams.
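The distribution of the input streams over the mergers of both stages follows from a simple division with remainder. The Python sketch below reproduces the numbers given above; the helper `split_inputs` is a hypothetical name and the even distribution is an assumption that happens to match the stated configuration.

```python
def split_inputs(n_streams, n_mergers):
    """Distribute n_streams over n_mergers as evenly as possible.

    Returns the number of inputs of each merger, smaller mergers first.
    """
    base, extra = divmod(n_streams, n_mergers)
    return [base] * (n_mergers - extra) + [base + 1] * extra


stage1 = split_inputs(69, 16)  # 69 cp streams into 16 mergers
stage2 = split_inputs(16, 4)   # 16 intermediate streams into 4 final streams
```

Stage one then consists of eleven mergers with four inputs and five mergers with five inputs, stage two of four mergers with four inputs each.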
A single merger consists of two parts, a controller and an output fifo. The controller
takes care of selecting the correct input and of detecting the reset-cluster, after which
Family                         Arria 10
Device                         10AX115S3F45E2SG
Timing Models                  Final
Logic utilisation (in alms)    362 946 / 427 200 (85 %)
Total registers                638 092
Total pins                     342 / 960 (36 %)
Total virtual pins             0
Total block memory bits        26 633 212 / 55 562 240 (48 %)
Total ram blocks               2 314 / 2 713 (85 %)
Total dsp blocks               416 / 1 518 (27 %)
Total hssi rx channels         41 / 72 (57 %)
Total hssi tx channels         41 / 72 (57 %)
Total plls                     56 / 144 (39 %)
Table 5.7: Fit result summary of the complete fw with the tpc ul.
the corresponding input is masked until this cluster was seen in all inputs, identical to
the logic in the cp. Then again, a new reset cluster is generated at the output and the
emptying of the input fifos is continued. The module is written in a very generic way.
The number of inputs is freely configurable (during compile time) which makes it easy to
optimise the network according to the required bandwidth and fluctuations in data rate.
All the fifos are dual-clock fifos which opens the possibility, again similar to the cp, to
increase the clock frequency in this part of the ul to speed-up the merging if needed. The
cdc is again handled by the fifos. In general, it is sufficient for the intermediate fifos to
be rather shallow, since they are only needed to implement the merging in a convenient
way. If they run full, the logic introduces back-pressure on the previous stages, including
the cp. Therefore, it is sufficient if the very first and the very last fifos are deep: the very
first ones to compensate for this back-pressure and the very last ones to have enough data
available so that the dma engine does not run empty and no precious transmission time
is lost. For those two cases, two different fifos are implemented in the module which can
be selected with a generic parameter: a shallow one with 32 elements and a deep one with
512 elements. With that, the same module can be used for all stages. Since the controller
works only with the status signals of the fifo, like the full and empty flags, the merger
module can easily be extended with differently sized fifos if at any time in the future it is
seen that another size would be more appropriate.
5.4 Resource Consumption
After all modules are ready, the fw can be compiled for the cru. To anticipate the outcome,
the fit is successful and there are enough resources for all the logic of the tpc ul, although
the fpga is quite full with a utilisation of 85 % of the available Logic Elements (les). The fit
summary as it is usually output by the software¶ is shown in table 5.7. The design which was
¶Intel Quartus Prime 17.0 Pro edition was used.
Module                        alms      %    m20ks      %    dsps      %
ul glue logic                 21.2    0.0        0    0.0       0    0.0
Link mux                   2 875.6    0.7        0    0.0       0    0.0
gbt decoder               10 263.7    2.4        0    0.0       0    0.0
Pre-sorter                 3 889.3    0.9       40    1.5       0    0.0
Row-segment merger        40 067.4    9.4        0    0.0       0    0.0
blc                        4 248.0    1.0      116    4.3      54    3.6
Clusteriser              167 905.4   39.3      906   33.4     362   23.8
    Glue logic             2 251.0    0.5        0    0.0       0    0.0
    All 275 cfs          119 219.5   27.9        0    0.0      17    1.1
    All 69 cps            41 688.0    9.8      826   30.4     345   22.7
    All 20 fifo merger     4 749.2    1.1       80    2.9       0    0.0
Readout gate               3 889.7    0.9        0    0.0       0    0.0
Configuration             10 024.0    2.3        0    0.0       0    0.0
∑                        243 184.3   56.9    1 062   39.1     416   27.4
Table 5.8: Resource consumption of the individual modules of the tpc ul. The modules
of the clusteriser are listed separately.
fitted contains the complete tpc ul and also the common parts of the fw for 24 gbt links.
The only missing module is the cm calculation module: its implementation was not part of
this thesis and it is not yet ready, so it could not be included in the fit. Although the
fpga is already quite full with a usage of 85 % of the basic les, the alms, there is still
enough space for this additional module. Test compilations showed that the cm calculation
module should not need more than a few percent of the alms, 21 dsps for the accumulation
and maybe some ram cells to store the pedestal values.
Table 5.8 lists the resource consumption for the individual modules of the ul in a
more differentiated way. For each group of modules, there is the number of used alms,
the number of used m20ks and the number of used dsp blocks given, together with the
percentage share of all available resources in the fpga. With this, it can nicely be seen
which modules are the biggest consumers. First, there are three more modules listed which
were not discussed before in detail, because their implementation is rather simple. That is a
link Multiplexer (mux) with which the 24 input links can be freely assigned to one of the 20
implemented processing paths. There are 24 input links used to provide a higher flexibility
during the connection of the fee and because one fibre trunk contains 24 individual optical
fibres. The readout gate is also kept very simple: it composes the Raw Data Header (rdh)
and combines the four final fifos to the two dma endpoints (two times 2 ⇒ 1) via a
small fsm. The configuration module is somewhat more extended, as can be seen in the
consumption of the alms but is also very simple. It provides the registers to read from
and write to by instantiating the corresponding modules provided by the central team.
Otherwise, the result is as expected. There are two big consumers, the clusteriser and
next the row-segment merger, while the other modules are in principle negligible in terms
of resource consumption. The 119 219.5 alms (or 27.9 %) which are needed for the cfs appear
at first to be a huge number. However, one must consider that those resources are needed
for 275 individual modules, leading to an average consumption of
119 219.5 alms/275 = 433.5 alms per cf. This is a reasonable number, keeping in mind that
the cf consists of a shift register for a whole time bin, a comparator array for the peak
finding, the local storage for eight time bins and the relatively complex read controller.
The next type of resources are the dsp blocks. The result is close to the expectation.
The 18 blc modules need three dsp blocks each, giving in total the shown 54. The 69 cps
use five blocks each to accumulate the five different sums which results in 345 needed dsps.
An interesting observation is that in 17 randomly distributed cfs a constant multiplication
by 6 which is needed to calculate the pad offset was implemented by the fitter, using a dsp
block. This was probably done to reduce the consumption of alms and might be a place
for further optimisations. Since so far only 27 % of the dsp blocks are used (there are more
than 1 000 still available), one could do the complete pad offset calculation (multiplying
the externally given cf number within a row by 6 and subtracting 2) in all cfs in this way
and save some alms with that.
In order to interpret the consumption of the m20ks, it is important to note that in
some cases the fitting algorithm can reduce the utilisation of alms by increasing the usage
of m20ks blocks. This is particularly the case when it comes to the implementation of
memory. There is also the obvious case of the 20 pre-sorters, each using two m20ks in a
ping-pong ram configuration, giving exactly the shown 40 blocks. The 116 rams used in
the blc modules are initially unexpected, as this means that on average each module uses
116/18 = 6.44 blocks, although it was not intended that the modules would use any at all.
The only memory needed here is for the luts, which should be implemented in
mlabs since they have only a few entries. However, the instantiated ip core for the rams
was configured such that the choice of the ram type is explicitly left to the fitter.
In this case, the algorithm used five m20ks instead of mlabs for the implementation
of these luts. The additional one or two ram blocks are used in some of the modules for a
24 bit shift register which was implemented to delay the valid signals by five ccs until the
blc calculations are done. Both decisions of the fitter can be interpreted as an attempt
to reduce the alm consumption by increasing the use of m20ks. The clusteriser is also the
biggest consumer of those resources. Each of the four final mergers indeed uses four
m20ks, as it should, to implement a deep fifo in this place. The cluster width of 160 bit
requires four blocks, each with a width of 40 bit. However, the fifos of the mergers of the
other stage are also implemented with m20ks. Again, the cores are configured such that it
is up to the fitting algorithm to select the ram type. Before going into the final system,
one should perhaps explicitly set the type to m20k here as well: since the fitter uses them
anyway to reduce the alm consumption, one can take full advantage of them and make the
fifos of the intermediate mergers deeper, too. Keeping in mind that the maximum width of
an m20k is 40 bit, the consumption of the cps is also no surprise. The output fifo was
intended to be a shallow one, but here, too, the fitter uses four m20ks for each cp. The
input fifos, in total one per cf, had to be m20ks because of the mixed-width configuration.
Since the output port had to have a width of 4 · 14 bit = 56 bit, two blocks are needed.
This gives in total 275 · 2 + 69 · 4 = 826 blocks.
In summary, there are no surprises and the overall resource consumption is reasonable.
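The m20k count of the cps can be reproduced from the port widths alone. The following Python sketch restates this arithmetic; the function name `m20ks_for_width` is hypothetical, and the sketch assumes that the fifo depth fits into a single block per 40 bit slice, so that only the width determines the number of blocks.

```python
import math

M20K_MAX_WIDTH = 40  # bit, maximum port width of one m20k as stated in the text


def m20ks_for_width(width_bits):
    """Number of m20ks needed side by side to reach the requested port width."""
    return math.ceil(width_bits / M20K_MAX_WIDTH)


input_fifo_blocks = m20ks_for_width(4 * 14)  # 56 bit read port of the mixed-width fifo
output_fifo_blocks = m20ks_for_width(160)    # one cdf of 160 bit per entry
total_cp_blocks = 275 * input_fifo_blocks + 69 * output_fifo_blocks
```

With two blocks per input fifo and four per output fifo, this yields the 826 blocks listed for the cps in table 5.8.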
Figure 5.25 shows a visualisation of the fpga resources used by the individual
modules, generated with the Chip Planner tool of the Quartus software. As can be seen in
figure 5.25a, there is a big region in the centre of the chip with a lot of interconnections,
shown in pink. The utilisation of the routing resources is at these locations close to 100 %.
The origin of those is shown in figure 5.25b and figure 5.25c. The part on the bottom right
is due to the gbt wrappers where the used les are indicated in blue which are placed close
to the io-cells (Input/Output-cells). The part of the high routing utilisation in the centre
of the fpga comes from the row-segment merger, shown in cyan. The dma engine (in red)
was placed by the algorithm again close to the io-cells. The green elements symbolise the
pre-sorter logic which is located between the gbt wrapper and the row-segment merger.
The complete top part of the fpga is occupied by the clusteriser, shown in yellow, placed
around the merger module.
(a) Routing utilisation.
(b) Used les by the dma engine in red and by the gbt wrapper in blue.
(c) Used les by the pre-sorter in green and by the row-segment merger in cyan.
(d) Used les by the clusteriser in yellow.
Figure 5.25: Visualisation of the utilisation of the fpga. The different module groups
are shown in different colours on top of the general utilisation of the routing resources.
Chapter 6
Performance and Validation
The majority of the validation of the User Logic (ul) must take place in simulation due
to the timeline of the tpc upgrade program. To be able to test all modules together in a
Common Readout Unit (cru), the upgraded tpc would be needed: a drift volume which
generates realistic clusters, a gem amplification system and the new Front-End Cards
(fecs) generating the gbt frames with the data content, read out by a cru with the
tpc ul included in the Firmware (fw). Further, to fully qualify the processing which
was done in the cru, the alice O2 framework is needed for the decoding of the data
stream from the cru, tracking and Quality Assurance (qa) tasks. But since the tpc
will not be ready for such tests before the end of 2019 [32], most modules can only be
validated in simulation. The simulation results are discussed in section 6.1. However,
there was a test beam campaign in May 2017 to test a preproduction Inner Readout
Chamber (iroc) and six of the new fecs at the cern Proton Synchrotron (ps). This was
also used to test the first modules of the ul, the gbt decoding, and verify their correct
functionality. The procedure and the limitations are described in section 6.2. Since the
Cluster Finder (cf) algorithm was also implemented in software for the O2 framework,
the unavoidable differences between the two implementations and the performance are
discussed in section 6.3.
6.1 Performance of the User Logic Modules in Simulation
A validation of the fw with a real system during the development phase was excluded
already from the beginning simply due to the time scale of the upgrade program. The
development of the ul was started well ahead to have the fw ready as soon as the detector
is upgraded. With this it will be possible to read out the detector as soon as the gem
based rocs are mounted and the electronics is installed. The disadvantage for the fw
development is then clearly that one has to rely on a proper simulation. Therefore, great
emphasis was put on the simulation of each individual module. An individual test bench
is written for each element of the ul, testing all the provided functionalities and each
expected source of error. Those test benches are then simulated with ModelSim [60] and,
where possible, the behaviour of the Unit Under Test (uut) is automatically compared to the
expected behaviour. These simulations are used not only for the development and debugging, but are
also executed automatically without manual intervention after the code in the repository
has changed. If even one of the simulations fails, an automatic build of the fw is prevented.
This ensures that the provided fw is always fully functional and behaves as expected.
6.1.1 Decoding of the GBT Frames
The decoding of the gbt frames is the basis of all the processing steps afterwards. As
already described in subsection 5.2.1, the decoder itself is quite a complex entity. It
consists of three identical modules to monitor the adc sampling clock, must reassemble
the Half-Words (hws) from the sampas, decode the five individual data streams which
implies a detection of the synchronisation pattern, and finally provides a defined output
stream by merging the result of the five channel decoders. So in addition to the combined
simulation, those smaller modules are simulated individually to reduce the complexity.
Clock Monitor
This module must provide two basic functionalities, first the general recognition of the
adc clock in all four possible phases. The module must lock to the rising edge and check
that all subsequent patterns are as expected. Second, an error must be reported if the
signal on the input port does not follow the correct pattern.
This can easily be simulated by feeding the right sequences into the module, one phase
after the other. The module correctly locks to the detected rising edge of the input pattern.
After a complete sequence, the error flag is taken down, indicating that the adc clock is
now detected as a valid one. During the switching between the phases an error is reported
because the pattern is now a different one. Afterwards, the module correctly locks to the
new one. This behaviour is also compared to a predefined pattern for the automatic test
cases. In addition, random bit-errors are introduced which are also correctly found by the
clock monitor.
Detection of the Synchronisation Pattern
The basis of the decoding of the channel number is the detection of the synchronisation
pattern. Again, the pattern can occur in all four possible phases. So it must always be
detected and the correct phase must be reported to the higher-level module. The approach
is the same as for the clock monitor, the correct sequences are applied to the module, one
phase after the other. The module reliably recognises every pattern and reports the correct
phase together with the signal for a completed sequence.
Channel Decoder
The channel decoder combines the detection of the synchronisation pattern with the
assembly of the 10 bit adc values. This assembling needs the phase of the synchronisation
pattern to couple the correct hws together, also involving a buffering of the last received
hws for the case that the data is split across two gbt frames (compare sequences 1 to
3 of figure 5.5). In addition, the channels must be counted in order to generate an id.
Figure 6.1: Simulation of a complete gbt decoder with ModelSim. The image shows
the data interface of the simulated module (cf. with the timing diagram in figure 5.6).
To show the correct functionality, again all four phases are tested after each other. An
hw stream is generated, consisting of the synchronisation pattern at the beginning and
adc values afterwards. Since it does not matter which adc values are decoded, it is very
convenient to use a counter for the adc values for the automatised testing procedure. Each
decoded value is then greater by one than the previous one. This allows an automatic
verification of the reconstructed adc value without keeping track of the input data.
The decoder reliably detects the synchronisation pattern and provides the counter values
on the two output ports, together with a valid flag and an id which indicates the channel
number. By simply counting how often the valid flag was present, it is verified that the
correct adc value was decoded. Port 0 of the decoder then shows a number, always twice
as high as the valid-flag-counter, and port 1 shows a number which is higher by one. Since
the id is in principle also just a counter from zero to seven, it can be verified with the
decoded values. Both start at zero at the same time, which is why id = ⌊adc/2⌋ mod 8
must always be true and can therefore be used for the verification.
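The automatic check of the channel decoder output can be expressed as a short reference model. The Python sketch below is an illustration of the three conditions named in the text, not the actual test bench; the function name `check_decoder_output`, the tuple layout of the samples and the wrap-around of the counter at 10 bit are assumptions.

```python
ADC_BITS = 10  # width of the decoded adc values


def check_decoder_output(samples):
    """Check the decoder output against the embedded counter.

    samples: iterable of (port0, port1, channel_id), one tuple per cycle in
    which the valid flag was set, in the order of arrival.
    """
    for valid_count, (port0, port1, channel_id) in enumerate(samples):
        # Port 0 is twice the valid-flag counter (assumed to wrap at 10 bit).
        if port0 != (2 * valid_count) % (1 << ADC_BITS):
            return False
        # Port 1 is higher by one.
        if port1 != port0 + 1:
            return False
        # The id counts from zero to seven in step with the decoded values.
        if channel_id != (port0 // 2) % 8:
            return False
    return True


# Example: an ideal output stream of 600 valid cycles.
good = [((2 * k) % 1024, (2 * k) % 1024 + 1, k % 8) for k in range(600)]
```

Since the reference values are derived from the number of valid cycles alone, the check does not need to keep track of the input data, as stated above.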
The complete GBT Decoder
After the individual components are successfully simulated, a complete gbt decoder can
be validated. The input to a gbt decoder are, apart from the configuration, the gbt
frames. They must be generated according to the description in subsection 5.2.1. For
the automatic validation, again the same approach is used as for the individual channel
decoder, a counter is embedded in the frame, replacing the adc values. Besides the counter,
the four lsb of each adc value are used to identify the half-sampa so that the content of
the output interface, presented in table 5.1, can be verified. An excerpt of the simulation
with ModelSim is shown in figure 6.1. The same signals are displayed as in the timing
diagram in figure 5.6, on top the used 240 MHz clock as a reference, below the input gbt
frames consisting of the 80 bit wide data field of the gbt protocol and the 32 bit wide fec
field with the additional data of the wide bus mode. The valid-flag is used to sample the
input data. The output signals are shown at the bottom, consisting of a valid-flag, an id
and the two adc data ports.
As can be seen, the valid signal is active for five out of six Clock Cycles (ccs), exactly
as expected. Second, the id increases by one with each rising edge of the valid signal. The
figure contains only the first three cycles since in a further zoomed-out version, the adc
values would no longer be visible. The id increases from 3’h0∗ to 3’h2 in the first three
cycles. This is continued in the simulation until 3’h7 is reached and then starts again with
3’h0. The content of the data ports is shown as 10 bit hexadecimal numbers, so the last
digit corresponds to the four lsb with the half-sampa number (di in table 5.1). It can
nicely be seen that this number increases with each cc in which the data is valid. The
other two digits show the transmitted counter which is the same for all five half-sampas.
Port 0 has always the even numbers and port 1 the odd numbers, both increasing with
each new gbt frame and therefore with each valid-cycle of the data output. All this is
checked in the automatic verification procedure and demonstrates the correct functionality
of the gbt decoder.
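The checks described above can be sketched in a few lines of Python. This is only an illustration and not the test bench itself: it assumes that the six msb of each 10 bit output word carry the frame counter, that the four lsb carry the half-sampa number counting from 0 to 4, and that the counter on port 0 is even while the one on port 1 is odd. All function names are hypothetical.

```python
def split_word(word):
    """Split a 10 bit output word into (counter, half-SAMPA number)."""
    if not 0 <= word < (1 << 10):
        raise ValueError("not a 10 bit value")
    return word >> 4, word & 0xF


def check_frame(port0, port1):
    """Return True if the five valid cycles of one frame look as expected.

    port0, port1: lists of five 10 bit words, one per valid clock cycle.
    """
    if len(port0) != 5 or len(port1) != 5:
        return False
    c0, _ = split_word(port0[0])
    c1, _ = split_word(port1[0])
    # Assumption: port 0 carries an even counter, port 1 an odd one.
    if c0 % 2 != 0 or c1 % 2 != 1:
        return False
    for cycle in range(5):
        for word, counter in ((port0[cycle], c0), (port1[cycle], c1)):
            cnt, half_sampa = split_word(word)
            # The counter is constant within a frame, the four LSB count up.
            if cnt != counter or half_sampa != cycle:
                return False
    return True


def next_id(frame_id):
    """3 bit id: 3'h7 is followed by 3'h0."""
    return (frame_id + 1) & 0x7
```

A frame with counter 4 on port 0 and counter 5 on port 1 passes this check, whereas swapping the two ports makes it fail.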
6.1.2 Sorting Algorithm
The approach which was used for the validation of the gbt decoder needs to be improved in
order to be used for the validation of the sorting algorithm, too. The general functionality
of the two modules, the pre-sorting together with the merging of the segments, needs to
be checked, as well as the configurations of the pre-sorting module. Those configurations,
which are 40 times the two read addresses in the correct order together with the control
command for the row-breaking, are mostly unique for each of the 182 half-fecs, only
20 randomly distributed configurations are by chance equal (e.g. the configuration for fec 5
in region 1 is equal to the one for fec 8 in region 3).
Since there are a lot of cases it would be quite difficult to write all checks to the test
bench file without errors. Therefore, a slightly different approach was used. A small gbt
frame generator was written as a fec emulator. It embeds either the channel number
or the sampa chip number or the fec id as the adc value into the frames, depending
on the setting. As a reminder, the sampa id and channel number are different for the
different regions as it was shown in table 5.1. Having this in place, together with the
already validated gbt decoder, an input stream with a defined sequence is generated.
The Pre-Sorter
To validate the functionality of the pre-sorter and the configurations, two runs are performed
for each individual configuration (for simplicity reasons the few duplicate configurations
are not handled separately but are just simulated again), one run with the frame generator
configured to send the sampa channels and a second one with the sampa chip number.
The fec id is not needed because this part of the data preparation path is still done for
each input link (and with that for each fec) individually. The resulting segment of the pad
plane is written to a file where it is checked in a second step with a small C-macro using the
O2 framework. As already mentioned, the correct mapping is part of the framework. The
written files look as shown in figure 6.2, on the left side with the channel numbers
at the pad positions and on the right side with the sampa chip id. By confirming that
all sampa channels from the individual sampa chips (therefore the two runs) are written
to the correct pad positions, the overall functionality and the configured mappings are
validated. Since this approach needs the O2 framework which was not yet installed on
∗Notation for the hexadecimal number 0x7 with 3 bit.
6.1 – Performance of the User Logic Modules in Simulation
1 18 16 17 19 21 0 0
2 24 22 20 23 25 0 0
3 28 26 27 29 31 0 0
4 2 0 30 1 3 0 0
5 8 6 4 5 7 0 0
6 12 10 9 11 13 0 0
7 18 16 14 15 17 19 0
8 22 20 21 23 25 0 0
9 28 26 24 27 29 0 0
10 2 0 30 31 1 3 0
11 8 6 4 5 7 9 0
12 12 10 11 13 15 0 0
13 18 16 14 17 19 21 0
14 24 22 20 23 25 27 0
15 30 28 26 29 31 0 0
(a) The sampa channels.
1 2 2 2 2 2 0 0
2 2 2 2 2 2 0 0
3 2 2 2 2 2 0 0
4 3 3 2 3 3 0 0
5 3 3 3 3 3 0 0
6 3 3 3 3 3 0 0
7 3 3 3 3 3 3 0
8 3 3 3 3 3 0 0
9 3 3 3 3 3 0 0
10 4 4 3 3 4 4 0
11 4 4 4 4 4 4 0
12 4 4 4 4 4 0 0
13 4 4 4 4 4 4 0
14 4 4 4 4 4 4 0
15 4 4 4 4 4 0 0
(b) The sampa chips.
Figure 6.2: The content of the pre-sorter mapping files for fec 0 in region 1. On the
left side with the channel numbers of the sampa chips and on the right side with the
sampa chip id. The row numbers are indicated by the line numbers on the left sides,
counting from the top down to the bottom (reversed order compared to e.g. figure 5.7).
Each row segment has seven elements, filled by default with a 0 if no other value is set.
With this information, each channel of a half-fec is uniquely identified.
the build server on which the automatic testing procedures are executed, some manual
interventions are needed. However, all necessary scripts and commands are included in the
repository and can easily be run on a properly prepared machine.
The Row-Segment Merger
For the confirmation of the merger module, the same concept is used. However, the
simulation must be extended substantially, since the merger is the first module (apart
from the Common Mode (cm) calculation) which combines the data from all input links.
Therefore, the simulation must contain twenty frame generators (this time with a different
fec id setting for each instance), one for each input link. Afterwards twenty gbt decoders
are needed and also twenty of the pre-sorter modules. This quite extensive setup is shown
in figure 6.3. Up to this point, everything is already validated and can therefore be used.
Since there is only one merger module in each cru, the mapping must be validated for each
of the ten regions separately. Within each region, the frame generators must be configured
with the right fec id and the single pre-sorter instances need to be configured individually
for the respective location. Having this, the simulation must run three times, one for each
setting of the frame generator to be able to write the corresponding pad planes of the
individual regions with the fec id, with the sampa chip and the sampa channel to a file.
With these three pieces of information, the pad position is unambiguously determined, and with that
the overall mapping procedure is validated. This is again done with an external C-macro
using the mapping information from the O2 framework.
The reason why one has to run the simulation multiple times is that those three
numbers do not fit into a single 10 bit value. The fec id is a number between 0 and 19
[Block diagram: twenty parallel chains (20×), each consisting of frame gen. i, GBT dec. i and pre-sorter i for i = 0 … 19, all feeding the single row-segment merger.]
Figure 6.3: Setup of the row-segment merger simulation test bench. To fully qualify the
merger, data for all the twenty input ports must be generated. To do this in a convenient
manner, twenty frame generators are instantiated, together with twenty gbt decoders
and twenty pre-sorters. The whole test bench consists of 61 modules.
(5 bit), the sampa chip a number between 0 and 4 (3 bit) and the sampa channel a number
between 0 and 31 (5 bit). One could argue that at least the chip and channel numbers
would fit into a single 10 bit value, which would reduce the required number of simulation
runs for the pre-sorter. However, due to the additional fec id information needed to
validate the merger module, the simulation of the merger module must be done multiple
times. Having this in mind, it is not worthwhile to make the simulation more complex
than absolutely necessary, always with the danger of remaining undiscovered errors, just
to make it a bit more convenient.
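The bit budget behind this argument can be checked with a short Python calculation. It uses only the value ranges quoted above; the helper name is hypothetical.

```python
def bits_needed(max_value):
    """Number of bits needed to represent the integers 0..max_value."""
    return max(1, max_value.bit_length())


ADC_BITS = 10  # width of one ADC value in the data stream

widths = {
    "fec": bits_needed(19),      # FEC id 0..19
    "chip": bits_needed(4),      # SAMPA chip 0..4
    "channel": bits_needed(31),  # SAMPA channel 0..31
}

all_three = sum(widths.values())
chip_and_channel = widths["chip"] + widths["channel"]

print(widths, all_three, chip_and_channel)
```

The three numbers together need 13 bit and therefore exceed the 10 bit of one adc value, while chip and channel number alone would need 8 bit.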
6.1.3 Baseline Correction
The validation of the blc module is straightforward. An input stream of randomly
chosen values is generated on which all combinations of the three operations (the additive
pedestal subtraction and cm correction and the multiplicative gain correction) are applied.
The individual corrections are enabled and disabled one after the other so that all eight
combinations are tested, as can be seen in the config field of figure 6.4. The result of the
module is then compared to values which are calculated directly with the corresponding
operators in vhdl. With that, the module is validated. This is also part of the automatic
verification procedure.
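The idea of such a reference calculation is sketched below in Python instead of vhdl. The order of the operations (both subtractions first, then the multiplication) and the absence of any rounding or clipping are assumptions made for this sketch; the fixed-point details of the real module are not reproduced here.

```python
from itertools import product


def blc_reference(adc, pedestal, cm, gain, use_ped, use_cm, use_gain):
    """Reference value for one sample with the selected corrections enabled."""
    value = adc
    if use_ped:
        value -= pedestal   # additive pedestal subtraction
    if use_cm:
        value -= cm         # additive common mode correction
    if use_gain:
        value *= gain       # multiplicative gain correction
    return value


# All eight on/off combinations of the three corrections.
configs = list(product((False, True), repeat=3))

for use_ped, use_cm, use_gain in configs:
    print(use_ped, use_cm, use_gain,
          blc_reference(100, 10, 5, 2, use_ped, use_cm, use_gain))
```

Each line of the output corresponds to one setting of the config field, against which the module output for the same input sample would be compared.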
6.1.4 The individual Cluster Reconstruction Modules
Since the cfs and the Cluster Processors (cps) are the heart of the ul, great attention
was given to the simulation of these modules. To be sure that the reconstructed clusters
really correspond to the data, the testing procedure consists of several steps. First, the
modules are validated individually: a single cf instance is validated, it is checked that a
single cp calculates the cluster properties accurately, and it is checked that the fifo merger
is able to combine the input streams into a single output stream. In a second step, the cf
and cp modules are combined with increasing complexity.
It starts with one module of each to ensure that the protocol between them is working.
Then the output of two cfs is processed by one cp which is then increased to five cf
Figure 6.4: Simulation of the blc module with ModelSim. Each combination of the three
corrections is applied one after the other, verifying that all calculations are done properly.
instances, one more than in the final system to stress-test the cp. The next logical step
is to increase the number of cps to two (and with that the number of cfs to ten) which
makes it possible to place a fifo merger at the end. This setup is then further extended
to 23 cf instances, which corresponds to a complete pad row with 138 pads, five cps and a
single fifo merger. As a last step, the full clusteriser network is simulated, consisting of all
the 275 cfs, 69 cps and the two merging stages in the end, for all ten region configurations.
The Cluster Finder
The pattern which is fed to the cf instance contains several peaks to cover all possible
cases. There are peaks on both sides outside of the central region to check that those
peaks are ignored and not detected by this cf, they will be found by the neighbouring
one. There are isolated clusters and clusters which are nearby to validate that those are
correctly found as well. Also, a peak with a low adc value is added to the test data to
check that the corresponding threshold is taken into account to suppress small peaks. The
contribution threshold is tested as well by introducing some small numbers next to the
central peak in all directions. An additional feature which helps to verify the functionality
is that the data sample has an odd number of time bins. By using this sample twice,
shifted by one cycle due to the odd number of entries, it can be checked whether both of
the two time bins that are processed together in the cf are treated equally.
Also, the output interface is validated since it is used to check if each individual bin of the
found clusters is provided in the correct order by the cf. Finally, the content of the header
fields (see table 5.5) is checked as well. The configured row number must be transmitted
in the field H0. The correct pad number, including an offset which is calculated with the
configured cf number, must be contained in H1. The last header field H2 must contain
the correct time bin of the peak, taking also several Heart Beat (hb) reset triggers into
account. Since the additional flags are not yet used, it is ensured that they remain at zero.
In summary, every detail of the cf is checked during the automatic validation procedure.
The Cluster Processor
The validation of the cp must cover two aspects, firstly the correct calculation of the cluster
properties and secondly the correct handling of the input buses. The latter includes the
selection of the next, non-empty input after the current processing has been finished and
the recognition of the reset pattern with subsequent masking of the input bus.
The test for the first case is straightforward. Some predefined cluster patterns are
written to the input fifos of the cp and the result is then compared to pre-calculated
expected values. The row number which is part of the cluster pattern but not involved in
the actual calculations (it is just passed through the cp) is used in the end to identify the
cluster. This set of clusters is then just reused during the whole simulation. It is not an
issue that the number of tested clusters is strongly limited because the actual calculations
take place in the Digital Signal Processor (dsp) blocks of the fpga, which are supposed
to work. If this assumption were invalid, then this fpga would not be suitable for our
purpose. But the processed clusters are still important to validate the bit-shift operations
for the δxi multiplications (see e.g. figure 5.23a) and the correct accumulation.
To be as close as possible to the expected environment, the input fifos are not filled
evenly, but randomly with different priorities. This requires an internal mechanism to
prefer a fifo which is almost full for the next reading cycle. With this it is ensured that a
single fifo does not run full while the others are almost empty. In addition, several hb
reset cluster patterns are sent in-between to test the correct forwarding of the reset and
the generation of the reset cluster. All of this is part of the automatic testing.
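One conceivable selection rule of this kind is sketched below in Python. It is not the arbitration implemented in the cp: the almost-full mark, the choice of the fullest fifo among the urgent ones and the round-robin fallback are all assumptions made for illustration.

```python
def select_next_input(fill_levels, last_served, almost_full):
    """Return the index of the FIFO to read next, or None if all are empty.

    fill_levels: current number of entries in each input FIFO.
    last_served: index of the FIFO that was read in the previous cycle.
    almost_full: fill level from which a FIFO is treated as urgent.
    """
    n = len(fill_levels)
    # Prefer the fullest FIFO among those at or above the almost-full mark.
    urgent = [i for i in range(n) if fill_levels[i] >= almost_full]
    if urgent:
        return max(urgent, key=lambda i: fill_levels[i])
    # Otherwise take the next non-empty FIFO after the one served last.
    for step in range(1, n + 1):
        i = (last_served + step) % n
        if fill_levels[i] > 0:
            return i
    return None
```

With the fill levels [1, 0, 14, 3] and an almost-full mark of 12, input 2 is served first even though it is not next in the round-robin order, which is the behaviour that keeps a single fifo from running full.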
The FIFO Merger
The merger is a rather simple module. The automatic validation is done by feeding a
predefined pattern to all input ports. The output port is then checked for this pattern. In
addition, also the reset cluster detection and forwarding mechanism is tested.
6.1.5 The complete Clusteriser
Before the entire clusteriser is simulated, small subsets are built for a better overview. A
system with hundreds of modules involved, thousands of interconnections and even more
internal paths is quite challenging to debug. Therefore, smaller systems with fewer cfs and
cps are built first for a basic validation. The smallest system consists of one cf connected
to just one cp. This system is then enlarged in steps to two cfs and five cfs, each with
one cp to include the merging capabilities of the cp in the testing procedure. The next
step is then to increase the number of cps to two and finally to five — and with that the
number of cfs to 10 and 23, respectively — to cover a full pad row in the biggest system.
The latter two need also a fifo merger in the end for a synchronous readout of the cps.
Starting with smaller Systems
So far, only a basic validation of the individual modules is done: it was shown that the cf
is able to find all peaks in the central region and writes out the correct bins. It was also
shown that the cp calculates the properties of a cluster correctly. But more data is needed
to fully verify both modules. Therefore, test datasets with random clusters are generated.
These clusters are 2-dimensional Gaussian distributions with uniformly distributed centres
in pad and time direction within the valid regions of the individual systems. To have cluster
shapes similar to the ones expected in the real system, equation 5.6 and equation 5.7 are
used to calculate the widths in both directions. A random radius within the coverage of the
tpc is chosen which is used to select the correct pad sizes and to calculate the inclination
angle for σt. The additionally needed drift length is randomly chosen as well between 30
and 250 cm. The value for σp is expressed in units of pad widths and σt in units of time
bins with a length of 200 ns, corresponding to the sampling frequency of 5 MHz of the
tpc. The normalisation of the Gaussian function, which corresponds to the qmax value,
is randomly drawn from a Landau distribution. This parameterised function is sampled
a thousand times to fill a histogram to emulate the binning effect of the detector. This
histogram is then scaled to the desired qmax value and added to the dataset. For sufficient
statistics, 100 000 time bins are filled with six different occupancy levels of 1, 5, 10, 20, 30
and 40 %. The occupancy is defined as the number of bins with an adc value above zero
divided by the total number of bins of the generated dataset. The 100 000 time bins
correspond to ⌈100 000/447⌉ = 224 hb frames. In this way, the synchronisation is also
extensively tested within the simulation. Such datasets are generated individually for each
of the clusteriser systems, as the pad area covered is different and ranges from 6 pads for
the smallest system to 138 pads for the largest one. They contain between ∼300 clusters in
the dataset for one cf and an occupancy of 1 % and ∼300k clusters in the dataset for 23
cfs and an occupancy of 40 %. In total, summing over all systems and occupancies, more
than 1.7M clusters were generated of which more than 1.38M clusters could be found and
were reconstructed. This difference between the number of generated and found clusters
is not an issue because it originates from overlapping clusters. The 1.7M clusters are
generated with a random placement, so it is possible, and for the high occupancy case rather
likely, that they overlap. If the peaks of two clusters are placed at the same pad–time
coordinate, or next to each other, only one can be found instead of the two generated ones.
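The generation of one such cluster and the occupancy definition can be sketched in Python as follows. This is not the generator used for the datasets: the qmax value is passed in directly instead of being drawn from a Landau distribution, bin i is assumed to cover the interval [i, i+1), and the saturation at 1023 is an assumption based on the 10 bit adc range. All names are hypothetical.

```python
import math
import random


def make_cluster(centre_pad, centre_time, sigma_pad, sigma_time, qmax,
                 n_samples=1000, rng=random):
    """Return {(pad, time_bin): adc} for one binned 2D Gaussian cluster."""
    hist = {}
    for _ in range(n_samples):
        pad = math.floor(rng.gauss(centre_pad, sigma_pad))
        tb = math.floor(rng.gauss(centre_time, sigma_time))
        hist[(pad, tb)] = hist.get((pad, tb), 0) + 1
    peak = max(hist.values())
    # Scale the histogram so that its highest bin equals qmax.
    return {key: int(count * qmax / peak) for key, count in hist.items()}


def add_to_dataset(dataset, cluster, n_pads, n_time_bins):
    """Add a cluster to the dataset, dropping bins outside the valid area."""
    for (pad, tb), adc in cluster.items():
        if 0 <= pad < n_pads and 0 <= tb < n_time_bins:
            # Assumption: saturate at the 10 bit ADC maximum.
            dataset[(pad, tb)] = min(1023, dataset.get((pad, tb), 0) + adc)


def occupancy(dataset, n_pads, n_time_bins):
    """Bins with an ADC value above zero divided by the total number of bins."""
    filled = sum(1 for adc in dataset.values() if adc > 0)
    return filled / (n_pads * n_time_bins)


# Number of HB frames covered by 100 000 time bins of 447 bins each.
HB_FRAMES = math.ceil(100_000 / 447)
```

In a full generator, clusters would be added in a loop until the requested occupancy level is reached.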
The validation is done in two ways. First the resulting distributions of the cluster
properties qmax, qtot, pad and time information, as well as the corresponding widths are
compared to the ones of the generated clusters. Second, these distributions are compared
to the results of a software implementation of the cf, which will be discussed in section 6.3.
The same input files are also used for the software version. Those comparisons can be
seen in figure 6.5 for the system with 23 cfs and an occupancy of 1 % and in figure 6.6
for an occupancy of 30 %. In both occupancy cases, a cut on the qmax value of the cluster
of adc > 86 and on the contributing bins of adc > 5 is applied to test this mechanism
as well. The 23 cf case was chosen because the statistics is the highest compared to the
other systems, simply due to the higher pad coverage.
The black line is always the reference distribution of the randomly generated parameters.
The distributions obtained in the vhdl simulation are shown in red and the green lines are
the corresponding ones from the software implementation of the algorithm. Those two
are not only very similar, they are indeed identical. For each individual cluster it was
checked that the one found in software is bit-accurate with the one obtained from the
vhdl simulation. It is noteworthy that the algorithm, written in two completely different
languages (vhdl and modern C++) and executed in two completely different ways (with
ModelSim and with the O2 framework), leads to exactly the same result without a single
deviating bit. This means that for further studies of the cf
performance, one can rely on the software implementation which is running substantially
faster. For a comparison, those ModelSim simulations run between 30 min for the smallest
[Figure 6.5: six panels, each showing counts for the reference, the VHDL CF and the C++ CF at an occupancy of 1 %; panels (e) and (f) additionally show the rebinned reference.]
(a) Distribution of qmax of the clusters (x axis: ADC value, logarithmic count axis).
(b) Distribution of qtot of the clusters (x axis: ADC value, logarithmic count axis).
(c) Distribution of the cluster centres in pad direction (x axis: pad).
(d) Distribution of the cluster centres in time direction (x axis: time bin).
(e) Width of the clusters in pad direction (x axis: σp).
(f) Width of the clusters in time direction (x axis: σt).
Figure 6.5: Distributions of the cluster properties for the simulation system with 23 cf
instances, an occupancy of 1 % and a cut on qmax of adc > 86 and the contributions
threshold of adc > 5. The reference of the data input is shown, as well as the result of
the vhdl implementation in red and the software implementation in green.
[Figure 6.6: six panels, each showing counts for the reference, the VHDL CF and the C++ CF at an occupancy of 30 %; panels (e) and (f) additionally show the rebinned reference.]
(a) Distribution of qmax of the clusters (x axis: ADC value, logarithmic count axis).
(b) Distribution of qtot of the clusters (x axis: ADC value, logarithmic count axis).
(c) Distribution of the cluster centres in pad direction (x axis: pad).
(d) Distribution of the cluster centres in time direction (x axis: time bin).
(e) Width of the clusters in pad direction (x axis: σp).
(f) Width of the clusters in time direction (x axis: σt).
Figure 6.6: Same as in figure 6.5 but with an occupancy of 30 % instead of 1 %.
[Plot: ratio #rec. cluster / #gen. cluster (axis range 0.70 to 1.10) as a function of the occupancy (0 to 40 %), with two curves: w/o cut and w/ cut on ADC.]
Figure 6.7: The efficiency of the clusteriser as a function of the occupancy. The efficiency
is the number of reconstructed clusters divided by the number of generated clusters. The
orange curve shows the efficiency without the cut on the adc value, so the reduction is
purely due to the overlapping of two cluster peaks, while the blue one includes a cut on
the adc value of the peak in addition.
system and 2 h for the system with 23 cfs to process the 100 000 time bins of the dataset,
while the software version is done within a few seconds.
In general, all distributions of the cf results correspond very well to those of the
references. In the two plots with the adc value, the cf results follow very closely the
reference distribution, at least for the low occupancy case. For the high occupancy one, an
excess at higher adc values is clearly visible. This is even more pronounced in the qtot
distribution. However, this is an expected effect since at higher occupancies the probability
of overlapping clusters is also higher. This will lead to a higher qmax value because the
tail of a neighbouring cluster contributes also to the maximum value. Since in the current
implementation the qtot value is calculated by summing always over all the 25 bins of the
cluster (except for the exclusion via the contribution threshold), it is expected that the
effect is more visible in this distribution. In the plot of the qmax value, the cut at a low
adc value of 86 is clearly visible, which was enabled to validate also this mechanism. Since
small clusters are rejected, this also contributes to the reduced number of entries at low
adc values in the plot of the qtot values.
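The effect of a neighbouring tail on qtot can be illustrated with a simplified Python model of the charge calculation. It assumes a 5 × 5 window with the peak in the centre and applies the two thresholds quoted above in the simplest possible way (a bin contributes if its value is above the contribution threshold); the real contribution logic of the cf may be more selective.

```python
def cluster_charge(window, peak_threshold=86, contribution_threshold=5):
    """Return (qmax, qtot) for a 5x5 window, or None if the peak is cut.

    window: 5x5 nested list of ADC values with the peak at window[2][2].
    """
    qmax = window[2][2]
    if qmax <= peak_threshold:
        return None  # cluster rejected by the cut on the peak value
    qtot = sum(adc for row in window for adc in row
               if adc > contribution_threshold)
    return qmax, qtot


# Isolated cluster: the single bin with ADC 4 is below the contribution threshold.
window = [[0] * 5 for _ in range(5)]
window[2][2], window[2][1], window[2][3], window[1][2] = 100, 40, 40, 4
print(cluster_charge(window))   # (100, 180)

# The tail of a neighbouring cluster in the outer column raises qtot.
window[2][4] = 30
print(cluster_charge(window))   # (100, 210)
```

In this toy case the overlapping tail changes qtot but leaves qmax untouched, which matches the observation that the excess is more pronounced in the qtot distribution.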
The distributions of the cluster centre in pad and time direction in panels (c) and (d)
of figure 6.5 and figure 6.6 are nicely flat as they are supposed to be. The reduction in
entries, comparing the red and green lines with the black one, is due to two effects. The
first one strongly depends on the occupancy and is simply the overlapping of two or more
clusters. When two generated peaks end up in the same or a neighbouring bin of the
pad–time plane, then only one cluster can be found although two were generated. With an
increasing occupancy, the probability of such a positioning increases as well. The reduction
in efficiency due to this effect is shown by the orange curve in figure 6.7. The efficiency
is defined as the number of reconstructed clusters divided by the number of generated
ones. The number of clusters corresponds to the entries of the qmax distributions of the
respective datasets. It can be seen that for a very low occupancy the efficiency is close to
one, meaning that all generated clusters are also found. The second effect is the additional
cut on the qmax value to reject small clusters. By including this cut, the blue curve is
obtained which is overall lower than the orange one, due to the additional loss of clusters.
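The loss due to overlapping peaks and the efficiency definition can be illustrated with a toy model in Python. The merging rule used here (a peak is lost if an already accepted peak sits in the same or a directly neighbouring bin, diagonals included) is an assumption for this sketch and not the peak finding of the cf.

```python
def count_resolvable(peaks):
    """Toy model: count the peaks that remain distinguishable.

    peaks: list of (pad, time_bin) peak positions.
    """
    kept = []
    for pad, tb in peaks:
        # Keep the peak only if no accepted peak is in the same or a neighbouring bin.
        if all(max(abs(pad - p), abs(tb - t)) > 1 for p, t in kept):
            kept.append((pad, tb))
    return len(kept)


def efficiency(n_reconstructed, n_generated):
    """Number of reconstructed clusters divided by the number of generated ones."""
    if n_generated <= 0:
        raise ValueError("need at least one generated cluster")
    return n_reconstructed / n_generated


generated = [(10, 10), (10, 11), (20, 20), (20, 20), (30, 5)]
found = count_resolvable(generated)
print(found, efficiency(found, len(generated)))   # 3 0.6
```

Of the five generated peaks in this example only three can be resolved, because two pairs fall into the same or adjacent bins.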
The last set of plots in figure 6.5 and figure 6.6 shows the σx distributions in pad and
time direction. In the low occupancy case (figure 6.5e and figure 6.5f), the shape is quite
well reproduced but a shift towards higher values is visible. This shift can be explained by
the binning effect of the data generation. The same effect was observed by calculating the
standard deviation of the weighted mean directly from the individual cluster histograms
after the binning of the original cluster function, for both pad and time directions. The
resulting distributions are shown in blue, which display exactly this shift. So this effect
is expected. Comparing the blue curve to the red and green ones, only the reduction in
entries is visible, the shape is well reproduced. For the high occupancy case in figure 6.6,
there are many more entries for higher σ values, but also this is expected. Again, in a high
occupancy environment, there are many clusters close to others and they do even overlap.
Since there is no way to perfectly separate the corresponding distributions from each other
in the cru, the resolution must get worse, which is the visible effect.
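The shift caused by the binning can be reproduced with a small, deterministic Python calculation. Instead of sampling the function a thousand times, the sketch integrates a one-dimensional Gaussian over bins of unit width centred on integer positions (an assumption about the bin convention) and computes the standard deviation of the resulting histogram.

```python
import math


def binned_sigma(centre, sigma, half_range=10):
    """Standard deviation of a Gaussian after binning into unit-width bins."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - centre) / (sigma * math.sqrt(2.0))))

    first = math.floor(centre) - half_range
    last = math.floor(centre) + half_range
    # Bin b covers the interval [b - 0.5, b + 0.5).
    weights = {b: cdf(b + 0.5) - cdf(b - 0.5) for b in range(first, last + 1)}
    total = sum(weights.values())
    mean = sum(b * w for b, w in weights.items()) / total
    var = sum(w * (b - mean) ** 2 for b, w in weights.items()) / total
    return math.sqrt(var)


print(binned_sigma(0.0, 0.5))
```

For a true width of 0.5 bins the binned width comes out at roughly 0.57, close to the well-known estimate sqrt(σ² + 1/12) for bins of unit width, so a shift towards higher values is expected from the binning alone.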
To summarise, the clusteriser is behaving as expected. The individual distributions
are reconstructed very well and no border effects due to the segmentation of the pad
row are visible. Also, the calculation of the cluster properties works as expected. Since
an automatic classification of those distributions is not straightforward, some manual
intervention is still needed to qualify the result. This, together with the very long simulation
time, prevents the inclusion of this validation in the automatic testing procedure, so it
must be done manually.
Validation of the whole Clusteriser Network
After the basic concept of the individual cf instances is validated and the processing and
merging by the cps and fifo mergers proven to be working, the full clusteriser with all its
275 cf, the 69 cp and the two merging stages in the end can be simulated. The main purpose
of this additional step is to validate the optimisation of the cf grid as it was discussed
in subsection 5.3.6. It must be proven that the reuse of the individual cf instances for
different positions in different regions does not introduce errors in the mapping. Also, the
performance of the merging network needs to be tested to show that at reasonable and worst
case data rates, no back pressure is generated on the cf instances. For a proper simulation
of the latter one, the clusters need to be correlated. In the real detector, most of the clusters
originate from a track, crossing several (in the ideal case all) pad rows. So within a short
time interval, there are many clusters in different rows which arrive in the merging network
closely in time, leading to spikes in the required bandwidth. Since real data, taken with the
detector, is unavailable at the moment because the upgraded tpc does not yet exist, the
O2 framework is used to generate simulated input data for the validation procedure. In this
framework, the detector response is simulated. The charges, generated by tracks within
the active area of the detector, are transported to the pad plane where they are digitised.
This includes the simulation of the amplification of the gem stack and the electronics
response. This software simulation is tuned so that the signals are as close as possible to
the ones expected from the real detector. The so-called digits can then be extracted from
the simulation and used to feed the vhdl simulation. In addition, those digits are further
processed within the O2 framework to find the clusters and compute the properties also in
software. This is then used for a comparison with the vhdl clusteriser network response.
In total, around 12k time bins were generated in which 0.97M clusters were found. To
reduce the computational effort, only one sector of the tpc was taken into account, which
is the minimum that must be simulated to cover all ten different regions. The regions
are simulated separately with ModelSim to test all the ten different configurations of the
modules. The test bench instantiates the clusteriser module, takes care of the individual
configurations and provides the input data in the correct format. After the simulation
is done, the clusters are compared with the result of the software cf. Again, in almost
a million clusters, not a single deviating bit was found. Exactly the same clusters were
found and exactly the same properties were calculated in both the software version and
the vhdl implementation for the cru, even with rather challenging simulation parameters.
The occupancy distribution of the dataset is shown in figure 6.8c. The number of occupied
pads, meaning the pads with signal above zero, was counted for each row individually in
each time bin. Then by dividing by the number of pads in the specific row, the occupancy
was calculated and filled into the histogram. As can be seen, the occupancy reaches up
to 60 % although a maximum occupancy of only 30 % is expected for the tpc after the
upgrade. So this simulation can be considered as a kind of worst case environment. Figure
6.8b shows the number of clusters, found in each row–pad combination, integrated over
all simulated time bins. The individual regions are marked with a black box. It must be
noted that the pad width changes between the regions. This becomes particularly visible
when going from region 3 to 4 where the transition from iroc to Outer Readout Chamber
(oroc) takes place. A row then covers a larger area with fewer pads because the pad width
is larger. The pad numbers are centred around zero, which means that the pad located in
the middle of each row is assigned the pad number zero for a better symmetry visibility.
The main message from this plot is that each pad of each region was hit at least once
during the simulation. This means that, since all clusters from the ModelSim simulation
could uniquely be matched to a cluster in the software simulation — where this reusing of
cf instances is not needed and therefore not implemented — and vice versa, no error in
the mapping is introduced and therefore the clusteriser is completely validated.
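The occupancy values entering the histogram of figure 6.8c follow the per-row, per-time-bin definition given above. A minimal Python sketch of this counting is shown below; the digit format (row, pad, time bin, adc) and the function name are hypothetical and do not reflect the data structures of the O2 framework.

```python
def row_occupancies(digits, pads_per_row, n_time_bins):
    """Return the occupancy in percent for every (row, time bin) combination.

    digits: iterable of (row, pad, time_bin, adc) tuples.
    pads_per_row: dict mapping the row number to its number of pads.
    """
    occupied = {}
    for row, pad, tb, adc in digits:
        if adc > 0:
            # A pad counts once per row and time bin, even if listed twice.
            occupied.setdefault((row, tb), set()).add(pad)
    result = []
    for row, n_pads in pads_per_row.items():
        for tb in range(n_time_bins):
            n_occupied = len(occupied.get((row, tb), ()))
            result.append(100.0 * n_occupied / n_pads)
    return result


digits = [(0, 1, 0, 10), (0, 2, 0, 5), (0, 2, 0, 7), (1, 0, 1, 0), (1, 3, 1, 9)]
print(row_occupancies(digits, {0: 4, 1: 8}, 2))   # [50.0, 0.0, 0.0, 12.5]
```

Each returned value corresponds to one entry of the occupancy histogram.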
There are just two additional details to be mentioned. One is the high number of clusters
on the leftmost pad of each row in the inner regions. There is indeed a huge excess in
number of clusters in this leftmost pad. Compared to the other ones almost twice as many
clusters are found in the worst case. As can be seen in figure 6.8a, which shows the same
plot but for the digits used as input for the two clusterisers, there is an excess in the same
pads, too. The other detail to be mentioned is the existence of the two white spots in rows
125 and 126. There are indeed four consecutive empty pads in both rows. Here again,
the same white spots are visible at exactly the same positions in the input data. The
observation was communicated to the corresponding developers, who could not find out
the reason in time for the submission of this thesis. However, since the same spots are
reproduced by the clusteriser, the confidence in the correct functionality of the cf and cp
is even greater.
The last item to be checked is the behaviour of the fifo mergers under this condition.
The filling level of the four final fifos of region six (the innermost one with 1 600 pads) is
[Figure 6.8: three panels.]
(a) Number of input digits per pad and row (x axis: pad, y axis: row, colour scale: #digits).
(b) Number of found clusters per pad and row (x axis: pad, y axis: row, colour scale: #clusters).
(c) Simulated occupancy distribution of the individual rows (x axis: occupancy (%), logarithmic count axis).
Figure 6.8: Input and results of the full clusteriser simulation.
0 100 200 300 400 500
310×
clock cycle
100
200
300
400
500
u
se
d 
de
pt
h FIFO 0
FIFO 1
FIFO 2
FIFO 3
Figure 6.9: Filling level of the four final fifo mergers as a function of time during the
simulation. The fifos have a depth of 512 entries.
shown in figure 6.9. It can nicely be seen how all four fifo levels build up whenever there
was a new wave of tracks. Afterwards, they are emptied and filled again. Having a fifo in
such a position raises always the question about the possibility of a buffer overflow. If those
final fifos would run full, there are more (but smaller) fifos in the merging stage 1, which
can also buffer part of the load. If those run full as well, there are still the output fifos of
the cps and, in addition, the output fifos of the cfs. So there is a cascade of fifos which
are able to buffer a significant amount of clusters even in the worst case scenario. Only in
the very rare case when all fifos are full, clusters are thrown away in a controlled manner
to not disturb the processing chain.
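The buffering behaviour of such a cascade can be illustrated with a small toy model. The following Python sketch is not the vhdl implementation; the class names, the fifo depths and the rule that one entry moves per link and step are assumptions chosen only for illustration. It shows how back-pressure propagates upstream and how clusters are only dropped, and counted, once the most upstream fifo cannot accept them any more.

```python
from collections import deque


class BoundedFifo:
    """A FIFO with a fixed depth that reports whether a push succeeded."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def is_full(self):
        return len(self.items) >= self.depth

    def push(self, item):
        if self.is_full():
            return False
        self.items.append(item)
        return True

    def pop(self):
        return self.items.popleft() if self.items else None


class FifoCascade:
    """Chain of FIFOs; stages[0] is the most upstream, stages[-1] the final one."""

    def __init__(self, depths):
        self.stages = [BoundedFifo(d) for d in depths]
        self.dropped = 0  # clusters discarded in a controlled way

    def push(self, cluster):
        # A cluster is dropped only if the most upstream stage cannot take it.
        if not self.stages[0].push(cluster):
            self.dropped += 1

    def step(self, read_out=True):
        # Optionally read one entry from the final FIFO, then move one entry
        # per link downstream wherever the next stage has room.
        out = self.stages[-1].pop() if read_out else None
        for i in range(len(self.stages) - 1, 0, -1):
            src, dst = self.stages[i - 1], self.stages[i]
            if src.items and not dst.is_full():
                dst.push(src.pop())
        return out


# Hypothetical depths (upstream to final); the real final fifos hold 512 entries.
cascade = FifoCascade([2, 2, 4])
for cluster_id in range(100):
    cascade.push(cluster_id)
    cascade.step(read_out=False)  # the readout is stalled in this example
stored = sum(len(stage.items) for stage in cascade.stages)
```

With the readout stalled, the toy cascade absorbs as many clusters as the sum of all depths before the first one is lost, and the stored clusters keep their order when the readout resumes.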
This validation step must also be carried out by hand. First, the evaluation of the
results is not easily automated, and second, the simulation time is simply too long. Each of
the ten regions needs up to 10 h to simulate the processing of the 12k time bins (with the
software implementation, this is done for all regions combined within a few seconds). If
this were done after each commit, the build server would be busy with this simulation the
whole time and would not be available for its real purpose, which is the compilation of
the fw.
6.2 Validation during Test Beam Data Taking
The decoding modules of the ul could be tested in a real system during the time of
this thesis. There was a test beam campaign in May 2017 at the cern ps to test one
preproduction iroc and the new fecs. The ideal case would be to also use a cru
together with all the O2 machinery, meaning a First Level Processor (flp) and the
O2 framework for the reconstruction and processing, to read out the system. Then,
the overall test would be as close as possible to the final setting of the experiment.
Figure 6.10: Image of the c-rorc with its major components, taken from [61].
But there are some arguments speaking against it. First of all, the main purpose of
the campaign was to test the roc and not the subsequent readout chain. The fecs
are mandatory to be able to read out the chamber, but adding another piece of prototype
hardware, the cru, would introduce another source of error and could lead
to an overall failure of the operation, especially because the cru hardware was at
that time still far from the final version and just a prototype. Moreover, the fw
was not ready in time either: the common parts were missing and only a few modules of the
tpc specific ul existed. Therefore, another solution had to be developed, which was found
by reusing the Common Read-Out Receiver Card (c-rorc), an already well
established fpga based readout card currently in use in the Data Acquisition (daq)
and the High Level Trigger (hlt) systems of alice. This card has some disadvantages
compared to the cru, especially concerning the available bandwidth and the number of
possible links, but has the huge advantage that the hardware is proven to work.
6.2.1 The C-RORC as Readout Card
Since the cru was not available, the readout system for the test beam was built with a
c-rorc. This is a pcie card with a Xilinx Virtex-6 fpga as the central component. The
board offers a pcie generation 2 connection to the host machine with a bandwidth of up to
3.7 GB/s. The three Quad Small Form-factor Pluggable (qsfp) modules each provide four
high speed optical links with up to 6.6 Gbit/s per channel [61]. A photo of the board is
shown in figure 6.10. In collaboration with the original developer of this card, a fw for
the test beam was built, reusing the existing Direct Memory Access (dma) engine and the
available software tools. The fpga even has enough resources to implement twelve gbtx
cores, which makes it possible to use all available optical connections.
With that, six fecs can be controlled and read out by a single c-rorc. Unfortunately, it is
not possible to add more than one c-rorc to the readout system (and with that not more
than six fecs). All the fecs need to receive a synchronous clock, which must be provided
by the c-rorc. Having more than one c-rorc in the system would require an external clock
source synchronising the readout cards. But the transceivers of the fpga can only use
internally generated clocks and not externally provided ones. This is why multiple c-rorcs
would not be synchronous.
The input data rate of all twelve links sums up to 12×112 bit×40 MHz = 53.760 Gbit/s,
which is already almost twice as high as the available pcie bandwidth. Since no Zero
Suppression (zs) can be applied to the data (again because of the cm effect), a continuous
readout is not possible and a triggered one must be implemented in which the data is
transferred only when particles have crossed the chamber. A triggered readout is in any case
better suited for the environment of the ps test beam, since the interaction rate is rather
low at only a few hundred Hz.
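The bandwidth argument can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers quoted above (twelve links, 112 bit per frame, 40 MHz, 3.7 GB/s). The function names and the number of frames per trigger in the last function are hypothetical and only serve as an illustration; headers are neglected.

```python
LINKS = 12
BITS_PER_FRAME = 112       # bits per gbt frame, as used in the text
FRAME_RATE_HZ = 40e6       # one frame per 40 MHz clock cycle
PCIE_BYTES_PER_S = 3.7e9   # quoted upper limit of the pcie connection


def input_rate_bits_per_s():
    """Continuous input data rate of all links combined."""
    return LINKS * BITS_PER_FRAME * FRAME_RATE_HZ


def oversubscription():
    """Ratio of the continuous input rate to the available pcie bandwidth."""
    return input_rate_bits_per_s() / (PCIE_BYTES_PER_S * 8)


def max_trigger_rate_hz(frames_per_trigger):
    """Highest trigger rate the pcie link could sustain (headers neglected).

    frames_per_trigger is a hypothetical configuration value.
    """
    bytes_per_trigger = LINKS * frames_per_trigger * BITS_PER_FRAME / 8
    return PCIE_BYTES_PER_S / bytes_per_trigger
```

For example, `oversubscription()` evaluates to about 1.8, which is the "almost twice as high" quoted above, while an assumed gate of 500 frames per trigger would leave room for a trigger rate far above a few hundred Hz.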
6.2.2 The Triggered Readout Mode
The future fec of the tpc is designed for a continuous readout, not a triggered one.
Therefore, the fec sends a continuous data stream in this case as well, and the triggering
must be applied later in the c-rorc. Here, a readout gate is opened upon arrival of the
trigger signal, which allows a configurable number of gbt frames to be sent to the host
machine. A header containing a time stamp is prepended to each chunk of data. With this
time stamp and the synchronisation pattern, which must be contained in the very first chunk
of data, the gbt frames can be decoded in software even though large portions of the
continuous stream are missing. By correlating the time stamp of the first trigger sample with
the frame number in which the synchronisation pattern is found, it can be calculated in which
subsequent frame numbers the first channel of the sampa readout is located. With this, it
is possible to decode the data in the triggered readout mode as well.
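The calculation described above is essentially modular arithmetic on frame numbers. The following Python sketch illustrates the idea under explicit assumptions: time stamps are taken to count gbt frames, and the first sampa channel is assumed to recur with a fixed period of `period` frames after the synchronisation pattern. Both the period and the function names are hypothetical and not taken from the actual decoder.

```python
def phase_in_cycle(frame_number, sync_frame, period):
    """Position of a frame within the repeating readout cycle.

    sync_frame is the frame in which the synchronisation pattern was found;
    period is the (assumed) number of frames after which the first sampa
    channel appears again.
    """
    return (frame_number - sync_frame) % period


def first_channel_frames(chunk_start, chunk_length, sync_frame, period):
    """Frame numbers inside a triggered chunk that carry the first channel."""
    offset = (-phase_in_cycle(chunk_start, sync_frame, period)) % period
    return list(range(chunk_start + offset, chunk_start + chunk_length, period))


# Hypothetical numbers: pattern found in frame 3, cycle length of 8 frames,
# and a later chunk of 20 frames whose time stamp corresponds to frame 1000.
frames = first_channel_frames(chunk_start=1000, chunk_length=20,
                              sync_frame=3, period=8)
```

Because only the difference to the synchronisation frame enters, the gaps between the triggered chunks do not matter as long as the frame counter keeps running.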
6.2.3 Validation of the GBT Decoder
Three different readout modes were implemented for the test beam. In the first one, only
raw gbt frames are read out. The second one provides the already decoded adc values,
together with the id generated by the gbt decoder. In the third mode, the two modes
are combined and raw frames together with the decoded values are read out. The last one
is used for the validation of the gbt decoder module. During the test beam campaign,
eight runs were taken in this mode with more than 120k events. This corresponds to more
than 4.4 billion individual adc values which could be compared. The spectrum of those
values is shown in figure 6.11. The green line shows the adc values which were decoded
in software afterwards, while the red one corresponds to the adc values already decoded
in the c-rorc. The two histograms are identical, which can also be seen from the ratio
below, which is equal to unity over the entire range of adc values from 0 to 1023. There
are some entries at zero and an increase at an adc count of ∼500 which belong to dead
channels of the sampa chips. The origin of the other visible structures is also understood:
they are caused by non-uniformities of the sampa adcs, which is why even and odd numbers
did not occur with the same probability. This issue was fixed in a later version of the
sampa chip. The equality of the two histograms proves again that the gbt decoder is
working, also in a real fpga.
Figure 6.11: Spectrum of the adc values recorded at the test beam. The green line
shows the values from the raw gbt frames which were decoded in software and the red line
those which were already decoded in the c-rorc. The orange line below shows the ratio of the
two histograms, which is equal to unity over the entire range.
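The comparison of the two spectra amounts to a bin-by-bin check of two histograms. The following Python sketch shows one way to do this. It is not the analysis code used for figure 6.11; the function names are hypothetical and the 1024 bins follow from the adc range of 0 to 1023 quoted above.

```python
def adc_histogram(values, n_bins=1024):
    """Count how often each adc value (0 .. n_bins - 1) occurs."""
    hist = [0] * n_bins
    for value in values:
        if not 0 <= value < n_bins:
            raise ValueError(f"adc value {value} out of range")
        hist[value] += 1
    return hist


def bin_ratios(numerator, denominator):
    """Ratio per bin; None where both bins are empty, inf if only the
    denominator bin is empty."""
    ratios = []
    for n, d in zip(numerator, denominator):
        if d == 0:
            ratios.append(None if n == 0 else float("inf"))
        else:
            ratios.append(n / d)
    return ratios


def spectra_agree(raw_values, predecoded_values):
    """True if both decoders produced exactly the same adc spectrum."""
    return adc_histogram(raw_values) == adc_histogram(predecoded_values)
```

Note that identical spectra are a necessary but not a sufficient condition for identical data, since the order of the values and their channel assignment are not tested by a histogram.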
6.3 Cluster Reconstruction in Software
The same cluster reconstruction algorithm, or at least one which is as similar as possible, was
also implemented in modern C++ within the alice O2 (Online-Offline) software framework.
In this framework, all functionalities needed for the alice experiment are combined. This
includes software for the readout of the detectors, the event building, the recording of the
data, calibration tasks, reconstruction of the data as well as for the physics simulation and
analysis [47]. Since the purpose of the simulation is to generate data that is as similar as
possible to the real recorded data, the same reconstruction steps also have to be applied.
For the tpc, the cluster reconstruction is done already in the cru during the normal
data taking mode. So there must be a software version of the clusteriser available for the
processing of the simulated data which provides exactly the same results. Therefore,
the same two-step approach was implemented: first going through the data and finding the
peaks according to the definition in figure 5.15, and in a second step calculating the same
properties as written in equation 5.26. In order to guarantee that the calculations deliver
the same results, the Fixed-Point (fp) arithmetic is also used in software.
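Why fixed-point arithmetic helps to obtain identical results can be illustrated with a small example. The Python sketch below computes a charge-weighted mean position using integer operations only, so that every platform produces the same bits. It is a generic illustration and not the set of cluster properties of equation 5.26; the number of fractional bits and the function names are assumptions.

```python
FRAC_BITS = 4  # hypothetical number of fractional bits


def fixed_weighted_mean(charges, positions):
    """Charge-weighted mean position as a fixed-point integer.

    All inputs are integers; the result carries FRAC_BITS fractional bits
    and is truncated (not rounded), as an integer divider would do.
    """
    q_tot = sum(charges)
    if q_tot == 0:
        raise ValueError("total charge must be non-zero")
    weighted = sum(q * p for q, p in zip(charges, positions))
    return (weighted << FRAC_BITS) // q_tot


def fixed_to_float(value):
    """Convert a fixed-point result to a float for display only."""
    return value / (1 << FRAC_BITS)


# Three pads with adc charges 10, 30, 20 at pad positions 4, 5, 6.
mean_fp = fixed_weighted_mean([10, 30, 20], [4, 5, 6])
```

Here `mean_fp` is 82, which corresponds to 5.125, whereas a floating-point calculation would give 5.1666... The loss of precision is the price for a result that is reproducible bit by bit in hardware and software.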
However, there are two basic differences between the C++ version and the one written
in vhdl, owing to the slightly different applications and to the characteristics of the
respective languages. First, the separation into small cf instances is not implemented in
software. This was introduced for the fpga version to reduce the fan-out of the individual
paths. Because of the overlap needed between the individual instances, this leads to a
higher computing effort which cannot be justified for the software version. Here, besides
the equality of the result, the required computing time is also an important factor. So, in
contrast to the vhdl version, no borders are introduced within a row. Second, due to
the implemented parallelism scheme of the O2 framework, each sector of the tpc is treated
by a different computing process. So it makes sense to handle a whole sector, meaning
all the 152 rows of the ten regions combined, with one clusteriser. This reduces the overhead
of the data handling and makes the execution more straightforward. Since there
are no individual small cf instances which are even reused at other locations in different
regions, as is done in the vhdl version, the mapping is much easier to implement and
to debug.
Figure 6.12: The Callgrind [62] output of the software cf, visualised with the tool
KCachegrind [63]. The blue rectangles show the share of computing time which was used
by the cf. As can be seen, although they are the largest contiguous blocks, they use only
a small fraction of about 20 % of the overall time.
The overall correctness of the implementation is crosschecked in subsections 6.1.4 and
6.1.5 with the results of the vhdl implementation. The assumption is that if both
implementations give the same result, although they are written in completely different
languages and run in completely different ways, then the algorithm must have been
implemented correctly in both cases. In addition, there are small test cases written
with known clusters where the output is automatically checked. Also, a few checks were
done with a tracking algorithm working with the clusters from the software implementation.
However, extensive tests which prove the physics performance of the clusteriser are not yet
completed. They are planned for the future, when the framework has evolved far enough
that this can be done in a comprehensive way. But the algorithm itself is well justified
and, as can be seen in the corresponding sections, implemented correctly.
Since the software clusteriser is used, besides the validation of the vhdl modules, mainly
for the reconstruction of the simulation data, the performance is quite an important factor,
together with the memory footprint. To reduce the memory footprint, the locally buffered
data which is needed for the cluster finder is reduced to a minimum. Similar to the vhdl
version, only six time bins are needed at the same time in the local buffer: five which
contain the full cluster and a sixth one so that the comparison operations of the individual
pads for the peak finding are finished also for the last of the five time bins. The eight
comparisons needed for each pad are not done at the same time: four are done when the current
pad is the centre one of the algorithm, the other four during the processing of the
next time bin, when the corresponding other pads are the centre ones. In this way, the
data of only six time bins needs to be expanded. Expanding means that, since the input data
to the clusteriser contains just the signals, noise and the baseline are added within the
clusteriser to the signal pads and to the empty pads. In this way, a lot of disk space can be
saved by storing only the real signals and not the noise. To optimise the execution time, the
tool Callgrind of the Valgrind framework [62] was used throughout the development and
debugging process to profile the code. With this tool, the individual function calls together
with a counter of how often the functions were called and the corresponding execution time can
be visualised. This makes it possible to find the best place for optimisations and helps to
remove unnecessary calls.
Such a visualisation of the function calls is shown in figure 6.12. As can be seen, there
are many small contributions to the overall execution time. The two big blocks, which are
shown in blue, correspond to the clusteriser. The two main functions are the peak finder
and a function to calculate the cluster properties, contributing 6.74 % and 8.09 % to the
total time. Adding all parts of the clusteriser, one finds that only 19.95 % of the total time
is used by the cluster finder and processing algorithm. Everything else is needed for the
surrounding framework, which also takes care of reading the input data from and storing the
results to the disk.
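The idea of evaluating every neighbour relation only once, as described above for the eight comparisons per pad, can be sketched as follows. This Python example is a simplified illustration and not the definition of figure 5.15: the choice of the four "forward" neighbours, the tie-breaking rule (the earlier cell wins on equality) and the requirement of a non-zero charge are assumptions made for this sketch.

```python
# Each cell is compared only with its four "forward" neighbours (dt, dpad);
# the four "backward" relations are obtained when those neighbours are the
# centre, so every pair of cells is compared exactly once.
FORWARD = [(0, 1), (1, -1), (1, 0), (1, 1)]


def find_peaks(grid):
    """Return (time bin, pad) of all local maxima in grid[time][pad]."""
    n_time, n_pad = len(grid), len(grid[0])
    # Assumption: only cells with a non-zero charge can be peaks.
    is_peak = [[value > 0 for value in row] for row in grid]
    for t in range(n_time):
        for p in range(n_pad):
            for dt, dp in FORWARD:
                tt, pp = t + dt, p + dp
                if 0 <= tt < n_time and 0 <= pp < n_pad:
                    if grid[t][p] >= grid[tt][pp]:
                        is_peak[tt][pp] = False  # neighbour loses (or ties)
                    else:
                        is_peak[t][p] = False    # centre loses
    return [(t, p) for t in range(n_time) for p in range(n_pad)
            if is_peak[t][p]]


example = [
    [0, 0, 0, 0, 0],
    [0, 2, 5, 2, 0],
    [0, 1, 3, 1, 0],
    [0, 0, 0, 7, 7],
]
peaks = find_peaks(example)
```

Since a cell only needs its own time bin and the following one to settle all eight relations, a rolling buffer of a few time bins is sufficient, which is the motivation for the small local buffer mentioned above.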
To achieve this execution time, also Single Instruction, Multiple Data (simd) vectorisation
was used. This is another kind of parallelism where the same instruction is performed on
multiple data points simultaneously. Most of the modern cpus provide simd instructions
to improve the performance. This technique is applicable in the case of the clusteriser
because always the same instructions need to be applied. For example, the computation of
the relation between all neighbouring pads is always the same. So by combining the data
of e.g. multiple rows within a simd vector, those rows can be processed simultaneously,
improving the processing time significantly.

Chapter 7
Conclusion and Outline
This thesis was accomplished in the scope of the alice tpc readout upgrade towards
a continuous readout with a total data rate of 3.7 TB/s. The upgrade is necessary in
preparation for the lhc Run 3, starting in 2021, where an interaction rate of 50 kHz in PbPb
collisions is expected. The possible application of a Huffman encoded differential detector
readout was studied. It was shown that with the length-limited Huffman encoding scheme,
which is well suited for an implementation in a Front-End Device (fed), a sufficiently high
compression factor is achieved. The compression factor is better than 2.5 over a large
detector occupancy range of up to 40 % but degrades as soon as the noise contribution
in the signal increases. With such a behaviour, no reliable detector operation is possible.
This finding contributed to the review and a modification of the readout scheme presented
in the original tpc upgrade Technical Design Report (tdr) towards an uncompressed raw
data readout.
The main topic of this thesis was the development and implementation of the online
Cluster Finder (cf) in vhdl. The same algorithm has also been implemented bit-accurate
in modern C++. This additional implementation in the alice O2 framework is required for
verification purposes of the algorithm itself since it provides several orders of magnitude
faster processing times compared to the vhdl simulation. It will also be used for the general
reconstruction of simulated alice data. This 2-dimensional hardware cf is an essential
preprocessing step in the future readout chain to enable an efficient physics data taking for
the tpc. It runs on the fpgas of the Common Readout Units (crus) and will inspect the
whole data volume of the tpc in real-time already during the readout. Furthermore, all the
necessary data preparation steps which need to be performed beforehand were also designed
and implemented. This includes the decoding of the input data, a configurable sorting
algorithm to map the up to 1 600 individual input channels of a cru to the corresponding
positions on a 2-dimensional grid in a flexible way and the Baseline Correction (blc). This
blc includes the pedestal subtraction, the correction of the Common Mode (cm) effect as
well as a multiplicative gain correction for each pad separately.
Each single module was simulated in detail to verify the proper functionality. Since the
developments for such a central part of the readout chain are continuously being advanced,
an automatic verification procedure was implemented for all individual modules in order
to guarantee a consistent interface and behaviour also in future developments. Great
emphasis was placed on the validation of the cf. Several differently sized simulations were
run to ensure the correct behaviour of all involved modules. The comparisons with the
results of the software version did not show a single deviating bit in more than 2 million
matched clusters.
In addition, the decoding of the input data and with that the basis of all following
processing steps was validated in a real readout system during the test beam campaign
in May 2017. It was also shown that the resources available in the fpga of the cru are
sufficient to realise all components. Thus, the entire preprocessing chain in the Firmware
(fw) for the tpc related crus was implemented and the detector could successfully be
read out. The implementation was achieved well ahead of schedule as the first readout
with a trigger on cosmic particles is anticipated for autumn 2019, when the first part of
the tpc is upgraded.
The goal of this thesis was to implement the cf for the tpc in the cru and to prove that
the fpga is sufficiently large for the required digital logic. This has been achieved, although
a few details could still be improved. For example, charge splitting could be implemented
in the cf at some point. It was seen that, especially in the high occupancy case, the
qtot distribution deviates from the ideal one due to overlapping or nearby clusters. By
separating the charges along the minima between the charge distributions, this deviation
can be reduced. The flags of the minima needed for the splitting are already implemented
but not yet used. Another possible improvement in the cf concerns the peak finding
itself. The current implementation compares the actual adc values of neighbouring pads to
look for rising and falling edges. This method is susceptible to fluctuations and noise. A
different approach would be to compare the difference of these values to a threshold, so that
a rising or falling edge is only found when the difference exceeds the threshold. This could
suppress small fluctuations as expected from noise and reduce the number of clusters found
incorrectly, improving the overall compression factor.
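The effect of such a threshold can be illustrated in one dimension. The following Python sketch compares a plain comparison of neighbouring adc values with a thresholded one. It is only an illustration of the proposed idea; the function names, the example values and the threshold of 2 adc counts are hypothetical.

```python
def edges_simple(values):
    """+1 for a rising, -1 for a falling edge between neighbouring pads."""
    return [(b > a) - (b < a) for a, b in zip(values, values[1:])]


def edges_thresholded(values, threshold):
    """Edges are only reported if the difference exceeds the threshold."""
    edges = []
    for a, b in zip(values, values[1:]):
        diff = b - a
        edges.append(1 if diff > threshold else -1 if diff < -threshold else 0)
    return edges


def count_peaks(edges):
    """A peak is a rising edge followed (possibly after flat steps) by a
    falling edge."""
    peaks, rising = 0, False
    for edge in edges:
        if edge == 1:
            rising = True
        elif edge == -1:
            if rising:
                peaks += 1
            rising = False
    return peaks


# One real signal around 40 adc counts on top of small fluctuations.
row = [3, 4, 3, 5, 20, 40, 22, 4, 5, 3]
n_simple = count_peaks(edges_simple(row))
n_thresholded = count_peaks(edges_thresholded(row, threshold=2))
```

In this example, the plain comparison finds three peaks, two of which are caused by fluctuations of a single adc count, while the thresholded version finds only the real one.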
Appendix A
Acronyms
acorde Alice COsmic Ray DEtector
adc Analog-to-Digital Converter
alice A Large Ion Collider Experiment
alm Adaptive Logic Module
asic Application-Specific Integrated Circuit
bc Bunch Crossing
blc Baseline Correction
cc Clock Cycle
cdc Clock-Domain Crossing
cdf Cluster Data Format
cern Conseil Européen pour la Recherche Nucléaire
cf Cluster Finder
cm Common Mode
cp Cluster Processor
cr Counting Room
cru Common Readout Unit
csa Charge Sensitive Amplifier
ctp Central Trigger Processor
daq Data Acquisition
das Direct ADC Serialisation
dcal Di-jet CALorimeter
dcs Detector Control System
dma Direct Memory Access
dsp Digital Signal Processor
emcal ElectroMagnetic CALorimeter
eop End-Of-Packet
epn Event Processing Node
fec Front-End Card
fed Front-End Device
fee Front-End Electronics
fifo First In, First Out
flp First Level Processor
fmd Forward Multiplicity Detector
fp Fixed-Point
fpga Field Programmable Gate Array
fsm Finite State Machine
fw Firmware
gbt GigaBit Transceiver
gem Gas Electron Multiplier
hb Heart Beat
hlt High Level Trigger
hmpid High Momentum Particle IDentification
hw Half-Word
ibf Ion Back-Flow
ip Intellectual Property
iroc Inner Readout Chamber
its Inner Tracking System
le Logic Element
leir Low Energy Ion Ring
lep Large Electron Positron
lhc Large Hadron Collider
ls Long Shutdown
lsb Least Significant Bit
ltu Local Trigger Unit
lut Lookup-Table
m20k 20 kbit Memory Block
maps Monolithic Active Pixel Sensors
mft Muon Forward Tracker
mlab Memory Logic Array Block
mrpc Multi-gap Resistive-Plate Chamber
msb Most Significant Bit
mux Multiplexer
mwpc Multi-Wire Proportional Chamber
oroc Outer Readout Chamber
pcb Printed Circuit Board
phos PHOton Spectrometer
pid Particle Identification
pll Phase-Locked Loop
pmd Photon Multiplicity Detector
ps Proton Synchrotron
psb Proton Synchrotron Booster
qa Quality Assurance
qgp Quark–Gluon Plasma
ram Random-Access Memory
rcc Ring Cathode Chamber
rdh Raw Data Header
roc Readout Chamber
sc Slow-Control
gbt-sca gbt-Slow Control Adapter asic
sdd Silicon Drift Detector
simd Single Instruction, Multiple Data
sm Super-Module
sop Start-Of-Packet
spd Silicon Pixel Detector
sps Super Proton Synchrotron
ssd Silicon Strip Detector
tdr Technical Design Report
tof Time-Of-Flight
tpc Time-Projection Chamber
tr Transition Radiation
trd Transition Radiation Detector
ttc Timing, Trigger and Control
tts Trigger, Timing and clock distribution System
ul User Logic
uut Unit Under Test
vhdl Very high speed integrated circuit Hardware Description Language
zdc Zero Degree Calorimeter
zs Zero Suppression

Appendix B
Pad Plane Mapping
The mapping of the individual pads to the sampa chips within a Front-End Card (fec)
is shown on the following pages for all rocs. One tpc sector is composed of an Inner
Readout Chamber (iroc) and the three Outer Readout Chambers (orocs). Figure B.1
shows the mapping for the iroc, figure B.2 for the oroc 1, figure B.3 for the oroc 2 and
figure B.4 for the oroc 3. Each pad displays the id of the sampa within a fec to which it
is connected.
Please note the horizontal straight blue lines in the mappings. Although the blue
boxes indicate in general just which pads are grouped together within a single connector
between the pad plane and the fec (each fec is connected via 4 connectors to the pad
plane), the straight ones also mark the different readout regions. Each of the 10 readout
regions is transmitted to and processed by a single Common Readout Unit (cru). With
that, it is ensured that complete pad rows are always transmitted to the same cru, which
is important for the cluster finding. The straight lines can be grouped into two categories:
1. • between rows∗ 31/32 for figure B.1
• between rows 62/63 for figure B.1 and figure B.2
• between rows 96/97 for figure B.2 and figure B.3
• between rows 126/127 for figure B.3 and figure B.4
2. • between rows 16/17 and 47/48 for figure B.1
• between rows 80/81 for figure B.2
• between rows 112/113 for figure B.3
• between rows 139/140 for figure B.4
The first category has a change in the sampa number when crossing the line. This indicates
a transition to a new fec and is of no further importance. The second category has the
sampa id 2 on both sides of the line and marks the transition to a different cru (region)
within a single sampa chip. Here, the pads are connected to sampa 2 in such a way that
the pads below are always sampa channels 0–15 and above are always sampa channels
16–31. In this way, even if half of the data of sampa 2 is shipped to one cru and the other
half to a different one, complete pad rows can always be restored.
∗The row numbers are always written on the right side, next to the pads.
region  partition  fecs  rows  max. pads
0       0          15    17    76
1       0          15    15    84
2       1          18    16    94
3       1          18    15    100
4       2          18    18    84
5       2          18    16    94
6       3          20    16    106
7       3          20    14    118
8       4          20    13    128
9       4          20    12    138
Table B.1: Key parameters of the pad plane regions.
The number of fecs in the individual readout regions can be read off the mappings and varies
from 15 to 20. It is summarised together with further key parameters of the individual regions
in table B.1. The partition is the subdivision of a whole sector in radial direction into
units of fecs. Therefore, two regions always belong to one partition. The number of fecs
is counted in pad direction and means the number of neighbouring fecs belonging to the same
partition. The number of rows within each region is also given, which is important
for the number of individual Cluster Finder (cf) instances since the cluster finding is done
in each row individually. This number ranges from 12 to 18, with the lowest number in the
outermost region and the highest number in region four. The last column shows the
highest number of pads which can be found within a single row of this region.
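The row counts of table B.1 can be turned into the row range of each region with a simple running sum. The following Python sketch does this under the assumption that the rows are numbered consecutively from 0 across the whole sector; the function names are hypothetical. The resulting region boundaries can be compared with the straight lines listed above.

```python
# Number of rows per region, taken from table B.1 (regions 0 to 9).
ROWS_PER_REGION = [17, 15, 16, 15, 18, 16, 16, 14, 13, 12]


def region_row_ranges(rows_per_region):
    """First and last row of each region, assuming consecutive numbering
    starting at row 0."""
    ranges, first = [], 0
    for n_rows in rows_per_region:
        ranges.append((first, first + n_rows - 1))
        first += n_rows
    return ranges


def region_of_row(row, rows_per_region):
    """Index of the region (and thereby the cru) which processes a row."""
    for region, (first, last) in enumerate(region_row_ranges(rows_per_region)):
        if first <= row <= last:
            return region
    raise ValueError(f"row {row} is outside of the sector")


ranges = region_row_ranges(ROWS_PER_REGION)
# Last row below each boundary between two neighbouring regions.
boundaries = [last for _, last in ranges[:-1]]
```

The sum of all entries is 152, which is the total number of rows of one sector.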
Figure B.1: The pad plane of the iroc. Each pad shows the id of the sampa within a
fec to which it is connected, taken from [34].
Figure B.2: The pad plane of the oroc 1. Each pad shows the id of the sampa within
a fec to which it is connected, taken from [34].
Figure B.3: The pad plane of the oroc 2. Each pad shows the id of the sampa within
a fec to which it is connected, taken from [34].
Figure B.4: The pad plane of the oroc 3. Each pad shows the id of the sampa within
a fec to which it is connected, taken from [34].
Appendix C
The Raw Data Header
Every data package needs to be preceded by a Raw Data Header (rdh) in order to identify and
efficiently process the data in the First Level Processor (flp). The rdh consists of
512 bit in total and contains all relevant information. Depending on where the header is
generated, either in the Front-End Electronics (fee) and sent via the gbt system, in the fee
and sent via the ddl system, or in the User Logic (ul) of the cru, the individual words of the
rdh are defined differently in length and numbering. Once the header is stored in the
memory of the flp, all variants have the same size and format. The version which is important
for the tpc readout is the one generated in the ul of the cru. It is shown in figure C.1.
The rdh consists of four 128 bit words. The red fields need to be filled by the ul, while
the remaining ones are filled by the interface module to the Direct Memory Access (dma)
engine and must be set to zero by the ul. Since the latter are not of interest for the tpc,
only the definitions of the red fields are given in the following. They are taken from [64].
Priority bit
Field to adjust the priority of the following data. If this is set to 0x1, the packet is
propagated with a higher priority than the others. This can be used e.g. to report a
Heart Beat (hb) frame when the buffers are full.
FEE ID
A unique id, assigned to the fee.
Block length
The size of the payload (without the rdh) in bytes.
Header size
The size of the rdh in bytes (4× 128 bit = 64 B).
Header version
Version number of the header.
HB orbit
Heart Beat orbit.
TRG orbit
Trigger orbit.
TRG type
The current trigger type, set by the Central Trigger Processor (ctp).
HB BC
Heart Beat Bunch Crossing.
TRG BC
Trigger Bunch Crossing.
Pages counter
If the data volume of one trigger (or hb) exceeds the maximum of 8 kB, this counter
is used to keep track of the pages belonging to the same trigger.
Stop bit
Bit to identify the last page (set to 0x1, otherwise stays at 0x0) of a series of pages
belonging to the same trigger.
PAR
The Pause-and-Reconfigure field is used by the detector to trigger a synchronous
reconfiguration of the fee.
Detector field
A field for detector specific content.
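The interplay of the block length, the pages counter and the stop bit can be illustrated with a small example. The Python sketch below splits the payload of one trigger (or hb) into pages. It is not the ul implementation: it assumes that the limit of 8 kB means 8192 bytes including the 64 B rdh and that the pages counter starts at zero. Neither assumption is confirmed by the definitions above, and the function and key names are hypothetical.

```python
RDH_SIZE_BYTES = 64        # 4 x 128 bit, see "Header size"
MAX_PAGE_BYTES = 8 * 1024  # assumption: 8 kB = 8192 B, rdh included


def paginate(payload_bytes, max_page=MAX_PAGE_BYTES, header=RDH_SIZE_BYTES):
    """Split the payload of one trigger into pages and return the values of
    the rdh fields which change from page to page."""
    max_payload = max_page - header
    pages, remaining, counter = [], payload_bytes, 0
    while True:
        block = min(remaining, max_payload)
        remaining -= block
        pages.append({
            "header_size": header,
            "block_length": block,        # payload without the rdh
            "pages_counter": counter,     # assumption: counting from zero
            "stop_bit": 1 if remaining == 0 else 0,
        })
        counter += 1
        if remaining == 0:
            return pages


pages = paginate(20000)
```

In this example, a payload of 20 000 bytes is distributed over three pages, and only the last one carries the stop bit.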
128
64
80
96
10
4
12
7
re
se
rv
ed
(2
4b
it)
lin
k
id
(8
bi
t)
m
em
or
y
Si
ze
(1
6b
it)
off
se
t
ne
xt
pa
ck
et
(1
6b
it)
wo
rd
0
0
8
16
32
48
56
63
re
se
rv
ed
(8
bi
t)
pr
io
rit
y
bi
t
(8
bi
t)
fe
e
id
(1
6b
it)
bl
oc
k
le
ng
th
(1
6b
it)
he
ad
er
siz
e
(8
bi
t)
he
ad
er
ve
rs
io
n
(8
bi
t)
64
80
12
7
re
se
rv
ed
(4
8b
it)
re
se
rv
ed
(1
6b
it)
wo
rd
1
0
32
63
hb
or
bi
t
(3
2b
it)
tr
g
or
bi
t
(3
2b
it)
64
80
12
7
re
se
rv
ed
(4
8b
it)
re
se
rv
ed
(1
6b
it)
wo
rd
2
0
16
32
63
tr
g
ty
pe
(3
2b
it)
re
s.
(4
bi
t)
hb
bc
(1
2b
it)
re
s.
(4
bi
t)
tr
g
bc
(1
2b
it)
64
80
12
7
re
se
rv
ed
(4
8b
it)
re
se
rv
ed
(1
6b
it)
wo
rd
3
0
16
32
40
56
63
re
se
rv
ed
(8
bi
t)
pa
ge
s
co
un
te
r
(1
6b
it)
st
op
bi
t
(8
bi
t)
pa
r
(1
6b
it)
de
te
ct
or
fie
ld
(1
6b
it)
Figure C.1: The four 128 bit words of the rdh version 3. The red fields need to be
filled by the ul, based on [64].
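A possible way to assemble and decode the four words in software is sketched below in Python. The bit offsets and widths are read off Figure C.1. The function names `pack_rdh` and `unpack_rdh`, the field spellings, and the little-endian byte order of each 128 bit word are assumptions made for illustration; they are not specified in this appendix.

```python
# Bit layout as read from Figure C.1: name -> (word index, bit offset, width).
RDH_LAYOUT = {
    "header_version":     (0,  0,  8),
    "header_size":        (0,  8,  8),
    "block_length":       (0, 16, 16),
    "fee_id":             (0, 32, 16),
    "priority_bit":       (0, 48,  8),
    "offset_next_packet": (0, 64, 16),
    "memory_size":        (0, 80, 16),
    "link_id":            (0, 96,  8),
    "trg_orbit":          (1,  0, 32),
    "hb_orbit":           (1, 32, 32),
    "trg_bc":             (2,  0, 12),
    "hb_bc":              (2, 16, 12),
    "trg_type":           (2, 32, 32),
    "detector_field":     (3,  0, 16),
    "par":                (3, 16, 16),
    "stop_bit":           (3, 32,  8),
    "pages_counter":      (3, 40, 16),
}

WORD_BYTES = 16  # one 128 bit word


def pack_rdh(**fields: int) -> bytes:
    """Pack the given fields into 4 x 128 bit; unset and reserved bits stay 0."""
    words = [0, 0, 0, 0]
    for name, value in fields.items():
        word, offset, width = RDH_LAYOUT[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit into {width} bit")
        words[word] |= value << offset
    # ASSUMPTION: each word is serialised least significant byte first.
    return b"".join(w.to_bytes(WORD_BYTES, "little") for w in words)


def unpack_rdh(raw: bytes) -> dict:
    """Inverse of pack_rdh: extract all named fields from 64 raw bytes."""
    if len(raw) != 4 * WORD_BYTES:
        raise ValueError("an rdh is exactly 64 bytes long")
    words = [int.from_bytes(raw[WORD_BYTES * i:WORD_BYTES * (i + 1)], "little")
             for i in range(4)]
    return {name: (words[word] >> offset) & ((1 << width) - 1)
            for name, (word, offset, width) in RDH_LAYOUT.items()}


if __name__ == "__main__":
    rdh = pack_rdh(header_version=3, header_size=64, block_length=8128,
                   fee_id=0x1234, pages_counter=2, stop_bit=1)
    print(rdh.hex())
    print(unpack_rdh(rdh))
```

Keeping the layout in a single table means the packer and the unpacker cannot drift apart, and a change in a later header version only requires editing `RDH_LAYOUT`.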

Appendix D
Bibliography
[1] D. Boyanovsky et al. “Phase transitions in the early and the present universe”. In:
Ann. Rev. Nucl. Part. Sci. 56 (2006), pp. 441–500. doi: 10.1146/annurev.nucl.56.
080805.140539. arXiv: hep-ph/0602002 [hep-ph] (cit. on p. 1).
[2] Edward V. Shuryak. “Theory of Hadronic Plasma”. In: Sov. Phys. JETP 47 (1978).
[Zh. Eksp. Teor. Fiz.74,408(1978)], pp. 212–219 (cit. on p. 1).
[3] The ALICE Collaboration. Upgrade of the ALICE Experiment: Letter of Intent.
Tech. rep. CERN-LHCC-2012-012. LHCC-I-022. ALICE-UG-002. Geneva: CERN,
Aug. 2012. url: http://cds.cern.ch/record/1475243 (cit. on pp. 1, 2, 8, 15).
[4] LS2 preparatory meeting. Jan. 2019. url: https://indico.cern.ch/event/776676/
(visited on 01/25/2019) (cit. on pp. 1, 16).
[5] The ALICE Collaboration. Upgrade of the ALICE Time Projection Chamber. Tech.
rep. CERN-LHCC-2013-020. ALICE-TDR-016. Oct. 2013. url: https://cds.cern.
ch/record/1622286 (cit. on pp. 2, 10, 12, 14, 15, 17, 18, 24, 36, 66–68, 72).
[6] Lyndon Evans and Philip Bryant. “LHC Machine”. In: Journal of Instrumentation
3.08 (2008), S08001. url: http://stacks.iop.org/1748-0221/3/i=08/a=S08001
(cit. on p. 3).
[7] Esma Mobs. “The CERN accelerator complex - August 2018. Complexe des
accélérateurs du CERN - Août 2018”. In: (Aug. 2018). General Photo. url:
http://cds.cern.ch/record/2636343 (cit. on p. 3).
[8] The accelerator complex. url: https://home.cern/science/accelerators/accelerator-complex
(visited on 01/21/2019) (cit. on p. 4).
[9] The ATLAS Collaboration. “Observation of a new particle in the search for the
Standard Model Higgs boson with the ATLAS detector at the LHC”. In: Phys. Lett.
B716 (2012), pp. 1–29. doi: 10.1016/j.physletb.2012.08.020. arXiv: 1207.7214
[hep-ex] (cit. on p. 4).
[10] The CMS Collaboration. “Observation of a new boson at a mass of 125 GeV with
the CMS experiment at the LHC”. In: Phys. Lett. B716 (2012), pp. 30–61. doi:
10.1016/j.physletb.2012.08.021. arXiv: 1207.7235 [hep-ex] (cit. on p. 4).
[11] The ATLAS Collaboration. “The ATLAS Experiment at the CERN Large Hadron
Collider”. In: Journal of Instrumentation 3.08 (2008), S08003. url: http://stacks.
iop.org/1748-0221/3/i=08/a=S08003 (cit. on p. 4).
[12] The ALICE Collaboration. “The ALICE experiment at the CERN LHC”. In: Journal
of Instrumentation 3.08 (2008), S08002. url: http://stacks.iop.org/1748-
0221/3/i=08/a=S08002 (cit. on pp. 4, 6, 7).
[13] The CMS Collaboration et al. “The CMS experiment at the CERN LHC”. In: Journal
of Instrumentation 3.08 (2008), S08004. url: http://stacks.iop.org/1748-
0221/3/i=08/a=S08004 (cit. on p. 4).
[14] The LHCb Collaboration. “The LHCb Detector at the LHC”. In: Journal of Instru-
mentation 3.08 (2008), S08005. url: http://stacks.iop.org/1748-0221/3/i=08/
a=S08005 (cit. on p. 4).
[15] The LHCf Collaboration. “The LHCf detector at the CERN Large Hadron Collider”.
In: Journal of Instrumentation 3.08 (2008), S08006. url: http://stacks.iop.org/
1748-0221/3/i=08/a=S08006 (cit. on p. 4).
[16] The TOTEM Collaboration. “The TOTEM Experiment at the CERN Large Hadron
Collider”. In: Journal of Instrumentation 3.08 (2008), S08007. url: http://stacks.
iop.org/1748-0221/3/i=08/a=S08007 (cit. on p. 4).
[17] The MoEDAL Collaboration. Technical Design Report of the MoEDAL Experiment.
Tech. rep. CERN-LHCC-2009-006. MoEDAL-TDR-001. June 2009. url: https:
//cds.cern.ch/record/1181486 (cit. on p. 4).
[18] The ALICE Collaboration. “Performance of the ALICE Experiment at the CERN
LHC”. In: (2014). doi: 10.1142/S0217751X14300440. eprint: arXiv:1402.4476
(cit. on p. 4).
[19] Arturo Tauro. 3D ALICE Schematic RUN2 - with Description. May 2011 (cit. on
p. 5).
[20] J. Alme et al. “The ALICE TPC, a large 3-dimensional tracking device with fast
readout for ultra-high multiplicity events”. In: (2010). doi: 10.1016/j.nima.2010.
04.042. eprint: arXiv:1001.1950 (cit. on pp. 6, 11, 12).
[21] The ALICE Collaboration. “The ALICE Transition Radiation Detector: construction,
operation, and performance”. In: (2017). doi: 10.1016/j.nima.2017.09.028. eprint:
arXiv:1709.02743 (cit. on p. 7).
[22] Andrea Alici. “The MRPC-based ALICE Time-Of-Flight detector: status and perfor-
mance”. In: (2012). doi: 10.1016/j.nima.2012.05.004. eprint: arXiv:1203.5976
(cit. on p. 7).
[23] The ALICE Collaboration. Addendum of the Letter of Intent for the upgrade of the
ALICE experiment : The Muon Forward Tracker. Tech. rep. CERN-LHCC-2013-014.
LHCC-I-022-ADD-1. Final submission of the present LoI addendum is scheduled for
September 7th. Geneva: CERN, Aug. 2013. url: http://cds.cern.ch/record/
1592659 (cit. on p. 8).
[24] Sabyasachi Siddhanta. “The upgrade of the Inner Tracking System of ALICE”.
In: Nuclear Physics A 931 (2014). QUARK MATTER 2014, pp. 1147–1151. issn:
0375-9474. doi: https://doi.org/10.1016/j.nuclphysa.2014.09.041. url:
http://www.sciencedirect.com/science/article/pii/S0375947414004230
(cit. on p. 8).
[25] The ALICE Collaboration. Technical Design Report for the Upgrade of the ALICE
Inner Tracking System. Tech. rep. CERN-LHCC-2013-024. ALICE-TDR-017. Nov.
2013. url: https://cds.cern.ch/record/1625842 (cit. on p. 8).
[26] The ALICE Collaboration. Upgrade of the ALICE Readout & Trigger System. Tech.
rep. CERN-LHCC-2013-019. ALICE-TDR-015. Sept. 2013. url: http://cds.cern.
ch/record/1603472 (cit. on pp. 8, 17, 26, 27).
[27] M. Tanabashi et al. “Review of Particle Physics”. In: Phys. Rev. D 98 (3 Aug. 2018),
p. 030001. doi: 10.1103/PhysRevD.98.030001. url: https://link.aps.org/doi/
10.1103/PhysRevD.98.030001 (cit. on pp. 9, 80).
[28] Hermann Kolanoski and Norbert Wermes. Teilchendetektoren. Grundlagen und
Anwendungen. ger. 1. Aufl. 2016. SpringerLink : Bücher. Berlin, Heidelberg: Springer
Spektrum, 2016. isbn: 978-3-662-45350-6. doi: 10.1007/978-3-662-45350-6. url:
http://dx.doi.org/10.1007/978-3-662-45350-6 (cit. on p. 9).
[29] The ALICE Collaboration. ALICE Technical Design Report of the Time Projection
Chamber. Technical Design Report ALICE. Geneva: CERN, 2000. url: http://cds.
cern.ch/record/451098 (cit. on pp. 11, 26, 37, 65, 66).
[30] F.V. Böhmer et al. “Simulation of space-charge effects in an ungated GEM-based
TPC”. In: Nuclear Instruments and Methods in Physics Research Section A: Acceler-
ators, Spectrometers, Detectors and Associated Equipment 719 (2013), pp. 101–108.
issn: 0168-9002. doi: https://doi.org/10.1016/j.nima.2013.04.020. url:
http://www.sciencedirect.com/science/article/pii/S0168900213004166
(cit. on pp. 13, 14).
[31] F. Sauli. “GEM: A new concept for electron amplification in gas detectors”. In:
Nuclear Instruments and Methods in Physics Research Section A: Accelerators,
Spectrometers, Detectors and Associated Equipment 386.2 (1997), pp. 531–534. issn:
0168-9002. doi: https://doi.org/10.1016/S0168-9002(96)01172-2. url:
http://www.sciencedirect.com/science/article/pii/S0168900296011722
(cit. on p. 12).
[32] TPC wiki. url: https://twiki.cern.ch/twiki/bin/view/ALICE/TPC_Installation
(visited on 12/27/2018) (cit. on pp. 16, 93).
[33] Harald Appelshaeuser et al. Readout scheme of the upgraded ALICE TPC. Nov. 2016.
url: https://cds.cern.ch/record/2231785 (cit. on pp. 17, 20, 21, 26, 49).
[34] ALICE TPC Read-Out Upgrade Wiki. url: https://espace.cern.ch/alice-
tpc-cru/_layouts/15/start.aspx#/ALICE%20TPC%20CRU%20Wiki/Mapping.aspx
(cit. on pp. 18, 56, 66, 68, 123–126).
[35] David A. Huffman. “A method for the construction of minimum-redundancy codes”.
In: Proceedings of the IRE 40.9 (1952), pp. 1098–1101 (cit. on p. 21).
[36] Khalid Sayood. Introduction to data compression. eng. 4th ed. Morgan Kaufmann
series in multimedia information and systems. Waltham, Mass.: Morgan Kaufmann,
2012, Online–Ressource (1 v. p.) isbn: 978-0-12-416000-2. url: http://proquest.
tech.safaribooksonline.de/9780124157965 (cit. on p. 21).
[37] Lawrence L. Larmore and Daniel S. Hirschberg. “A Fast Algorithm for Optimal
Length-limited Huffman Codes”. In: J. ACM 37.3 (July 1990), pp. 464–473. issn:
0004-5411. doi: 10.1145/79147.79150. url: http://doi.acm.org/10.1145/
79147.79150 (cit. on p. 23).
[38] Arild Velure and Bruno Sanches. SAMPA V3 Specification. Revision 0.2. Sept. 2018
(cit. on p. 29).
[39] P. Moreira et al. “The GBT Project”. In: Proceedings, Topical Workshop on Electronics
for Particle Physics (TWEPP09) (2009), pp. 342–346. url: https://cds.cern.ch/
record/1235836 (cit. on p. 29).
[40] K. Wyllie, P. Moreira, and J. Christiansen. GBTx Manual. Version 0.15. Oct. 2016. url:
https://espace.cern.ch/GBT-Project/GBTX/Manuals/gbtxManual.pdf (cit. on
pp. 29, 31).
[41] Albert X. Widmer and Peter A. Franaszek. “A DC-balanced, partitioned-block,
8B/10B transmission code”. In: IBM Journal of research and development 27.5
(1983), pp. 440–451 (cit. on p. 30).
[42] Alex Kluge and Pierre Vande Vyvre. The detector read-out in ALICE during Run 3
and 4. version 1.5. CERN, EP Department. June 2016. url: https://twiki.cern.
ch/twiki/pub/ALICE/CruHwFwSwDev/ALICErun34_readout.pdf (cit. on pp. 33,
36).
[43] J.P. Cachemiche et al. “The PCIe-based readout system for the LHCb experiment”.
In: Journal of Instrumentation 11.02 (2016), P02013. url: http://stacks.iop.
org/1748-0221/11/i=02/a=P02013 (cit. on p. 33).
[44] E. David et al. CRU Specification. Version 0.7. CERN. June 2016. url: https://
twiki.cern.ch/twiki/pub/ALICE/CruHwFwSwDev/CRU_Specification_v0.7.pdf
(cit. on p. 33).
[45] Intel Arria10 Device Overview. Intel. Apr. 2018. url: https://www.intel.com/
content/dam/www/programmable/us/en/pdfs/literature/hb/arria-10/a10_
overview.pdf (cit. on pp. 34, 58).
[46] Jean-Pierre Cachemiche. Photo of the CRU v1. Centre de Physique des Particules de
Marseille, Apr. 2015 (cit. on p. 35).
[47] P. Buncic et al. Technical Design Report for the Upgrade of the Online-Offline
Computing System. Tech. rep. CERN-LHCC-2015-006. ALICE-TDR-019. Apr. 2015.
url: https://cds.cern.ch/record/2011297 (cit. on pp. 36, 111).
[48] Understanding Metastability in FPGAs. Version 1.2. Altera. July 2009. url: https:
//www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/
wp/wp-01082-quartus-ii-metastability.pdf (cit. on p. 41).
[49] GitLab Repository, CRU firmware core. Feb. 2019. url: https://gitlab.cern.ch/
alice-cru/cru-fw (visited on 02/01/2019) (cit. on pp. 43–45).
[50] O. Bourrion et al. Interface between CTS-CRU and CTS-Detector Front Ends Trigger
Notes for Developers. Oct. 2018 (cit. on p. 45).
[51] Recommended Physical RAM for Intel Devices 17.0. url: http://fpgasoftware.
intel.com/requirements/17.1/ (visited on 11/02/2018) (cit. on p. 49).
[52] Intel Quartus Prime Pro Edition User Guide. Intel. Oct. 2018. url: https://www.
intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug-
qpp-compiler.pdf (cit. on p. 49).
[53] Dr. Mesut Arslandok. private communication (cit. on p. 49).
[54] Donald Ervin Knuth. The art of computer programming, Volume 3, Sorting and
searching. eng. Second edition (Online edition). Upper Saddle River, NJ: Addison-
Wesley, 1998, Online–Ressource (1 volume). isbn: 978-0-201-89685-5. url: http:
//proquest.tech.safaribooksonline.de/9780321635792 (cit. on p. 57).
[55] R. C. Bose and R. J. Nelson. “A Sorting Problem”. In: J. ACM 9.2 (Apr. 1962),
pp. 282–296. issn: 0004-5411. doi: 10.1145/321119.321126. url: http://doi.acm.
org/10.1145/321119.321126 (cit. on p. 57).
[56] Intel Arria10 Native Fixed Point DSP IP Core User Guide. Intel. Mar. 2017. url:
https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug_nfp_dsp.pdf
(cit. on pp. 61, 62).
[57] Intel FPGA Integer Arithmetic IP Cores User Guide. Intel. Nov. 2017. url: https:
//www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/
ug/ug_lpm_alt_mfug.pdf (cit. on p. 63).
[58] Thomas H. Cormen et al. Introduction to algorithms. MIT press, 2009 (cit. on p. 71).
[59] Ansgar Steland. Basiswissen Statistik. Kompaktkurs für Anwender aus Wirtschaft,
Informatik und Technik. ger. 4. Aufl. 2016. SpringerLink : Bücher. Berlin, Heidelberg:
Springer Spektrum, 2016. isbn: 978-3-662-49948-1. doi: 10.1007/978-3-662-49948-
1. url: http://dx.doi.org/10.1007/978-3-662-49948-1 (cit. on p. 80).
[60] ModelSim. url: https://www.mentor.com/products/fpga/verification-simulation/modelsim/?sfm=free_form
(visited on 01/25/2019) (cit. on p. 93).
[61] H. Engel et al. The C-RORC PCIe Card and its Application in the ALICE and
ATLAS Experiments. Tech. rep. ATL-DAQ-PROC-2014-039. AIDA-PUB-2015-022.
Geneva: CERN, Oct. 2014. url: http://cds.cern.ch/record/1958271 (cit. on
p. 109).
[62] Valgrind. url: http://www.valgrind.org (visited on 01/15/2019) (cit. on pp. 112,
113).
[63] KCachegrind. url: http://kcachegrind.sourceforge.net/html/Home.html
(visited on 01/25/2019) (cit. on p. 112).
[64] ALICE RUN 3 Raw Data Header (RDH V3). Sept. 2018. url:
https://docs.google.com/document/d/1otkSDYasqpVBDnxplBI7dWNxaZohctA-bvhyrzvtLoQ/
(visited on 10/26/2018) (cit. on pp. 127, 129).
[65] DeepL Translator. Feb. 2019. url: https://www.deepl.com/translator (visited
on 02/01/2019).
Appendix E
Acknowledgments
I would like to take the opportunity to thank my supervisor Prof. Dr. Johanna Stachel
for the support during my doctoral studies, for the great opportunity to work in such an
impressive and challenging field, and for the trust she has placed in me and my abilities.
Actually, I owe her much more. It was partly because of the internship I was able to do in
her group back in 2005 that I decided to study physics, which had a decisive influence on
my life and is the reason I still have a very special connection to the trd today.
Next I thank Dr. Jorge Mercado for his foresight. Without him, I would never have taken
the step towards digital design which opened up a whole new world for me in understanding
complex electrical systems and detector electronics. Unfortunately he did not stay in
academia long enough to see this thesis completed, but during the time he was still around
he supported me wherever he could.
Further, I would like to thank my colleagues, PD Dr. Kai Schweda for the very interesting
(and always challenging) discussions on various physical and other topics, for the sometimes
more and sometimes less useful advice also in private life and for the detailed proofreading
of this work. I thank Dr. Alexander Schmah for the extensive review of this thesis and
many conversations about the working principles of 521 different gaseous detectors. They
were really fun and also advanced my understanding of the many contributing details.
Many thanks to PD Dr. Yvonne Pachmayer for the help with one or the other reference
as well as the correction of parts of this document. I would also like to thank her for the
many, many chats during the coffee breaks throughout the last years. Thanks also to my
office neighbour Ole Schmidt for patiently listening to everything I had to get off my chest,
and for his hints and advice on all the small and big problems I was stuck with.
I would also like to thank my parents with all my heart. They were always there when
they were needed so I could concentrate on this work. And of course I would like to thank
my wife Regina. She took the burden off me wherever and whenever she could, so this
work is just as much her achievement.
But above all, I want to thank my son Erik:
Your laughter has brightened up even the hardest times
and you have shown me what is most important in life.
Thank you very much!

Declaration
I hereby declare that I have written this thesis independently and have used no sources or
aids other than those indicated.
Heidelberg, February 3rd, 2019 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
