The Online Error Control and Handling of the ALICE Pixel Detector by Caselle, M et al.
The Online Error Control and Handling of the ALICE Pixel Detector 
M. Caselle a,b, A. Kluge a, C. Torcato De Matos a
 
a CERN, CH-1211 Geneva 23, Switzerland  
b Università Degli Studi di Bari, I-70126, Bari, Italy 
 
On behalf of the Silicon Pixel Detector Project 
 
michele.caselle@cern.ch 
 
Abstract 
The SPD forms the two innermost layers of the ALICE 
Inner Tracking System (ITS) [1]. The basic building block of 
the SPD is the half-stave, the whole SPD barrel being made of 
120 half-staves with a total number of 9.8 x 10^6 readout
channels. Each half-stave is connected via three optical links 
to the off-detector electronics made of FPGA based VME 
readout cards (Routers). The Routers and their mezzanine 
cards provide the zero-suppression, data formatting and 
multiplexing and the link to the DAQ [2] system. This paper 
presents the hardware and software tools developed to detect 
and process any errors, at the level of the Router, originating 
from either front-end electronics, trigger sequences, DAQ or 
the off-detector electronics. The on-line error handling system 
automatically transmits this information to the Detector 
Control System and to the dedicated ORACLE database for 
further analysis. 
I. INTRODUCTION 
The SPD status and performance can be affected by a 
variety of hardware malfunctions, such as perturbations or 
failures in the cooling or power supply systems, 
Single/Multiple Event Upset or Single Event Transients, 
degradation of optical connections, wrong front-end or
back-end configurations, faulty trigger and timing sequences
from the Central Trigger Processor (CTP) [3], spurious/missing
signals, DAQ optical link not ready, etc.
To detect and manage these anomalous conditions a new 
system named “error handling system” has been developed 
and fully integrated in the readout firmware and control 
software. It consists of hardware and software tools to detect 
and process errors at the level of the Router originating from 
the SPD subsystems. Errors are brought to the attention of the
operator and are displayed as alarms in the Detector Control
System user interface.
A statistical error analysis (histograms,
cross-correlations, etc.) of the different error types can be
done using the ORACLE database to evaluate the main error
sources in the SPD hardware. This will allow the SPD stability
to be monitored over its lifetime in the ALICE experiment.
The error detection system was thoroughly tested in the 
integration lab using final system components and was then 
implemented in the ALICE experiment. This paper presents 
the hardware and software tools developed in order to 
recognize and process errors in the SPD. The first operation 
experience in the experiment is also reported. 
II. OVERVIEW OF THE SILICON PIXEL DETECTOR 
The ALICE experiment at LHC is designed to investigate 
high-density strongly interacting matter in nucleus-nucleus 
interactions. In order to provide high granularity tracking 
information close to the interaction point in this high 
multiplicity environment, the two innermost layers of the
ALICE detector form the Silicon Pixel Detector (SPD).
It consists of two barrel layers, at radii of 3.9 cm and 7.6 cm
from the interaction point, built from hybrid pixel cells of
dimensions 50 µm (rΦ) x 425 µm (z) that cover a total surface
of 0.24 m^2. The
requirements in radiation hardness and the challenging 
material budget and dimensional constraints have led to 
specific technology developments and novel solutions. The 
LV power supply requirements for each half-stave are 1.85 V
@ 5.5 A for the front-end chips and 2.6 V @ 0.5 A for the
MCM; the total power dissipation of the SPD is about 1.5 kW.
The cooling system is based on an evaporative system with
C4F10. The SPD can provide a trigger input signal to the
ALICE Central Trigger Processor (CTP) using the built-in
Fast-OR functionality: in each chip, an electric pulse is
generated whenever a hit is detected in a cell.
 
Figure 1: The out-layer of SPD detector and Half-stave view  
The following section gives an overview of the ALICE 
Silicon Pixel Detector with major emphasis on the on-detector 
and off-detector electronics.
A. Half-Stave and on-detector electronics
The main components of each half-stave are two silicon
pixel sensors (ladders) glued and wire-bonded [4] to a
low-mass Al-polyimide multi-layer flex (pixel bus), which at
one end is attached to a Multi-Chip Module (MCM).
The ladder [5] is an assembly of a silicon sensor matrix of 
256 x 160 cells bump-bonded to five readout front-end chips. 
The front-end pixel chip ALICE1LHCb [6,7] is an
analog/digital mixed-signal ASIC produced in a commercial
six-metal-layer 0.25 µm CMOS process and made radiation
tolerant by the design layout. It contains 8192 cells, arranged
in 256 rows x 32 columns.
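The geometry figures quoted so far can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (256 x 32 cells per chip, five chips per ladder, two ladders per half-stave, 120 half-staves); the variable names are illustrative.

```python
# Geometry figures quoted in the text.
ROWS_PER_CHIP = 256
COLS_PER_CHIP = 32
CHIPS_PER_LADDER = 5
LADDERS_PER_HALF_STAVE = 2
HALF_STAVES = 120

cells_per_chip = ROWS_PER_CHIP * COLS_PER_CHIP                     # cells in one ALICE1LHCb chip
cols_per_ladder = COLS_PER_CHIP * CHIPS_PER_LADDER                 # columns of one sensor matrix
chips_per_half_stave = CHIPS_PER_LADDER * LADDERS_PER_HALF_STAVE   # chips read by one MCM
total_channels = cells_per_chip * chips_per_half_stave * HALF_STAVES

print(cells_per_chip, cols_per_ladder, total_channels)  # 8192 160 9830400
```

The result, 9 830 400 cells, matches the 9.8 x 10^6 readout channels quoted in the abstract, and 160 columns matches the 256 x 160 sensor matrix of one ladder.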
The MCM contains four radiation tolerant ASICs 
developed at CERN in a commercial 0.25μm CMOS process: 
the Digital Pilot [8], the Analog Pilot, the RX40 [9] and the 
GOL (Gigabit Optical Link) [10, 11]. It also contains an
optical transceiver (a custom ST-Microelectronics
development) containing two PIN diodes and a 1300 nm laser
diode. The connection between the off-detector readout
electronics and each half-stave is made via three optical fiber
links: one for the 40 MHz LHC clock, one for the serial
trigger, control and configuration signals, and one 800 Mbit/s
G-Link for the data transmission from the detector. The
half-stave block diagram is shown in figure 2.
 
Figure 2: Half-Stave block diagram 
The Digital Pilot performs the readout of the 10
ALICE1LHCb pixel chips and the formatting of the readout
data. The GOL receives the readout data from the Digital Pilot
on a 40 MHz, 16-bit bus and serializes them into an 800 Mbit/s
G-Link compatible stream. The Digital Pilot also distributes
the clock and controls all ASICs present on the half-stave
according to the commands received from the control room via
the “serial data” optical fiber. The incoming optical signals
are received by the PIN diodes in the optical package, and the
RX40 chip converts them into LVDS signals. The Analog Pilot
provides the voltage references for the ALICE1LHCb pixel
chips and monitors voltages and temperatures on the half-stave.
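The relation between the bus figures and the link rate can be made explicit with a short calculation. The three input numbers are from the text; reading the difference between line rate and payload rate as per-word encoding overhead is an inference, not a statement from the paper.

```python
# Figures quoted in the text for the Digital Pilot -> GOL -> G-Link path.
BUS_CLOCK_HZ = 40_000_000       # 40 MHz bus clock
BUS_WIDTH_BITS = 16             # 16-bit data bus
LINE_RATE_BIT_S = 800_000_000   # 800 Mbit/s serial link

payload_rate = BUS_CLOCK_HZ * BUS_WIDTH_BITS        # data bits delivered per second
bits_per_cycle = LINE_RATE_BIT_S // BUS_CLOCK_HZ    # serial bits sent per bus clock cycle
overhead_bits = bits_per_cycle - BUS_WIDTH_BITS     # bits per word not carrying bus data (inferred)

print(payload_rate, bits_per_cycle, overhead_bits)  # 640000000 20 4
```

Under this reading, each 16-bit word occupies 20 bit periods on the link, so the payload rate is 640 Mbit/s.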
B. Off-detector electronics (Router and LinkRX)
The off-detector electronics consists of 20 VME FPGA-
based processor modules (Routers), each carrying three 2-
channel link receiver (LinkRx) daughter-cards, one Detector 
Data Link (DDL) and a trigger/timing receiver chip (TTCRx) 
[12]. The main processor on the 10-layer motherboard is a
1020-pin Altera Stratix EP1S30 FPGA. A fully equipped Router
is shown in figure 4. Each FPGA-based mezzanine
Link Receiver card (LinkRX) serves two half-staves. It
receives the trigger signals and configuration patterns from
the Router and propagates them to the half-staves. The readout
chain of a LinkRX is shown in figure 3. During the readout
phase the pixel data stream from the half-staves is
de-serialized by an Agilent HDMP-1034 device [13]. The
received data are checked for format errors (described in the
next section) and stored in a buffer FIFO, then
zero-suppressed, encoded, re-formatted in the ALICE DAQ
format [14] and written to a dual-port memory.
When all data from one event are stored in the dual-port
memory, the link receiver asserts an event-ready flag to be
read by the Router processor.
 
Figure 3: Link Receiver block diagram  
The Router receives the trigger control signals from the 
ALICE Central Trigger Processor (CTP) through the on-board 
TTCrx chip and forwards the trigger commands to the pixel 
detector. The L0 signal, L1 signal, L1 message and L2 message
are decoded in the Router FPGA.
 
Figure 4: Router fully equipped with three LinkRXs, one DDL card
and one TTCRx chip.
The ALICE trigger has three levels (L0, L1 and L2),
whereas the SPD system uses the L1 and L2 triggers only. The
ALICE1LHCb pixel chips provide binary hit information,
which is stored in a delay line during the L1 decision time. In
case of a positive L1 decision the hit is stored in one of
four multi-event buffers, where the data wait for the L2
decision to be read out or discarded. After reception of a
positive L2 decision, the Router starts to check the
event-ready flag in the status register of the link receivers.
When an event-ready flag appears, the Router processor reads
the data from the link receiver dual-port memory. The link
receiver also passes to the Router processor the error flags
identified in the data stream coming from the detector, as
described in the next section. Each Router sequentially reads
one event from each of the link receiver channels in order to
merge the data coming from 6 half-staves, and labels them with
trigger and status information to build one Router sub-event.
The sub-events of each Router are sent to the ALICE DAQ
system through the ALICE Detector Data Link (DDL).
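The sub-event building step can be illustrated with a small software model. This is a sketch only: the real logic runs in the Router FPGA, and the class, field and function names here are hypothetical. It also assumes that a sub-event is built only once all six channels report an event ready, which the text does not spell out.

```python
from dataclasses import dataclass, field

CHANNELS_PER_ROUTER = 6  # three LinkRx cards x two channels, one channel per half-stave


@dataclass
class LinkRxChannel:
    """Hypothetical model of one link receiver channel."""
    channel_id: int
    event_ready: bool = False
    error_flags: int = 0
    memory: list = field(default_factory=list)  # stands in for the dual-port memory

    def read_event(self):
        """Return the stored event data and clear the event-ready flag."""
        data, self.memory = self.memory, []
        self.event_ready = False
        return data


def build_sub_event(channels, trigger_info):
    """Read one event from each channel in turn and merge them into one sub-event."""
    if not all(ch.event_ready for ch in channels):
        return None  # assumption: wait until every channel has its event ready
    status = 0
    payload = []
    for ch in channels:                 # sequential read, channel by channel
        status |= ch.error_flags        # collect the error flags as status information
        payload.append((ch.channel_id, ch.read_event()))
    return {"trigger": trigger_info, "status": status, "payload": payload}
```

A caller would create six `LinkRxChannel` objects per Router and pass the decoded trigger information (for example orbit and bunch numbers) as `trigger_info`.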
Access to the on-detector electronics for control and
configuration is performed via the Router VME interface. The
Router converts the data to JTAG-compatible commands,
which are sent to the detector through the optical links with a
maximum data rate of 5 Mbit/s.
C. Control System 
The operation of the ALICE SPD requires the on-line 
control and monitoring of a large number of parameters. This 
task is performed by the SPD Detector Control System 
(DCS). It is based on a commercial Supervisory Control And
Data Acquisition (SCADA) system named PVSS. Five PVSS
projects run independently on different working nodes: four
control, respectively, the cooling system, the power supply
(PS) system, the interlock and monitoring system and the FE
electronics; the fifth links together and monitors the four
subsystem projects. The interface between PVSS and the
VME Router racks is provided by the Front End Device (FED)
servers, custom standalone C++ applications.
III. ON-LINE ERROR CONTROL AND HANDLING 
A dedicated on-line error handling system, consisting of 
hardware and software tools, has been developed to detect and 
manage any anomalous conditions arising from possible 
malfunctions in the various SPD subsystems. Error flags and
information are reported to the operator and are displayed as
alarms in the Detector Control System user interface. In
addition, two bits in the ALICE data format Common Data
Header (CDH) [14] are used to inform the Experiment
Control System (ECS) [15] that an anomalous condition is
present so that, according to the ECS-DAQ policy, the
data taking can be stopped when a predefined number of
errors is detected.
All error conditions are divided into classes, and each class
is associated with one error level. There are three error
levels: fatal, error and warning. The fatal level is
asserted when the trigger sequence is not coherent, when the
event data taking shows inconsistencies, or when a severe
malfunction is detected in a half-stave. In this case a bit is set
in the CDH in order to notify the ECS-DAQ system. The
error level is asserted when a wrong condition is detected in a
half-stave, or in the on-detector or off-detector electronics, but
the quality of the data taking remains acceptable. The warning
level is used to inform the operator that an error condition is
likely to arise. A typical example is the temperature of
a half-stave increasing towards the threshold limit.
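The three-level scheme can be summarized as a small lookup. This is a sketch: the class names in the table are hypothetical labels chosen to mirror the examples given above, not identifiers from the SPD firmware or software.

```python
from enum import Enum


class ErrorLevel(Enum):
    WARNING = "warning"
    ERROR = "error"
    FATAL = "fatal"


# Hypothetical class names; the level assigned to each follows the description in the text.
ERROR_CLASS_LEVEL = {
    "trigger_sequence_not_coherent": ErrorLevel.FATAL,
    "half_stave_severe_malfunction": ErrorLevel.FATAL,
    "electronics_wrong_condition": ErrorLevel.ERROR,
    "temperature_rising_towards_threshold": ErrorLevel.WARNING,
}


def handle(error_class):
    """Return the actions implied by the level of the given error class."""
    level = ERROR_CLASS_LEVEL[error_class]
    return {
        "level": level.value,
        "alarm_to_operator": True,                     # every level is shown to the operator
        "set_cdh_error_bit": level is ErrorLevel.FATAL,  # only fatal errors notify ECS-DAQ
    }
```

For example, `handle("temperature_rising_towards_threshold")` reports a warning to the operator without touching the CDH.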
The error message is sent in an error block, whose
format is shown in figure 5. It consists of four 32-bit words
that contain all the information necessary to identify
both the error type and the hardware part of the
SPD that is affected. The error messages include timing
reference information, such as the bunch and orbit numbers, in
order to identify the events in which the errors have been
detected.
 
Figure 5: Error data format (error block) 
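A four-word error block of this kind can be packed and unpacked as sketched below. The field layout (bit positions, widths, and which word carries which field) is entirely hypothetical, since the actual layout is defined only in figure 5; only the overall size of four 32-bit words and the presence of error type, hardware location, orbit and bunch information are taken from the text.

```python
import struct


def pack_error_block(level, error_class, router, channel, orbit, bunch, detail=0):
    """Pack one error into four 32-bit words (hypothetical field layout)."""
    word0 = (((level & 0x3) << 30) | ((error_class & 0xFF) << 22)
             | ((router & 0x1F) << 17) | ((channel & 0x7) << 14))
    word1 = orbit & 0xFFFFFF        # orbit number (width assumed)
    word2 = bunch & 0xFFF           # bunch crossing number (width assumed)
    word3 = detail & 0xFFFFFFFF     # free-form detail word (assumed)
    return struct.pack("<4I", word0, word1, word2, word3)


def unpack_error_block(raw):
    """Inverse of pack_error_block for the same hypothetical layout."""
    word0, word1, word2, word3 = struct.unpack("<4I", raw)
    return {
        "level": (word0 >> 30) & 0x3,
        "error_class": (word0 >> 22) & 0xFF,
        "router": (word0 >> 17) & 0x1F,
        "channel": (word0 >> 14) & 0x7,
        "orbit": word1 & 0xFFFFFF,
        "bunch": word2 & 0xFFF,
        "detail": word3,
    }
```

A block packed this way is always 16 bytes long, and `unpack_error_block(pack_error_block(...))` returns the original field values as long as they fit the assumed widths.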
The new subsystem error handling architecture integrated 
in the SPD system is shown in Fig. 6. It consists of a software 
and a hardware layer.  
 
Figure 6: Error handling architecture 
All error conditions detected at different hardware levels 
are captured and identified by additional Finite State 
Machines, implemented in the LinkRx and Router FPGAs, 
that complement the off-detector data handling. The errors are 
formatted as shown in figure 5 and stored in a Single Port 
Memory (SPM) located on the Router board. The Router sets
the “new errors present” flag on the VME bus. The Front-End
Device (FED) server periodically polls the VME bus; when the
error flag is detected, all error blocks are read from the Single
Port Memory. A dedicated Single Port Memory for storing and
reading out the errors is needed to separate the error readout
logic from the main data taking process. The error blocks are
recorded in the local ORACLE database together with the
current “Run Number” and the error timestamp. The FED
propagates an error flag to PVSS to warn the operator.
Together with the full error description, the corrective action
needed to return the detector to a proper state is also sent to
the operator. Storing all errors in the database keeps the
entire SPD error log, which is fundamental for future
statistical studies.
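The polling sequence just described can be sketched in software as follows. This is an illustrative model, not the FED server code (which is written in C++): `FakeRouter` is a hypothetical stand-in for VME access, the table schema is invented, and an in-memory SQLite database replaces the ORACLE database.

```python
import sqlite3
import time


class FakeRouter:
    """Hypothetical stand-in for VME access to one Router."""

    def __init__(self, router_id):
        self.router_id = router_id
        self.new_errors_present = False
        self._spm = []  # models the Single Port Memory

    def store_error(self, block):
        self._spm.append(block)
        self.new_errors_present = True

    def read_error_blocks(self):
        blocks, self._spm = self._spm, []
        self.new_errors_present = False
        return blocks


def open_error_log():
    """Create an in-memory database with an invented error-log schema."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE error_log "
               "(run_number INTEGER, router_id INTEGER, timestamp REAL, block BLOB)")
    return db


def poll_once(routers, db, run_number, notify_operator):
    """One polling pass: read flagged Routers, log their blocks, warn the operator."""
    stored = 0
    for router in routers:
        if not router.new_errors_present:
            continue
        for block in router.read_error_blocks():
            db.execute("INSERT INTO error_log VALUES (?, ?, ?, ?)",
                       (run_number, router.router_id, time.time(), block))
            stored += 1
        notify_operator(router.router_id)  # stands in for the flag sent to PVSS
    db.commit()
    return stored
```

In a running system `poll_once` would be called periodically; here `notify_operator` is any callable, for example `print` or a list's `append` method.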
A. Software layer 
The software layer consists of a low and a high tier.
The low tier is a driver written in C++ and added to the Front
End Device (FED) server. It establishes the communication
with the hardware units (Routers) and transmits the error
information to the dedicated ORACLE database. The local
database is structured so that it can store the errors and
execute a first processing of the error data quickly. For each
error class, one action on the detector can be taken in order to
re-establish the proper SPD status. This is also handled at the
database level by means of a dedicated look-up table.
The high tier of the software layer consists of a custom
application written in PVSS, the ALICE Supervisory Control
and Data Acquisition (SCADA) system. This application
allows the operator to receive both the error message and
the error duration, since the hardware implementation is able
to evaluate whether the error condition is still present or has
disappeared. A statistical analysis of the different error
types can be done using the database.
 
Figure 7: PVSS error handling User Interface 
Figure 7 shows the graphical user interface developed
in the PVSS SCADA environment. The database queries
allow error details to be selected by run, by Router or by
error class.
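The kind of selection offered by the interface, and the per-class counts used for statistical analysis, can be sketched with a small query helper. The table and column names are hypothetical, and SQLite is used here in place of the ORACLE database.

```python
import sqlite3


def make_db(rows):
    """Build an in-memory table of (run_number, router_id, error_class) rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE errors "
               "(run_number INTEGER, router_id INTEGER, error_class TEXT)")
    db.executemany("INSERT INTO errors VALUES (?, ?, ?)", rows)
    return db


def select_errors(db, run_number=None, router_id=None, error_class=None):
    """Select errors by run, by Router, by error class, or any combination."""
    clauses, params = [], []
    for column, value in (("run_number", run_number),
                          ("router_id", router_id),
                          ("error_class", error_class)):
        if value is not None:
            clauses.append(column + " = ?")   # column names are fixed, values are bound
            params.append(value)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    query = "SELECT run_number, router_id, error_class FROM errors" + where + " ORDER BY rowid"
    return db.execute(query, params).fetchall()


def histogram_by_class(db):
    """Count the logged errors per error class."""
    rows = db.execute("SELECT error_class, COUNT(*) FROM errors GROUP BY error_class")
    return dict(rows.fetchall())
```

For instance, `select_errors(db, run_number=1)` returns all errors logged in run 1, and `histogram_by_class(db)` gives the input for a histogram of error types.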
B. Hardware layer 
The hardware tools for error detection consist of two
different stages implemented in Verilog modules that were 
added to the standard off-detector components in the LinkRX 
and Routers handling the data acquisition. All error 
information is processed at 40 MHz. 
 
Figure 8: Router FPGA firmware block diagram 
The first stage, the “Detection stage”, is used to identify the
possible error types in the SPD system, e.g. optical
connection status and data format errors, front-end and
back-end errors/status, Single Event Effects (SEE), wrong
trigger sequences or missing/spurious trigger signals. The
second stage is used to handle and transmit the error
information to the SPD Front-End Device (FED) server via
the VME bus.
The first stage consists mainly of an ad-hoc Finite State
Machine designed to capture any anomalies at the different
hardware levels. More than 3200 potential error topologies
have been identified in the full SPD. When an error condition
is found in a LinkRX module, the module passes to the Router
processor the error flags that will be processed in the second
stage. The error classes defined in the LinkRX modules for
errors coming from the pixel chips and the MCM are: idle
violation, Glink down error, Glink transmission error, Single
Event Upset (SEU), control error, control detector feedback
error and control pixel error. The anomalous conditions
coming from the LinkRX readout modules (see figure 3) are:
FIFO overflow and memory overflow. Busy violation is
asserted when a fifth L1 trigger signal has been received by the
on-detector electronics although all four multi-event buffers
were full and the corresponding busy signal (which has been
sent to the trigger) was active. Idle violation is asserted when
an L2 signal (either L2y or L2n) has been received by the
on-detector electronics although no corresponding L1 signal
has been received. Glink down error is asserted when the data
link was down during the event readout. Glink transmission
error is asserted when the Glink receiver found an error in the
transmission protocol during the readout of the corresponding
event. SEU error is asserted when a single event upset was
detected and was not recovered by the on-detector electronics.
Control error is asserted when the MCM has not recognized a
command. All control signals sent to the detector (L1, L2y,
L2n, test signal, JTAG signals) are sent back on the fast link
for error detection. Control detector feedback error is
asserted if one of the signals sent to the detector was not
received back between the previous and the current event
readout. Control pixel error is asserted when an error occurred
on the ALICE1LHCb pixel chips. FIFO overflow is
asserted when at least one of the pixel converter readout
FIFOs was full at least once during the data readout. Memory
overflow is asserted when at least one of the pixel
converter readout memories was full at least once during the
data readout. All these errors are considered “fatal”, and the
error information is sent to the DAQ together with the event
data in the DAQ header [14].
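The busy and idle violation conditions can be illustrated with a minimal software model of the four multi-event buffers. This is a simplified sketch, not the firmware logic: the class name is hypothetical, and it assumes that every L2 decision (L2y or L2n) frees one buffer immediately, ignoring the readout time of accepted events.

```python
class MultiEventBufferMonitor:
    """Tracks buffer occupancy and flags busy/idle violations (simplified model)."""

    DEPTH = 4  # number of multi-event buffers quoted in the text

    def __init__(self):
        self.pending = 0  # events stored after L1 and still waiting for their L2

    @property
    def busy(self):
        """True when all buffers are full, i.e. the busy signal would be active."""
        return self.pending == self.DEPTH

    def on_l1(self):
        """Handle an L1 trigger; a fifth L1 while busy is a violation."""
        if self.busy:
            return "busy violation"
        self.pending += 1
        return None

    def on_l2(self):
        """Handle an L2y or L2n; an L2 without a matching L1 is a violation."""
        if self.pending == 0:
            return "idle violation"
        self.pending -= 1
        return None
```

Feeding the monitor four L1 signals makes `busy` true; a fifth L1 returns `"busy violation"`, and an L2 received with no pending L1 returns `"idle violation"`.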
The error classes defined in the Router FPGA main
processor detect anomalous conditions arising from the
trigger signals (CTP), the state machines inside the Router
FPGA, wrong alignment between the half-stave reference
clock and the LHC bunch number, the data format, wrong
operation/configuration during the “Start of Run” sequence,
FastOr signals that are not coherent with the data format,
missing or noisy, half-stave temperatures close to the
operating threshold limit, and more. The trigger signals and
messages are checked, aligned and stored in the trigger FIFO
inside the Router FPGA.
Trigger errors occur when the trigger levels arrive in an
illogical order or with bad timing, or when a signal is
spurious or missing. A trigger error is considered
“fatal” and the information is sent to the DAQ system in the
DAQ header [14]. The FastOr signals generated by the pixel
chips are synchronous with the SPD reference clock. In order
to keep the FastOr signals coherent with the bunch crossing
and orbit numbers, it is important to check, during the
“Start of Run” ECS sequence, the alignment between the SPD
clock and the bunch in orbit. When this alignment is not
present, an error flag is set. This condition is classified at the
“error” level: the operator receives the associated error class
and details, but no information is sent to the DAQ system. The
FastOr setting is also checked by a special state machine that
verifies the consistency between the hits present in the pixel
matrix and the corresponding FastOr signal; this allows both
missing and noisy FastOr signals to be found. Moreover, the
half-stave temperatures are constantly monitored by the
Routers; if the threshold limit is reached, an interlock signal
is sent to the power supply. Efficient cooling is vital for this
very low mass detector: in the case of a cooling failure, the
detector temperature would increase at a rate of 1 °C/s.
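The temperature monitoring logic and the time margin implied by the 1 °C/s heating rate can be sketched as follows. The heating rate is from the text; the function names and the idea of a configurable warning margin below the interlock threshold are assumptions, and no actual SPD threshold values are implied.

```python
HEATING_RATE_C_PER_S = 1.0  # temperature rise after a cooling failure, from the text


def check_temperature(temp_c, interlock_c, warning_margin_c):
    """Classify a half-stave temperature against a (hypothetical) interlock threshold."""
    if temp_c >= interlock_c:
        return "interlock"   # threshold reached: interlock signal to the power supply
    if temp_c >= interlock_c - warning_margin_c:
        return "warning"     # approaching the threshold: warn the operator
    return "ok"


def seconds_to_interlock(temp_c, interlock_c):
    """Time left before the threshold is reached if the cooling fails now."""
    return max(0.0, (interlock_c - temp_c) / HEATING_RATE_C_PER_S)
```

With purely illustrative numbers, a half-stave 10 °C below its threshold would reach it about 10 s after a cooling failure, which is why the check runs continuously in hardware.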
The second stage, the “Handling stage”, consists of several
modules that handle the error signals coming from the first
stage. The logical operations are: ordering by priority
level, formatting into the error block shown in figure 5, and
storing in the error FIFO (see figure 8). A special architecture
has been implemented in order to process errors that arrive at
40 MHz. The error signals generated in the first stage are
collected by a module called the “Error Manager”. Usually one
error condition generates a cascade of secondary errors in
both the LinkRX and the Routers, which are also registered by
the error detection hardware units. The Error Manager is
based on priority encoder logic that takes into account both
the error priority and the order of arrival; in this way the
hardware unit is able to distinguish between the original error
and secondary effects and flags the cause of the problem. The
Error Manager also performs the error formatting. The logic
diagram of the second stage is shown in figure 9.
 
Figure 9: Error Manager logic diagram
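One possible reading of the Error Manager selection logic is sketched below. It is an interpretation, not the firmware: error flags are modelled as bits of an integer whose bit index is taken as the priority, flags are processed in order of arrival (clock cycle), a priority encoder picks the highest set bit within a cycle, and the first error recorded is labelled as the primary cause while later ones are labelled secondary.

```python
def priority_encode(flags):
    """Index of the highest set bit (highest priority), or None if no bit is set."""
    return flags.bit_length() - 1 if flags else None


def order_errors(flags_per_cycle):
    """Order error flags by arrival cycle, then by priority within a cycle.

    flags_per_cycle: iterable of (cycle, flags) pairs.
    Returns a list of (cycle, bit, role) with role "primary" or "secondary".
    """
    ordered = []
    for cycle, flags in sorted(flags_per_cycle):
        while flags:
            bit = priority_encode(flags)
            role = "primary" if not ordered else "secondary"
            ordered.append((cycle, bit, role))
            flags &= ~(1 << bit)  # clear the flag that was just handled
    return ordered


print(order_errors([(5, 0b0101), (3, 0b0010)]))
# [(3, 1, 'primary'), (5, 2, 'secondary'), (5, 0, 'secondary')]
```

In this example the flag raised at cycle 3 is reported as the original error, and the two flags raised at cycle 5 are treated as its secondary effects, with the higher-priority one listed first.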
Once the errors are stored in the FIFO, they are transferred 
to the Single Port Memory, and arbitration is used to manage 
the Single Port Memory in both write and read mode during a 
VME access. When all error blocks are stored in the memory,
the “new error present” flag is set to inform the FED server.
All operations are controlled by two dedicated Finite State
Machines.
IV. INTEGRATION AND COMMISSIONING 
The first prototype of the on-line error handling system 
described here has been intensively tested and fully qualified 
in the laboratory by emulation of the error patterns generated 
at 40MHz. The on-line error handling system has been fully 
integrated and tested in the experiment. The tests and
integration focused on compliance with the overall
ALICE system (CTP and DAQ) during both the “Start of Run”
and “End of Run” ECS sequences. Off-line statistical studies
are carried out in order to monitor the SPD stability during 
operation in the experiment. 
V. REFERENCES 
[1] ALICE Collaboration, ALICE Technical Design 
Report of the Inner Tracking System, CERN/LHCC 99-12, 
ALICE TDR 4. 
[2] http://ph-dep-aid.web.cern.ch/ph-dep-aid/ 
[3] http://epweb2.ph.bham.ac.uk/user/krivda/alice/ 
[4] M. Caselle et al., Nucl. Instrum. Methods A 518 297
(2004). Proceeding of the 9th Pisa Meeting on Advanced 
Detectors, La Biodola, Isola d’Elba, Italy May 25-31, 2003. 
[5] P. Riedler et al. “Recent test results of the ALICE 
silicon pixel detector”, Proceedings of the VERTEX 2003 
conference,   NIMA 549 (2005) 65-69. 
[6] K. Wyllie, et al., Front-end pixel chips for tracking in 
ALICE and particle identification in LHCb, Proceeding of the 
Pixel 2002 Conference, SLAC Electronic Conference 
Proceedings, Carmel, USA, September 2002. 
[7] R. Dinapoli, et al., An analog front-end in standard
0.25 µm CMOS for silicon pixel detectors in ALICE and
LHCb, CERN-2000-010, Proceedings of the Sixth Workshop
on Electronics for LHC Experiments, Krakow, Poland,
September 2000, p. 110.
[8] A. Kluge et al. “The ALICE on-detector pixel PILOT
system – OPS”, Proceedings of the Seventh Workshop on
Electronics for LHC Experiments, Stockholm, Sweden, Sept
2001, CERN/LHCC/2001-034, p.95.
[9] F. Faccio et al. “RX40 An 80Mbit/s Optical Receiver 
ASIC for the CMS digital optical link”, Reference and 
Technical Manual, CERN, October 2001.  
[10] P. Moreira, et al., A 1.25 Gbit/s Serializer for LHC 
Data and Trigger Optical links, Fifth Workshop on 
Electronics for LHC Experiments, CERN/LHCC/99-33, 29 
October 1999, p. 194.  
[11] P. Moreira et al. “G-Link and Gigabit Ethernet 
compliant Serializer for LHC Data Transmission”, NSS-MIC 
2000 , Lyon, France , 15 - 20 Oct 2000 - pages 9/6-9/9 (v.2). 
[12] J. Christiansen, A. Marchioro, P. Moreira, T. Toifl, 
TTCrx Reference Manual, A Timing, Trigger and Control 
Receiver ASIC for LHC Detectors, CERN EP/MIC 
http://ttc.web.cern.ch/TTC/ 
[13] Agilent Technologies, Agilent HDMP-1032, HDMP-
1034, http://www.semiconductor.agilent.com, 5968-5909E 
(2/00). 
[14] R. Divià, P. Jovanovic, P. Vande Vyvre, Data Format 
over the ALICE DDL, CERN/ALICE internal note,
ALICE-INT-2002-010 V 2.0.
[15] http://alice-ecs.web.cern.ch/alice-ecs/alice_title.htm 
 
