The NA48 event-building PC farm by Wittgen, M et al.
IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 47, NO. 2, APRIL 2000
The NA48¹ Event-Building PC Farm
M. Wittgen², A. Peters², P. Marouelli², S. Luitz²†, F. Bal³, O. Boyle³, A. Gianoli³‡, A. Lacourt³, B. Panzer³, O. Vossnack³.
² Institut für Physik, Universität Mainz, D-55099 Mainz, Germany.
³ CERN, CH-1211 Geneva 23, Switzerland.
† Present address: SLAC, Stanford, CA 94309, USA.
‡ Present address: Sezione dell'INFN di Ferrara, I-44100 Ferrara, Italy.
¹ The NA48 Collaboration: Cagliari, Cambridge, CERN, Dubna, Edinburgh, Ferrara, Firenze, Mainz, Orsay, Perugia, Pisa, Saclay, Siegen, Torino, Vienna, Warsaw.
0018-9499/00$10.00 © 2000 IEEE
Abstract: …
• high event rates (up to 10 kHz) with negligible dead time
• support for a variety of detectors with very wide variation in the number of readout channels
• data rates of up to 150 MByte/s sustained over the beam burst
• level-3 filtering and remote data logging in the CERN computer center
the collaboration has designed and built a modular pipelined data flow system with 40 MHz sampling rate. The architecture combines custom-designed components with commercially available hardware for cost effectiveness and flexibility. To increase the available data bandwidth and to add filtering and monitoring capabilities, the original custom-built event builder hardware was replaced by a farm of 24 Intel Pentium II based PCs running the Linux operating system during the shutdown between the 1997 and 1998 data taking periods. During the 1998 data taking period the system was operated successfully, taking ca. 70 Terabyte of data.
I. INTRODUCTION
The NA48 Collaboration has built a detector to measure the direct CP violating parameter ε′/ε in neutral kaon decays. The experiment faces data-taking challenges such as:
• support of various sub-detectors with a very wide variation in the number of channels and amount of data
• the need to assemble all events in near-real time to ensure data consistency
Before the 1998 running period an online PC farm has … used by the experiment [1].
The system includes custom-designed components for the links between sub-detector electronics and the farm PCs as well as standard commercially available hardware for the PCs themselves, their interconnecting network and the high speed link that provides the data transfer from the experimental area to the laboratory computer center where further selection and data archival are performed.
II. REQUIREMENTS
The aim of the NA48 experiment at CERN is to measure the direct CP violation parameter ε′/ε with a precision of 2 × 10⁻⁴, by comparing the relative decay rates of long- and short-lived kaon beams into two neutral and charged pions, respectively. To minimize the systematic error the experiment records all four decays concurrently using two simultaneous and nearly collinear beams [2].
In order to collect enough statistics in a running time of three years an intense kaon beam is required which generates an average readout trigger rate of approx. 7 kHz during bursts including monitoring and calibration triggers. To leave room for future expansion, the design allows up to 10 kHz readout rate with negligible dead time. An average event size of 15 kByte thus requires a maximum data bandwidth of 150 MByte/s. But even 7 kHz trigger rate, 15 kByte event size and 120 days/year running time with a 50 percent overall efficiency of the accelerator and detector operation result in 75 …
III. DESIGN OVERVIEW
The SPS accelerator at CERN delivers protons to the NA48 targets in spills of 2.5 s length every 14.4 s. Additional calibration and guard time periods expand the effective spill length seen by the data acquisition system to 5.0 s, which leaves a 9.4 s gap between two bursts. This duty cycle allows a natural split between data taking and event building: during the burst all sub-detectors send their event fragment streams into the event builder, which accumulates them into buffers. The peak data rate from all sub-detectors during the burst can reach 150 MByte/s. The inter-burst gap is then used for event building.
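The timing and bandwidth figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the text (10 kHz design rate, 15 kByte events, 5.0 s effective spill, 14.4 s cycle). The variable names are illustrative, kByte and MByte are assumed to be decimal units, and holding the peak rate for the whole effective spill is an assumption that gives an upper bound on the buffered volume, not a figure stated by the authors.

```python
# Cross-check of the burst timing and bandwidth figures quoted in the text.
# Assumptions (not from the paper): decimal units (1 kByte = 1000 bytes) and
# the peak rate held for the whole effective spill, which gives an upper bound.

CYCLE_S = 14.4              # SPS cycle length
EFFECTIVE_SPILL_S = 5.0     # spill length seen by the data acquisition
DESIGN_RATE_HZ = 10_000     # maximum readout rate
EVENT_SIZE_BYTES = 15_000   # average event size

gap_s = CYCLE_S - EFFECTIVE_SPILL_S
peak_bandwidth = DESIGN_RATE_HZ * EVENT_SIZE_BYTES    # bytes per second
burst_volume = peak_bandwidth * EFFECTIVE_SPILL_S     # bytes buffered per burst

print(f"inter-burst gap : {gap_s:.1f} s")
print(f"peak bandwidth  : {peak_bandwidth / 1e6:.0f} MByte/s")
print(f"upper bound on buffered data per burst: {burst_volume / 1e6:.0f} MByte")
```

Under these assumptions the buffers have to hold at most 750 MByte per burst, and the farm has 9.4 s to build events before the next burst arrives.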
… exist or are under development which use a variety of technologies like Fibrechannel, OPTOBUS, MATCH, etc., for transfer speeds … and 160 MByte/s. Due to the very tight time schedule for the upgrade of the NA48 event builder and to the different need for data transfer speed (the link used to transfer the data to the old event builder had a speed of 10 MByte/s), the collaboration decided to adopt a …
From the beginning, the NA48 Data Acquisition system has been designed as a modular, scalable data flow architecture based on custom-designed and commercial components. It is organized as a series of data-push links connected to an event builder which merges event fragments from the individual sub-detectors into complete events. Each source continues to send data at maximum rate until inhibited by the destination with a signal ("XOFF"): it is the duty of the destination to guarantee that it can accept all data in transit between the issuing of the XOFF signal and the source acting thereon.
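The XOFF rule amounts to a headroom condition: the destination must raise XOFF while it still has room for everything the source will send before it reacts. The following sketch illustrates this with a toy FIFO model. The function name, the word-based units and the worst-case assumption that the FIFO is not drained while filling are all hypothetical and chosen for illustration only.

```python
def fill_until_stopped(capacity, xoff_threshold, words_in_transit):
    """Fill an undrained FIFO until the source stops; return (stored, overflowed)."""
    stored = 0
    remaining_after_xoff = None      # None until XOFF has been raised
    while True:
        if remaining_after_xoff == 0:
            return stored, False     # source has stopped, nothing was lost
        if stored == capacity:
            return stored, True      # a word arrived with no room left
        stored += 1
        if remaining_after_xoff is None:
            if stored >= xoff_threshold:
                remaining_after_xoff = words_in_transit
        else:
            remaining_after_xoff -= 1

# Enough headroom: 80 + 20 words fit into 100.
print(fill_until_stopped(capacity=100, xoff_threshold=80, words_in_transit=20))
# XOFF raised too late: 90 + 20 words do not fit into 100.
print(fill_until_stopped(capacity=100, xoff_threshold=90, words_in_transit=20))
```

In this model no data is lost as long as `capacity - xoff_threshold >= words_in_transit`.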
An important feature of the experiment is that all channels are sampled continuously, synchronized by a global clock. Typically, the sub-detectors use custom-designed VME or Fastbus components for this purpose. The trigger circuitry detects desired events and issues a trigger which results in a particular time window being selected. Each sub-detector's readout controller reads the data belonging to that time window and passes it in the form of an event fragment to the event builder.
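The idea of continuous sampling with trigger-selected time windows can be illustrated in software. The sketch below is a toy model only: in the experiment this is done in custom readout hardware. The class name, the buffer depth, the single channel and the dummy sample values are all assumptions made for illustration.

```python
from collections import deque

class SampleRing:
    """Fixed-depth buffer of the most recent samples of one channel."""

    def __init__(self, depth):
        self.samples = deque(maxlen=depth)   # oldest samples drop out automatically
        self.next_tick = 0                   # clock tick of the next sample to arrive

    def push(self, value):
        self.samples.append(value)
        self.next_tick += 1

    def window(self, first_tick, n_ticks):
        """Return the samples for ticks first_tick .. first_tick + n_ticks - 1."""
        oldest = self.next_tick - len(self.samples)
        if first_tick < oldest or first_tick + n_ticks > self.next_tick:
            raise LookupError("requested window is not in the buffer")
        start = first_tick - oldest
        return [self.samples[start + i] for i in range(n_ticks)]

ring = SampleRing(depth=8)
for tick in range(20):
    ring.push(tick * 10)                     # dummy sample value for each tick

# A trigger selecting ticks 14..16 yields one event fragment for this channel.
fragment = {"first_tick": 14, "samples": ring.window(14, 3)}
print(fragment)
```

The buffer depth bounds how late a trigger may arrive: a window that has already dropped out of the ring can no longer be read.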
At the level of the PC farm and beyond, only commercial hardware is used. This minimizes design and maintenance effort, improves reliability and allows for easy upgrading as new technology becomes available. The PC farm itself scales within the limits of the switch backplane bandwidth.
It should be noted that the data flow system does not modify data with the exception of simple reformatting and that the pathways for control and data are completely separate. Since core data flow only handles fast data transport, traditional data acquisition tasks like detector control, configuration management and error reporting are handled by additional subsystems which are loosely coupled to data flow. Each sub-detector readout system is controlled by a sub-detector computer which configures and monitors data flow hardware and embedded processors but does not participate in data flow itself.
IV. THE DT16/PCI INTERFACE
The link from the sub-detectors to the PC farm is the only part in the PC farm for which custom-made interfaces are used. During the planning of the system the collaboration had the occasion of testing the S-LINK [3], a high speed link … lower performance technology for the physical link. The choice has been the DT16-bus, an adapted version of the DT32-bus originally developed for the Eurogam Project [4].
The DT16-bus is a 16-bit wide parallel bus implemented with differential ECL signals over a 20 pair twisted-pair cable. The maximum speed is 33 MByte/s, transferring 1 word of 16 bits every 60 ns. As with the original DT32, it is possible to have multiple sources on a single DT16 bus, arbitration being handled by a simple daisy chain mechanism. This allows the read-out of multiple data sources with a single PC.
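The quoted link speed follows directly from the word width and the transfer period. The short calculation below reproduces it and, as an illustration, derives the smallest number of DT16 links that could carry the 150 MByte/s peak rate. That lower bound is an assumption-laden estimate (decimal MByte, perfectly even load, no protocol overhead), not a design figure from the authors.

```python
import math

WORD_BITS = 16                    # DT16 word width
WORD_PERIOD_NS = 60               # one word transferred every 60 ns
PEAK_RATE_BYTES_PER_S = 150e6     # peak rate from all sub-detectors

bytes_per_word = WORD_BITS // 8
link_rate = bytes_per_word * 1e9 / WORD_PERIOD_NS         # bytes per second
min_links = math.ceil(PEAK_RATE_BYTES_PER_S / link_rate)

print(f"DT16 link rate: {link_rate / 1e6:.1f} MByte/s")
print(f"links needed for the peak rate at full link speed: {min_links}")
```

The estimate gives 33.3 MByte/s per link and at least 5 links. The 11 sub-detector PCs of the 1998 setup are well above this bound, which is expected since sub-detector data volumes differ widely.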
To use DT16 to connect sub-detectors to PCs, two custom-made boards have been developed. On the PC side, a DT16-to-SLink (DT2SL [5]) board has been designed. This board is a mezzanine board which fits on an existing S-Link-to-PCI interface [6], and acts as a master on the DT bus. On the detector side a DT16-I/O board has been used in conjunction with the Data RIO modules used by all sub-detectors (DRIO modules based on CES RIO8260 [7] with custom I/O interfaces [8]).
The driver for the DT16/PCI board has been written by the collaboration based on a similar development [9]. It is a polling user-space zero-copy driver which maps the S-LINK to PCI card into memory and then sets up the DMA transfers from the card into the PC memory. This requires no interrupts, and a data rate of 117 MByte/s has been achieved with an infinite packet size, i.e. the S-LINK card was sending a test pattern of data continuously with no gaps for protocol.
… network protocols. In fact the S-Link specification does not define the physical layer of the link, but a simple FIFO-like user interface on which the use of the signals remains independent of the technology used to implement the physical link. The mapping of the S-Link signals to the protocol used on the physical link is left open to the link designer, allowing it to be mapped in the most suitable way onto the underlying link technology. Several types of link source and destination cards already …
V. THE PC FARM
All machines in the farm are industry-standard PCs equipped with Intel Pentium II processors. The operating system is Linux. To simplify the management of the farm, a standard RedHat 5.0 [10] distribution has been modified to support BOOTP/TFTP booting and diskless operation over NFS.
… laboratory (CDRPC) and 1 PC acts as boot/file server and farm controller.
The SDPCs are single processor machines (266 MHz Pentium II) with 128 MByte RAM for data buffering, a DT16/PCI interface card and a Fast (100 Mbit/s) Ethernet adapter. The EBPCs are dual-processor machines, with 192 MByte RAM, 18 GByte SCSI disks and a Fast Ethernet card. The 4 CDRPCs have 128 MByte RAM and a Fast Ethernet card as well as an FDDI adapter. The interconnection between the PCs is handled by a Catalyst 5505/Supervisor III Fast Ethernet switch made by Cisco [11] with a single 24-port 100 Mbit/s module providing a total bandwidth of 1.2 Gbit/s for event building and data recording. The switch is of a store-and-forward type, which means that incoming packets are stored in a buffer at the input and then forwarded to the required output port when the backplane is available. The technology used to connect the PCs to the switch is 100 Mbit/s Ethernet running over unshielded twisted pair cables (100baseTX). Very long connections are implemented using fiber-optic repeaters. Figure 1 shows a scheme of the connections.
Figure 1: Block diagram of the connections between the detector, the PC farm and the Central Data Recording facility. (Diagram labels: Clock Distribution; Trigger Distribution; 11 Subdetector PCs; 8 Event Building PCs; Gigabit Ethernet to Computer Centre.)
During an SPS burst the SDPCs simply receive data through the DT16/PCI interface and store them in memory. After the burst has completed, each SDPC checks the received data block for consistency and sends the number of event fragments received to the Farm Control Program (FCP). The FCP then partitions the events into M blocks, each block being assigned to an EBPC. A dynamic load balancing algorithm equalizes the data load on the EBPCs.
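The text does not describe the load balancing algorithm itself. As a purely hypothetical illustration of the partitioning step, the sketch below splits the events of a burst into M contiguous blocks whose sizes are proportional to per-EBPC weights. The function name and the idea of using weights (for example, to reflect how much capacity each EBPC currently has) are assumptions, not the authors' method.

```python
def partition_events(n_events, weights):
    """Split event indices 0..n_events-1 into contiguous blocks, one per EBPC,
    with block sizes proportional to the given positive weights."""
    total = sum(weights)
    bounds = [0]
    cumulative = 0
    for w in weights:
        cumulative += w
        bounds.append(round(n_events * cumulative / total))
    return [range(bounds[i], bounds[i + 1]) for i in range(len(weights))]

# Hypothetical burst of 7000 events shared by three EBPCs,
# the third one being given twice the share of the others.
blocks = partition_events(7000, [1, 1, 2])
print([len(b) for b in blocks])
```

Contiguous blocks keep the events of each EBPC in their original order, which matches the remark in the text that chronological order is retained.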
All SDPCs then distribute their data to the EBPCs according to the FCP's partition decision by means of sender processes. Each sender process maintains a logical connection to a receiver process on an EBPC. If there are N SDPCs and M EBPCs, this results in N×M logical connections. TCP/IP is used as protocol, since it handles all issues of flow control and ensures data integrity and completeness. Note that at this stage the event structure is invisible and there is no correspondence between IP packets and event fragments.
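Because a TCP connection delivers a byte stream with arbitrary packet boundaries, the receiving side has to recover fragment boundaries from the data itself. The sketch below shows one common way to do this, a length-prefixed header. The header layout (event number and payload length as two little-endian 32-bit integers) and all names are hypothetical, since the NA48 fragment format is not given in the text.

```python
import struct

HEADER = struct.Struct("<II")   # hypothetical header: event number, payload length

def pack_fragment(event_number, payload):
    return HEADER.pack(event_number, len(payload)) + payload

class FragmentParser:
    """Reassembles fragments from a byte stream delivered in arbitrary chunks."""

    def __init__(self):
        self.buffer = bytearray()

    def feed(self, chunk):
        """Add received bytes; return the fragments completed by this chunk."""
        self.buffer += chunk
        fragments = []
        while len(self.buffer) >= HEADER.size:
            event_number, length = HEADER.unpack_from(self.buffer, 0)
            end = HEADER.size + length
            if len(self.buffer) < end:
                break                          # payload not fully received yet
            fragments.append((event_number, bytes(self.buffer[HEADER.size:end])))
            del self.buffer[:end]
        return fragments

# Two fragments sent back to back, received in 5-byte chunks.
stream = pack_fragment(1, b"abc") + pack_fragment(2, b"defgh")
parser = FragmentParser()
received = []
for i in range(0, len(stream), 5):
    received += parser.feed(stream[i:i + 5])
print(received)
```

With the 11 SDPCs and 8 EBPCs quoted for the 1998 setup, N×M amounts to 88 such connections, each carrying its own independent stream.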
When a receiver gets data, it stores them in memory. An event builder task running in every EBPC searches through the received data blocks and pulls out a list of pointers for each event which identifies its component fragments. The pointers are stored in a pointer table in shared memory. After the event building stage, it is then possible to apply further data integrity checks or fast filter algorithms to the data stored in memory. A disk writing process then takes sets of complete pointers from the table and writes the complete events to the local hard disks. Each EBPC writes its share of the burst data. During the event building and processing the chronological order of the events is retained.
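The pointer-table approach described above can be sketched as follows: the data blocks stay where they were received, and event building only records where each fragment lies. The fragment header (event number and payload length) and all function names are hypothetical. The real implementation keeps the table in shared memory, which this single-process sketch does not attempt to model.

```python
import struct
from collections import defaultdict

HEADER = struct.Struct("<II")   # hypothetical header: event number, payload length

def make_fragment(event_number, payload):
    return HEADER.pack(event_number, len(payload)) + payload

def build_pointer_table(blocks):
    """blocks maps a source id to the bytes received from that source.
    Returns {event number: [(source id, payload offset, payload length), ...]}."""
    table = defaultdict(list)
    for source, block in blocks.items():
        offset = 0
        while offset + HEADER.size <= len(block):
            event_number, length = HEADER.unpack_from(block, offset)
            table[event_number].append((source, offset + HEADER.size, length))
            offset += HEADER.size + length
    return dict(table)

def complete_events(table, n_sources):
    """Event numbers, in ascending order, to which every source contributed."""
    return [ev for ev in sorted(table) if len(table[ev]) == n_sources]

blocks = {
    0: make_fragment(1, b"aa") + make_fragment(2, b"b"),
    1: make_fragment(1, b"cccc"),            # source 1 has no fragment for event 2
}
table = build_pointer_table(blocks)
for ev in complete_events(table, n_sources=len(blocks)):
    # Only now is the data touched: follow the pointers to assemble the event.
    event = b"".join(blocks[src][off:off + n] for src, off, n in table[ev])
    print(ev, event)
```

No fragment data is copied while the table is built; a copy happens only when a complete event is written out.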
Figure 2: Simplified diagram of event building processes and interprocess connections with 3 SDPCs and 2 EBPCs. (Diagram labels: Switched Ethernet; Data to the Computer Center.)
Central Data Recording processes then move completed disk files to the computer center. Data are simply sent back through the switch (again using TCP/IP over Fast Ethernet, FDDI and Gigabit Ethernet) to the computer center, which is ca. 7 km away from the experimental area. As soon as a burst fragment file has been transferred successfully it is deleted from the EBPC's disk buffer. An automatic retry mechanism asynchronously takes care of failed transfers.
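The transfer rule stated above (delete a file only after a successful transfer, retry failures later) can be sketched as a small queue. This is a simplified, sequential illustration and not the asynchronous mechanism used in the experiment. The function name, the `send` and `delete` callbacks, the attempt limit and the use of `OSError` to signal a failed transfer are all assumptions.

```python
from collections import deque

def drain_transfer_queue(files, send, delete, max_attempts=3):
    """Try to send every file; delete each one only after a successful send.
    Failed files go back to the end of the queue and are retried later.
    Returns the files that still failed after max_attempts tries."""
    queue = deque((name, 1) for name in files)
    failed = []
    while queue:
        name, attempt = queue.popleft()
        try:
            send(name)
        except OSError:
            if attempt < max_attempts:
                queue.append((name, attempt + 1))
            else:
                failed.append(name)
        else:
            delete(name)
    return failed

# Hypothetical transfer function that fails once for one file.
attempts = {}
deleted = []

def flaky_send(name):
    attempts[name] = attempts.get(name, 0) + 1
    if name == "burst_b" and attempts[name] == 1:
        raise OSError("link down")

failed = drain_transfer_queue(["burst_a", "burst_b"], flaky_send, deleted.append)
print(failed, deleted, attempts)
```

Putting a failed file at the back of the queue lets other transfers proceed while the failure is pending, so one bad transfer does not block the disk buffer from being emptied.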
In the computer center the burst fragment files are combined to complete bursts, processed by a software filter and fed to the online reconstruction program. Eventually, raw or filtered burst files and the results of the online reconstruction pass are written to Redwood STK tapes using the computer center's tape robot.
Figure 2 illustrates the event building software and connections in a simplified setup with 3 SDPCs and 2 EBPCs. The NA48 setup during the 1998 running period was made of 11 SDPCs and 8 EBPCs.
VI. CONCLUSION AND PERFORMANCE
During the winter 1998 shutdown period the NA48 collaboration upgraded its custom-built event builder hardware to an online PC farm to increase the available data bandwidth and to add filtering and monitoring capabilities. The new system consists of 24 commercial PCs running the Linux operating system. The PCs are interconnected with full-duplex 100 Mbit/s Ethernet using a switch. Event fragments are transferred from the front-ends into dedicated PCs using 33 MByte/s parallel ECL links interfaced to PCI. Taking advantage of the large total bandwidth of the switched Ethernet, the event fragments are then combined to complete events by a distributed parallel event building software, written out to disk buffers and sent to the computer center via a 7 km long distance Gigabit Ethernet link for further processing and storage.
After an installation and setup period of about a month the system has been continuously taking data since April 1998, and has proven to have high performance and to be stable and reliable at data rates of more than 250 MByte per beam burst, corresponding to more than 16 MByte/s (about 1 GByte/minute) continuous data load. In the 1998 running period approx. 70 Terabyte of physics events have been assembled and sent to the CERN computer center.
VII. ACKNOWLEDGEMENTS
We would like to thank all our colleagues and technical staff in the collaboration for all their efforts, help and commitment during the upgrade of the system and the data taking period. The Mainz group was supported in part by the German Federal Minister for Research and Technology (BMBF) under contract 7MZ18P(4)-TP2.
VIII. REFERENCES
[1] F. Bal et al., "The NA48 Acquisition System", IEEE Trans. Nucl. Sci. vol. 45 (1998) 1889-1893.
[2] G. D. Barr et al., CERN/SPSC/90-22/P253.
[3] O. Boyle et al., "The S-LINK Interface Specification", ECP-Division, CERN, Geneva, Switzerland, 27 March 1997, http://www.cern.ch/hsi/s-link/spec. H. C. van der Bij, "S-Link, a Data Link Interface for the LHC Era", IEEE Trans. Nucl. Sci. 44 (1997) 398-402.
[4] J. Alexander, "Eurogam Project: DT32 Bus Specification", Nuclear Physics Support Group, CCL, Daresbury Laboratories, Daresbury, Warrington, Cheshire, WA4 4AD, United Kingdom.
[5] F. Bal, A. Lacourt, "DT2SL, DT16 to S-Link PCI Interface User Manual", EP-Division internal note, CERN, Geneva, Switzerland.
[6] Incaa Computers, P.O. Box 122, 7300 AS Apeldoorn, The Netherlands.
[7] Creative Electronic Systems S.A., 70 Rte du Pont-Butin, P.O. Box 107, CH-1213 Petit-Lancy 1, Geneva, Switzerland.
[8] F. Bal, A. Lacourt, "RIO Tic, RIO Dat Interfaces", ECP-Division internal note, CERN, Geneva, Switzerland.
[9] A. Cisternino, "Generic AMCC S5933 Linux Device Driver", http://pcapel.pi.infn.it/~cister/dev/driver.html.
[10] Red Hat Software Inc., P.O. Box 13588, RTP, NC 27709, USA.
[11] Cisco Systems Inc., 170 West Tasman Dr., San Jose, CA 95134, USA.
