On the integration of high performance ATM-based event builders by Bizeau, C et al.
On the Integration of High Performance ATM-based Event Builders 
C. Bizeau, M. Costa, J.-P. Dufey, M. Letheren, A. Pacheco, C. Paillard 
CERN, 1211 Geneva 23, Switzerland 
D. Calvet, P. Le Dû, I. Mandjavidze 
CEA Saclay, 91191 Gif-sur-Yvette CEDEX, France 
M. Weymann, A. Wiesel 
Creative Electronic Systems, Geneva, Switzerland 
Abstract 
A first demonstrator has shown encouraging results for the 
use of Asynchronous Transfer Mode (ATM) switching net- 
works in the implementation of high performance parallel 
event builders. Our present goal is to show that the integration 
of event builders, including the implementation of the source 
and destination functions and the bandwidth adaptation to the 
switching network, can be realized with commercially availa- 
ble products and that good performance can be achieved. We 
shall review some critical issues, present results from perform- 
ance measurements and analyse the overheads. 
I. INTRODUCTION 
An event builder implementation provides high perform- 
ance, in our definition, if a) it ensures that a significant fraction 
of the available network bandwidth is effectively used for data 
transport and b) that this is achieved for the smallest possible 
event fragments. 
The overheads due to the protocol layers required to access 
the network link can be large and it is easier to reach high per- 
formance when transferring large blocks of data. This fact 
leads event builder designers to, for instance, group fragments 
from many events. Another consequence is that usually the
control of the event builder is implemented on a separate
network. In contrast, the performance of ATM networks is very
high for small data packets, as well as for larger ones. On an
STM-1 link (155 Mbit/s), single cells carrying 48 bytes of
information can be delivered every 2.7 µsec, corresponding to
a maximum frequency of 370 KHz. This includes the operations
of the ATM layer protocol (header handling, traffic shaping)
and of the AAL5 protocol (segmentation and reassembly),
which are implemented in hardware.
This capability of ATM makes it possible to envisage different
architectures for event builders: the events can be built
individually, independently of the event fragment size; the
switching network can be used to transport the event builder
control messages; and more ambitious systems can be conceived
where event building is part of a phased event selection process [1].
In order to benefit from the high performance of ATM networks,
in particular for small messages, it is necessary to optimize
and match the performance of the various components of
the event builder: the switching network itself, the network
adapters, the accretion of data packets from the front end data
buffers in the sources and the transfer of assembled events to
the analysis processors. We have investigated extensively the
performance of ATM switching networks together with the network
adapters. We are currently evaluating solutions, based on
commercially available components, for efficient data transfers
between front end modules within the sources and for data
delivery to the analysis processors.
We restrict this presentation to the discussion on how to
achieve high data flow performance. Considerations on efficient
supervising functions, such as the destination assignment
or the distribution of information from the Level 1 trigger, are
not covered. For ATM we consider STM-1 links with a
bandwidth of 155 Mbit/s at the physical layer (SONET/SDH).
The effective bandwidth is 150 Mbit/s at the ATM level and
135 Mbit/s or 16.8 Mbyte/s at the user data level.
0-7803-3534-1/97 $10.00 © 1997 IEEE
Figure 1: Data flow structure of a generic event builder.
II. GENERIC EVENT BUILDER MODEL.
PERFORMANCE REQUIREMENTS
The data flow structure of a generic event builder is presented
in Figure 1. The implementation of a complete event
building system, in its simplest form, i.e. the “push” architec- 
ture, requires that data are collected from one or several Read- 
Out Buffers (ROB) in the sources before being sent as an event 
fragment through the network via the Network Interface (NI). 
The destination, in turn, submits the full event to a processor in 
a farm for analysis. The main data flow is from the sources 
towards the destinations. Only control messages are sent in 
reverse direction, in particular when point to point flow control 
and reception acknowledgements are required (transport proto- 
col). 
In the “pull” architecture, the destination initiates the trans- 
fer of data by the sources, one of the processors in the farm 
being in charge of the selection and/or analysis of an event. The 
control messages issued by this processor to request data from 
the sources are routed by the switching network. 
A. Characteristics and performance requirements 
in the case of push architecture 
The maximum size of an event fragment F [KByte] depends 
on the available link throughput of user data T [MByte/s] and 
the trigger rate f [KHz]: 
F = k * T / f  (1) 
k is the load factor of the link (0 < k ≤ 1) which takes into 
account the fact that the performance of the network adapter 
may be limited for packets of size F, or that a limitation may be 
imposed to avoid congestion in the switching network. It is not 
safe to use a value of 1, even with very efficient network adapt- 
ers and perfect traffic in the switch. On the other hand, it is 
desirable to use as much as possible of the link bandwidth. 
The full size (N X N) of the event builder, assuming equal 
fragments on all ports, is then determined by the full event size 
E [MByte]: 
N = E / F  (2) 
The chart in Figure 2, established for T = 16.8 Mbyte/s (maximum 
user data throughput on ATM links at 155 Mbit/s), provides 
an estimate of N, knowing the event size E and the trigger 
frequency f. The maximum event fragment size (for k = 1) is 
indicated in parentheses with each frequency value. As an 
example (dotted line in the chart) events of 1 MByte at a rate of 
1 KHz and with k = 0.6 require an event builder of 100 X 100. 
Thus an event fragment size of 10 KByte is needed in order to 
efficiently load the switch. 
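The sizing rule of equations (1) and (2) can be checked numerically. The following is a minimal sketch, not part of the original work: the function names are ours, 1 MByte is taken as 1000 KByte, and rounding N up to a whole number of ports is our assumption.

```python
import math


def max_fragment_kbyte(throughput_mbyte_s: float, trigger_khz: float,
                       load: float = 1.0) -> float:
    """Equation (1): F = k * T / f. MByte/s divided by KHz gives KByte."""
    if not 0.0 < load <= 1.0:
        raise ValueError("load factor k must satisfy 0 < k <= 1")
    return load * throughput_mbyte_s / trigger_khz


def builder_ports(event_kbyte: float, fragment_kbyte: float) -> int:
    """Equation (2): N = E / F, rounded up to a whole number of ports."""
    return math.ceil(event_kbyte / fragment_kbyte)


# Example from the text: 1 MByte events at 1 KHz with k = 0.6.
T = 16.8  # user data throughput of an STM-1 link [MByte/s]
F = max_fragment_kbyte(T, trigger_khz=1.0, load=0.6)
N = builder_ports(event_kbyte=1000.0, fragment_kbyte=F)
print(f"F = {F:.2f} KByte, N = {N}")
```

With these inputs the sketch gives a fragment size of about 10 KByte and N = 100, in line with the dotted-line example of Figure 2.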
In the sources, event fragments have to be built by accretion 
of sub-fragments, the size of which depends on the distribution 
of data in the read-out buffers. Although this is completely 
dependent on the particular DAQ system, one can nevertheless 
formulate a general requirement for the bus that links n ROBs: 
its throughput, measured for the transfer of sub-fragments of 
size 1/n of the event fragment size, should be at least equal 
to the network link throughput. This is quite a stringent 
condition when the sub-fragments are just of a few hundred bytes. It 
should be noted that this problem of accretion is not specific to 
ATM and that in all cases the bus linking the front-end buffers 
should have better performance than the switching network 
link itself. 
Figure 2: Event builder size as a function of event size 
and trigger frequency. Curves are labelled 100 KHz (168 Byte), 
1 KHz (16.8 KByte) and 100 Hz (168 KByte); a second scale gives 
the aggregate user throughput [Gbit/s]. 
Full events are assembled in a destination at a rate f/N. We 
assume that each destination has a farm of Np processors. Np 
depends on the analysis time per event, t [msec]: Np ≥ t * f / N. 
The rate at which events are submitted to a processor is 
~ f / (N * Np). In the example, assuming 1 sec/event and 100% 
utilization of the CPU, Np = 10. The sub-network supporting 
the processor farm must have approximately the same bandwidth 
as the network link. 
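The farm size follows directly from the inequality on Np. The sketch below is illustrative only: the function names and the `cpu_utilization` parameter are our assumptions, and the inputs are those of the push example (f = 1 KHz, N = 100, 1 sec of analysis per event).

```python
import math


def processors_per_destination(analysis_msec: float, trigger_khz: float,
                               n_ports: int,
                               cpu_utilization: float = 1.0) -> int:
    """Smallest Np with Np >= t * f / N (msec times KHz is dimensionless)."""
    return math.ceil(analysis_msec * trigger_khz / (n_ports * cpu_utilization))


def rate_per_processor_hz(trigger_khz: float, n_ports: int, n_proc: int) -> float:
    """Approximate rate at which one processor receives events: f / (N * Np)."""
    return trigger_khz * 1000.0 / (n_ports * n_proc)


Np = processors_per_destination(analysis_msec=1000.0, trigger_khz=1.0, n_ports=100)
print(Np, rate_per_processor_hz(1.0, 100, Np))
```

Lowering `cpu_utilization` below 1 increases Np accordingly, which is why the 100% figure in the text is a lower bound on the farm size.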
B. Characteristics of an event builder with pull 
architecture and phased selection 
In the case of pull architecture and phased event selection, 
each processor receives an event to manage at a frequency 
f / (N * Np) and may, for instance, request of the order of 1 or 2 
data fragments per event (e.g. ~ 1-2 KByte or less) for the first 
phase of selection. The network interface of the destination 
collects and distributes the event data needed for the first phase 
at a frequency ~ f / N. As an example, in the ATLAS architecture C, 
the design frequency is 100 KHz and N = 256. The frequency 
at a destination is 400 Hz and, with a farm of 10 processors, 
events are assigned to a single processor at a rate of 
40 Hz. The processor will request full event building for the 
events that have passed successfully all the earlier phases, but 
this occurs at a frequency much lower than f / (N * Np). 
C. Critical points regarding performance 
The critical points where data flow performance requirements 
may be difficult to achieve are: 
- the switching network that routes the main data 
  streams and possibly interleaves control messages. 
  Depending on the traffic, it may be necessary to 
  reduce the load to avoid congestion. 
- the sources and destinations where the implementation 
  of concurrent data flows can be challenging. 
- the NIs (Network Interfaces) where software overheads 
  can reduce the efficiency of utilization of the 
  link bandwidth. 
III. PERFORMANCE OF LINK ADAPTORS 
The ATM standard includes an adaptation layer with several 
protocol options designed for the traffic characteristic of differ- 
ent applications. For transfer of data, the protocol, called 
AAL5 is defined for blocks with variable size, up to 64 KByte. 
On the transmission side the AAL5 protocol specifies that a 
trailer with a CRC is added and that the data block is seg- 
mented in fixed size ATM cells. On the receiving side reassem- 
bly of the original data block and CRC check are performed. 
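To make the segmentation cost concrete, the sketch below counts the cells needed for one AAL5 packet. It relies on two facts taken from the AAL5 standard rather than from this paper: the trailer is 8 bytes long and the packet is padded to a multiple of the 48-byte cell payload. The 2.7 µsec cell time is the STM-1 figure quoted in the introduction; the function names are ours.

```python
CELL_PAYLOAD = 48   # payload bytes carried by one ATM cell
AAL5_TRAILER = 8    # trailer bytes (UU, CPI, length, CRC-32), per the AAL5 standard
MAX_USER_BYTES = 65535


def aal5_cells(user_bytes: int) -> int:
    """Cells needed for one AAL5 packet, including trailer and padding."""
    if not 0 < user_bytes <= MAX_USER_BYTES:
        raise ValueError("AAL5 user data must be 1..65535 bytes")
    # Ceiling division: the last cell is padded so that the trailer ends it.
    return -(-(user_bytes + AAL5_TRAILER) // CELL_PAYLOAD)


def wire_time_usec(user_bytes: int, cell_time_usec: float = 2.7) -> float:
    """Time the packet occupies the link, at one cell per cell_time_usec."""
    return aal5_cells(user_bytes) * cell_time_usec


for size in (40, 41, 1024):
    print(size, aal5_cells(size), round(wire_time_usec(size), 1))
```

A 40-byte fragment fits in a single cell, while 41 bytes already require two, so very small fragments see a step-like efficiency curve.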
The ATM layer is in charge of the cells and routes them 
according to their virtual connection identifier. Many virtual 
connections can be active simultaneously in a single NI. At 
reception, cells are sorted out according to the virtual connections 
so that reassembly at AAL5 level can occur. ATM does 
not provide a transport protocol. Corrupted packets are 
detected, but retransmission is not performed as it is not 
required in every application. If needed, it has to be provided in 
the upper layer. 
Commercial chip-sets provide hardware implementations of 
the ATM and AAL5 protocol layers. The complex operations 
of routing, segmentation and reassembly are performed in a 
very efficient way with negligible overheads. In addition the 
chip-sets support standard rate policing services, one of 
which, Constant Bit Rate (CBR), allows the implementation of 
an efficient traffic shaping scheme by means of rate division. 
Larger overheads originate in the higher layers, on top of the 
ATM and AAL5 layers: the optional transport protocol layer 
and the event building protocol which provides event fragment 
identification, their assembly into events (event fragments 
arrive in unpredictable order) and determines when an event is 
completed. A short description of our implementation of the 
event protocol layer and the data structures can be found in [2]. 
In a first implementation of a Network Adaptor we have 
developed an ATM interface and studied the best performance 
that could be achieved [2]. The event building software runs 
without operating system on a MIPS R3000 at 25 MHz. When 
sending or receiving AAL5 packets, in the absence of the event 
building protocol layers, the software overhead per packet is 16 
µsec and is independent of the packet size. When the event 
building layer is added, the overhead, measured on the receiving 
side, where it is highest, is 25 µsec. Small packets can be 
received at a frequency of 40 KHz. For larger packets, the 
frequency is determined by the transfer time. 
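The two regimes described above (overhead-limited for small packets, transfer-limited for large ones) can be captured in a simple model. This is a sketch under the assumption that the software overhead overlaps with the transfer on the link, so that the slower of the two sets the rate; the function names are ours and the 16.8 MByte/s default is the STM-1 user data throughput.

```python
def packet_rate_khz(packet_bytes: int, overhead_usec: float,
                    link_mbyte_s: float = 16.8) -> float:
    """Sustainable packet frequency when overhead and transfer overlap."""
    transfer_usec = packet_bytes / link_mbyte_s  # bytes / (MByte/s) = usec
    return 1000.0 / max(overhead_usec, transfer_usec)


def crossover_bytes(overhead_usec: float, link_mbyte_s: float = 16.8) -> float:
    """Packet size above which the link, not the software, is the limit."""
    return overhead_usec * link_mbyte_s


# 25 usec per packet with the event building layer (receiving side).
for size in (64, 256, 1024, 4096):
    print(size, round(packet_rate_khz(size, 25.0), 1))
print(round(crossover_bytes(25.0)))
```

In this model a 25 µsec overhead caps the rate at 40 KHz for packets below roughly 420 bytes; above that size the link transfer time takes over.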
Figure 3 shows the performance measured on a commercial 
ATM interface from CES [3]. Based on the NICStAR chip from 
IDT [4], it is a PCI mezzanine card (PMC) on a RIO2 board 
which implements a PowerPC 604 at 100 MHz [5]. We have 
developed a “zero copy” driver under LynxOS with the aim of 
reaching the best possible performance at AAL5 level. In order 
to minimize the overheads due to the operating system, the 
driver checks asynchronously for the completion of a packet 
transfer or the arrival of a new packet instead of using 
interrupts. The overhead per packet is 10 µsec (not including the 
event building protocol). 
Figure 3: Transmission throughput and rate for the CES 
ATM 8468 adaptor with “zero-copy” driver 
under Lynx-OS, as a function of packet size (32 to 1024 bytes). 
We found that a transport protocol was not needed in an 
event builder, at least on small systems. However one cannot 
exclude that it may be needed in specific applications. We have 
developed and tested a simple transport protocol and measured 
its efficiency. It implements window based flow control and 
sends acknowledgments for all packets received. On the first 
system tested (based on MIPS R3000), we measured an 
increase of the overhead of some 50 µsec, part of which is due 
to the transmission of the acknowledgement message. 
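The sender side of a window based flow control with per-packet acknowledgements can be sketched as follows. This is a generic illustration and not the protocol that was tested: the class and method names are hypothetical, acknowledgements are assumed to arrive in order, and retransmission is left out.

```python
from collections import deque


class WindowSender:
    """Sender that keeps at most `window` packets unacknowledged."""

    def __init__(self, window: int):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.window = window
        self.next_seq = 0
        self.unacked = deque()  # sequence numbers still awaiting an ack

    def can_send(self) -> bool:
        return len(self.unacked) < self.window

    def send(self) -> int:
        """Record one packet as sent and return its sequence number."""
        if not self.can_send():
            raise RuntimeError("window full: wait for an acknowledgement")
        seq = self.next_seq
        self.unacked.append(seq)
        self.next_seq += 1
        return seq

    def acknowledge(self, seq: int) -> None:
        """Process one acknowledgement; acks are assumed to arrive in order."""
        if not self.unacked or self.unacked[0] != seq:
            raise ValueError(f"unexpected acknowledgement {seq}")
        self.unacked.popleft()


sender = WindowSender(window=2)
sender.send()
sender.send()
print(sender.can_send())   # the window is full here
sender.acknowledge(0)
print(sender.can_send())   # one slot has been freed
```

Each acknowledgement is itself a short packet travelling in the reverse direction, which is consistent with part of the measured overhead being attributed to its transmission.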
The development of optimized drivers is feasible in a few 
man-months. Presently it is unavoidable if high performance is 
required. Considering that the inefficiencies of the commercial 
drivers are well recognized and that efforts to improve their 
performance have been undertaken (see for instance [6]), it is 
reasonable to assume that faster commercial drivers will be 
available in the future. 
IV. PERFORMANCE OF ATM SWITCHING 
NETWORKS FOR EVENT BUILDING TRAFFIC 
Latest results from our event builder demonstrator setup 
have been presented in [7]. We summarize the main points. 
We have shown that event building traffic, in push architecture, 
can use a large fraction of the available aggregate 
throughput without any data loss on ATM switches with 8 
ports. As traffic shaping we use the rate division scheme 
provided by the CBR implementation in the SAR chip. Figure 4 
shows the performance measured on a demonstrator using an 
8-port switch from Bell Labs. Eight traffic generators send event 
fragments of variable size (gaussian distribution) to 8 
destinations. The aggregate throughput is 120 MByte/s, i.e. close 
to 90% of the 8 X 16.8 MByte/s available on the links, for 
packets larger than 700 bytes. An event building frequency of 
30 KHz can be achieved for event fragments up to 512 bytes. 
Figure 4: 8 X 8 event building; push data flow; gaussian 
fragment size distribution. Horizontal axis: mean event 
fragment size (0 to 2048 bytes). 
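One way to read the rate division scheme is that each source divides its link bandwidth equally among the virtual connections to the destinations, each connection being shaped at a constant bit rate. Under that reading, which is our assumption and is not spelled out in this text, the load offered to any destination link stays bounded as sketched below (function names are hypothetical).

```python
def per_connection_rate(link_rate: float, n_destinations: int,
                        load: float = 1.0) -> float:
    """CBR rate of one virtual connection when a source splits its link evenly."""
    return load * link_rate / n_destinations


def destination_load(link_rate: float, n_sources: int, n_destinations: int,
                     load: float = 1.0) -> float:
    """Aggregate rate converging on one destination link from all sources."""
    return n_sources * per_connection_rate(link_rate, n_destinations, load)


LINK = 16.8  # user data throughput per STM-1 link [MByte/s]
vc_rate = per_connection_rate(LINK, n_destinations=8, load=0.9)
print(round(vc_rate, 2), round(destination_load(LINK, 8, 8, load=0.9), 2))
```

For an N X N builder the offered load per destination link equals the load factor times the link rate, so it never exceeds the link capacity as long as the load factor is at most 1.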
Simulation studies show that the rate division traffic shaping 
might be sufficient to reach high loads on larger switches based 
on various technologies. As an illustration, Figure 5 shows the 
probability curve (tail distribution) for the buffer occupancy in 
an Alcatel type switch [8] with 256 ports. The switching elements 
are 16x16 and have a shared buffer memory of 
256 bytes. Consequently the probability to overflow a buffer is 
very small for a load of 70%. In the latest implementations 
even larger buffer sizes are provided. These good results are 
obtained under the assumption that the NIs in the sources are 
not synchronized, which is expected, each NI being an 
independent module with its own internal clock. 
Figure 5: Probability for the occupancy of the shared buffer 
memory in a switching element of a 256 X 256 
Alcatel type switch, 70% load. Horizontal axis: occupancy in bytes (x). 
Figure 6 shows the performance of a demonstrator operating 
in pull mode. In this configuration full duplex links are used. A 
destination requests event fragments by sending short mes- 
sages to the sources. The results are shown for 6 sources and 1 
destination. The values are a lower limit of the possible per- 
formance because the traffic generators used as sources cannot 
handle more than 1 request at a time. Nevertheless the results 
show a good performance for this architecture. 
Figure 6: Performance of a “pull” event builder 
(6 sources and 1 destination), as a function of the mean event 
fragment size (bytes); the maximum user throughput is marked. 
The results from the demonstrators and from simulation 
studies are encouraging. The rate division traffic shaping minimizes 
the congestion probability, provided that the sources 
have random time correlation. 
V. COMBINED TRAFFIC IN SOURCES AND 
DESTINATIONS 
So far we have discussed the performance of the event 
builder assuming that the event fragments were already available 
in the source memory and discarding events in the destination 
as soon as they were assembled. We next consider 
solutions for the complete data transfer from the read-out buff- 
ers to the source network link on one hand and from destina- 
tion link to the analysis processors on the other hand. 
A. Source modules 
We have seen that, in general, event fragments have to be 
built by accretion of smaller blocks of data located in several 
read-out buffers connected to a single source module by a local 
link. In order to achieve the best performance, the bandwidth of 
this link has to be at least of the same order as the network link 
bandwidth, and must provide good performance for data blocks 
of the size delivered by the ROBs. At present we can envisage 
VME and PCI as standard local links. PCI seems to offer the 
best performance characteristics, however its limited physical 
range restricts its use to a small number of ROBs (typically up 
to 4). VME offers a bandwidth large enough to match with 
ATM STM-1 links. Its limitation is due to large overheads for 
the initialisation of block transfers. 
We have measured the performance of a source module 
using VME to link the ROBs. As source module we have used 
a CES RTPC board [9] with the ATM 8468 interface from CES 
[3]. The ROBs were emulated by a single slave board (in our 
case a HC 8234 from CES) from which a variable number of 
data blocks were transferred into the source module memory. 
The RTPC uses a 100 MHz PowerPC 604 with a second level 
cache of 512 KByte. The VME interface is connected to the 
PCI bus of the RTPC. Block transfer between the slave memory 
and the source memory is performed by means of the 
Block Mover Accelerator hardware controller (BMA) which 
also provides for chained block transfer driven by a list of 
descriptors stored in the RTPC memory. 
We used the VME and ATM drivers provided by CES under 
LynxOS. They offer asynchronous access to VME and ATM 
(at the AAL5 level) thus allowing concurrent transfers on both 
links. The test program consisted of 2 threads, one for VME 
read-out and one for the transfer of event fragments on the 
ATM link, each one passing control to the other once it has ini- 
tiated a transfer. 
The results are shown in Figure 7. The chained block transfer 
in VME is very efficient and no significant throughput variation 
is observed whether the event fragment is composed of a single 
block or of several blocks (up to 8). The measured overhead when 
starting a chained block transfer is about 120 µsec plus 5 µsec 
for each subsequent block. It is possible to implement a simpler 
version of the VME driver if higher performance for small 
fragments is required. 
Figure 7: Throughputs for ATM, VME and simultaneous 
transfer in the CES RTPC board, as a function of the event 
fragment size (user data) [Byte]. 
The PCI option is being evaluated in a separate project that 
implements and tests a PMC-based solution for the ATLAS 
Read-out Buffers [10]. 
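The effect of the chained block transfer overhead on small fragments can be estimated with a simple model built on the measured figures (about 120 µsec to start a chain, 5 µsec for each subsequent block). The raw VME transfer rate is not given in the text, so `bus_mbyte_s` below is a hypothetical parameter and the 30 MByte/s used in the example is a placeholder, not a measurement.

```python
def chained_transfer_usec(fragment_bytes: int, n_blocks: int, bus_mbyte_s: float,
                          start_usec: float = 120.0,
                          per_block_usec: float = 5.0) -> float:
    """Time to read one event fragment made of n_blocks chained blocks."""
    if n_blocks < 1:
        raise ValueError("a fragment has at least one block")
    overhead = start_usec + per_block_usec * (n_blocks - 1)
    return overhead + fragment_bytes / bus_mbyte_s  # bytes / (MByte/s) = usec


def effective_throughput_mbyte_s(fragment_bytes: int, n_blocks: int,
                                 bus_mbyte_s: float) -> float:
    """User data throughput once the chain overhead is included."""
    return fragment_bytes / chained_transfer_usec(fragment_bytes, n_blocks, bus_mbyte_s)


BUS = 30.0  # hypothetical raw VME block transfer rate [MByte/s]
for size in (256, 1024, 4096, 16384):
    print(size, round(effective_throughput_mbyte_s(size, 1, BUS), 1),
          round(effective_throughput_mbyte_s(size, 8, BUS), 1))
```

In this model the difference between 1 and 8 blocks is only 35 µsec per fragment, which fits the observation that the number of blocks has little influence, while the 120 µsec start-up cost dominates for fragments of a few hundred bytes.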
B. Destination modules 
The problem of data transfer to the analysis processors is in 
principle not difficult: it is relatively easy to achieve high per- 
formance for the large blocks formed by complete events. 
However, an additional overhead is imposed by the fact that the 
event fragments, of variable length, arrive in unpredictable 
order and a copy operation is necessary in order to store the 
event in a contiguous buffer. 
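The assembly step in a destination can be sketched as follows. This is an illustration only: the identifiers `event_id` and `source_id` stand for whatever fields the event building protocol of [2] actually carries, and the class is hypothetical. Fragments are accepted in any order and copied into one contiguous buffer once the last one has arrived.

```python
from typing import Dict, Optional


class EventAssembler:
    """Collects fragments per event and emits the event when it is complete."""

    def __init__(self, n_sources: int):
        self.n_sources = n_sources
        # event_id -> {source_id: fragment payload}
        self.pending: Dict[int, Dict[int, bytes]] = {}

    def add_fragment(self, event_id: int, source_id: int,
                     payload: bytes) -> Optional[bytes]:
        """Store one fragment; return the full event if it is now complete."""
        fragments = self.pending.setdefault(event_id, {})
        fragments[source_id] = payload
        if len(fragments) < self.n_sources:
            return None
        del self.pending[event_id]
        # The copy into a contiguous buffer, ordered by source number.
        return b"".join(fragments[s] for s in sorted(fragments))


asm = EventAssembler(n_sources=3)
asm.add_fragment(7, 2, b"cc")
asm.add_fragment(7, 0, b"a")
asm.add_fragment(8, 1, b"x")          # fragments of another event interleave
print(asm.add_fragment(7, 1, b"bbb"))  # event 7 is complete at this point
```

The join performs the extra copy mentioned above; its cost grows with the event size and comes on top of the per-fragment protocol overhead.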
One or two Fast Ethernet ports could be sufficient to carry 
the traffic of a destination to the processors. In the example of a 
1 MByte event with 1 sec analysis time, the transmission time 
is of the order of 0.1 sec. This is compatible with the use of 
TCP/IP which is difficult to avoid for Ethernet and commercial 
UNIX workstations. We have measured a bandwidth occupa- 
tion as high as 80% on a 10-Base isolated Ethernet link with 
TCP/IP packets of 4 KBytes, under Lynx-OS. 
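The order of magnitude quoted for the transmission time can be reproduced with a short calculation. The sketch assumes a 100 Mbit/s Fast Ethernet port and, as a further assumption, carries the 80% occupation measured on the 10 Mbit/s link over to it; the rate of 10 events per second is f / N from the push example (1 KHz, N = 100). Function names are ours.

```python
def transfer_time_s(event_bytes: int, link_mbit_s: float,
                    occupancy: float = 1.0) -> float:
    """Time to send one event over a link used at the given occupancy."""
    return event_bytes * 8 / (link_mbit_s * 1e6 * occupancy)


def link_load(event_rate_hz: float, event_bytes: int, link_mbit_s: float,
              occupancy: float = 1.0) -> float:
    """Fraction of one port needed to carry the events of a destination."""
    return event_rate_hz * transfer_time_s(event_bytes, link_mbit_s, occupancy)


t = transfer_time_s(1_000_000, link_mbit_s=100.0, occupancy=0.8)
print(round(t, 3), round(link_load(10.0, 1_000_000, 100.0, occupancy=0.8), 2))
```

At 80% occupation a single port is just saturated by 10 events per second in this estimate, which is consistent with the statement that one or two Fast Ethernet ports could be sufficient.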
VI. CONCLUSION 
An important result is that the performance challenges are 
not so much in the switching network as in the network inter- 
face and in the connection with the rest of the system. It is also 
clear to us that commercial products deliver good hardware 
performance but that efforts are required to improve the soft- 
ware which is not designed for small messages on high 
throughput links. We believe that event builders with good per- 
formance can be implemented, based on currently available 
commercial components. 
VII. REFERENCES 
[1] D. Calvet et al., “A Study of Performance Issues of the 
ATLAS Event Selection System based on an ATM Switching 
Network”, in IEEE Transactions on Nuclear Science, 
vol. 43, No. 1, February 1996, pp. 90-98. 
[2] M. Costa et al., “Results from an ATM-based Event 
Builder Demonstrator”, in IEEE Transactions on Nuclear 
Science, vol. 43, No. 4, August 1996. 
[3] Creative Electronic Systems SA Geneva, ATM 8468, PCI-ATM 
Mezzanine Card, DOC 8468/pG, Version 1.0, May 
[4] IDT Inc., Santa Clara, CA, USA, IDT77201 NICStAR 
chip, User Manual Vers. 2.0, November 30, 1995. 
[5] Creative Electronic Systems SA Geneva, RIO2 8060, PowerPC 
based RISC I/O Board, Technical Manual vers. 1.0, 
DOC 8060/UM, October 1995. 
[6] T. v. Eicken et al., “U-Net: A User-Level Network Interface 
for Parallel and Distributed Computing”, in Proc. of 
the 15th ACM Symposium on Operating Systems Principles, 
Copper Mountain, Colorado, December 3-6, 1995. 
[7] M. Costa et al., “Lessons from ATM-based event builder 
demonstrators and challenges for LHC-scale systems”, in 
Proceedings of the 2nd Workshop on Electronics for LHC 
Experiments, Balatonfured, Hungary, 23-27 September, 
1996. 
[8] M. Henrion et al., “Technology, Distributed Control and 
Performance of a Multipath Self-Routing Switch”, in 
Proceedings of the XIVth International Switching Symposium, 
Yokohama, Japan, October 1992, vol. 2, pp. 2-6. 
[9] Creative Electronic Systems SA Geneva, RIO2 8067LK, 
PowerPC Single Board Computer, Technical Manual vers. 
2.0, DOC 8067LK/UM, May 1996. 
[10] O. Gachelin et al., “ROBIN: A Functional Demonstrator of 
the ATLAS Trigger/DAQ Read-out Buffer”, in Proceedings 
of the 2nd Workshop on Electronics for LHC Experiments, 
Balatonfured, Hungary, 23-27 September, 1996. 
