Abstract-The LHCb event-builder is implemented on a large Gigabit Ethernet network and uses a very lean push-protocol for single-stage read-out at a 1 MHz event injection rate. Destination assignment and dynamic load-balancing are facilitated by LHCb's Timing and Fast Control system. Assembly of fragments is done on each event-filter farm node instead of on dedicated builder units. The design of the event-builder is briefly described, followed by the implementation, the protocol and performance results. Emphasis is placed on the experience of running such a large event-building network, the problems we encountered and how we overcame them.
I. INTRODUCTION

LHCb's [1] data acquisition is characterised by a quite small event-size of 35 kB after zero-suppression and a high event rate of 1 MHz. The topography of the detector and different requirements on data-processing result in a relatively high number of data-sources (or front-end links) of approximately 300. Fragments from all read-out boards must reach one of up to 2000 servers, which are required to keep up with the selection of interesting physics events. The total amount of data makes a Local Area Network the only reasonable option for such a data acquisition. After a brief evaluation of ATM and Myrinet, Gigabit Ethernet was chosen as the link technology.
The following facts are important for the final architecture:
• LHCb settled on a common read-out board for all subdetectors, which was designed to be able to send its data via Gigabit Ethernet to the DAQ. This read-out board is called TELL1 [2].
• The physical installation of the entire system, including the event-filter farm of up to 2000 servers, is in the underground area close to the detector. This allowed the use of 1000BASE-T [3] for all connections.
• The event size per source of only about 100 bytes is only slightly above the minimal Ethernet frame-size of 64 bytes. Sending a 100 byte message over Ethernet requires 126 bytes on the wire (i.e. a 26% overhead!), and adding a minimal useful header required for event-building quickly brings this to 40 or even 50%. The message size is therefore increased by coalescing event-fragments belonging to different triggers. The read-out boards do this and are connected directly to the DAQ network.

[Figure 1. Labels: Event building; HLT farm; Event data; Timing and Fast Control Signals]

Originally it had been foreseen to use dedicated event-builder PCs to merge the fragments from the read-out boards and send them to dedicated farm-nodes for selection. The estimated data-rate for such a PC is 2.4 Gb/s full-duplex, which seemed not quite feasible with the PCs available in 2005, the originally anticipated starting date for the system. The event-builder PC was consequently removed and the event-building implemented in each node of the event-filter farm as a dedicated process.
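The overhead figures quoted above can be reproduced with a short calculation. The following is a minimal sketch that assumes the standard per-frame Ethernet costs (preamble and start-of-frame delimiter, MAC header, frame check sequence) and, as an illustration of "a minimal useful header", a 20-byte IPv4 header without options. The inter-frame gap is not counted, and the function names are illustrative only.

```python
# Per-frame bytes added on the wire (standard IEEE 802.3 values).
PREAMBLE_AND_SFD = 8   # preamble (7 bytes) + start-of-frame delimiter (1 byte)
MAC_HEADER = 14        # destination (6) + source (6) + EtherType (2)
FCS = 4                # frame check sequence
IPV4_HEADER = 20       # assumption: IPv4 header without options


def wire_bytes(payload: int, extra_headers: int = 0) -> int:
    """Bytes on the wire for one frame (inter-frame gap not counted)."""
    return payload + extra_headers + PREAMBLE_AND_SFD + MAC_HEADER + FCS


def overhead_percent(payload: int, extra_headers: int = 0) -> float:
    """Non-payload bytes as a percentage of the payload size."""
    return 100.0 * (wire_bytes(payload, extra_headers) - payload) / payload


print(wire_bytes(100))                      # 126
print(overhead_percent(100))                # 26.0
print(overhead_percent(100, IPV4_HEADER))   # 46.0
# Coalescing ten 100-byte fragments into one message shrinks the overhead:
print(overhead_percent(1000, IPV4_HEADER))
```

The last line illustrates why coalescing helps: the fixed per-frame cost is spread over ten times more payload.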
II. ARCHITECTURE OF THE LHCB EVENT-BUILDER
The data-flow in the event-builder shown in Figure 1 is as follows: the detector front-ends send their data to the read-out boards. The data-flow on this link is completely deterministic, as no zero-suppression is done on the front-ends themselves. The buffer-status of the front-ends can thus be centrally emulated to avoid congestion. This is one of the core functionalities of the Timing and Fast Control system (TFC). Thus no flow-control is needed on the individual links.
In the read-out boards the data are zero-suppressed and reformatted for transmission to the DAQ. Event-fragments from several triggers are coalesced into a single message. Since the processing-time on these boards is not deterministic, no emulation is possible and the buffers need to be protected via a dedicated throttle network, which allows individual boards to throttle the trigger.
A. Event-building Protocol
In order to keep the logic on the read-out boards simple, we have designed a very light-weight protocol on top of IPv4, which we call MEP (Multi Eventfragment Packet). A MEP is a datagram which contains up to 16 event-fragments, that is data belonging to 16 triggers. MEP is similar to UDP but has neither its own header check-sum nor the concept of ports. MEP contains three essential information items for event-building:
1) The number of event-fragments
2) The event-ID, a monotonically increasing number which identifies an event-fragment belonging to a certain trigger
3) The size of the event-fragments
Following the OSI model, no information available at the underlying IP layer is repeated. For example, the event-builder process on the farm establishes the origin of the data by checking the IP source address of the received datagram. A MEP datagram must fit into a single IPv4 packet, resulting in a maximum MEP size of 64 kB.
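To make the three information items concrete, the sketch below packs and unpacks a datagram that carries them. The field widths, field order and byte order are assumptions chosen for illustration and are not the actual MEP wire format, which is not specified here. Only the limits (at most 16 fragments, one IPv4 packet) are taken from the text.

```python
import struct

# HYPOTHETICAL layout, not the real MEP format:
#   header:   first event-ID (uint32), number of fragments (uint16),
#             total length in bytes (uint16)
#   fragment: size (uint16) followed by the fragment payload
MEP_HEADER = struct.Struct("!IHH")
FRAG_HEADER = struct.Struct("!H")
MAX_FRAGMENTS = 16
MAX_MEP_BYTES = 65535 - 20  # one IPv4 packet minus a 20-byte IPv4 header


def pack_mep(first_event_id, fragments):
    """Serialise a list of fragment payloads (bytes) into one datagram."""
    if not 1 <= len(fragments) <= MAX_FRAGMENTS:
        raise ValueError("a MEP carries between 1 and 16 fragments")
    body = b"".join(FRAG_HEADER.pack(len(f)) + f for f in fragments)
    total = MEP_HEADER.size + len(body)
    if total > MAX_MEP_BYTES:
        raise ValueError("MEP does not fit into a single IPv4 packet")
    return MEP_HEADER.pack(first_event_id, len(fragments), total) + body


def unpack_mep(data):
    """Return a list of (event_id, payload); reject truncated datagrams."""
    first_event_id, count, total = MEP_HEADER.unpack_from(data, 0)
    if total != len(data):
        raise ValueError("truncated or corrupted MEP")
    offset = MEP_HEADER.size
    fragments = []
    for i in range(count):
        (size,) = FRAG_HEADER.unpack_from(data, offset)
        offset += FRAG_HEADER.size
        payload = data[offset:offset + size]
        if len(payload) != size:
            raise ValueError("fragment extends past end of MEP")
        fragments.append((first_event_id + i, payload))
        offset += size
    return fragments
```

Note that the source of the data is deliberately absent from this layout: as described above, the receiver takes it from the IP source address.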
The destination farm-node is assigned by the TFC system. For this purpose the Readout Supervisor (RS) takes the role of a data-flow manager. The farm-nodes announce their availability to the RS by sending a credit token indicating the number of MEPs (event datagrams) they are willing to accept. After a certain number of triggers the RS broadcasts the IP address of one of the nodes to all the read-out boards. The boards then send the MEP datagrams to this node without any further delay or back-pressure.
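The credit scheme can be illustrated with a small model of the bookkeeping in the RS. This is a sketch under assumptions: the class and method names are hypothetical, credits are served in first-in, first-out order (the text does not state the selection policy), and the low-watermark check is modelled as a simple comparison.

```python
from collections import deque


class ReadoutSupervisorModel:
    """Toy model of credit-based destination assignment (illustrative only)."""

    def __init__(self, low_watermark=0):
        self._credits = deque()        # one entry per MEP a node will accept
        self.low_watermark = low_watermark

    def add_credits(self, node_ip, n_meps):
        """A farm-node announces that it can accept n_meps more MEPs."""
        self._credits.extend([node_ip] * n_meps)

    def available(self):
        return len(self._credits)

    def must_throttle(self):
        """True when too few destinations remain and the trigger must be throttled."""
        return self.available() <= self.low_watermark

    def next_destination(self):
        """Consume one credit and return the node to broadcast, or None."""
        return self._credits.popleft() if self._credits else None


rs = ReadoutSupervisorModel(low_watermark=1)
rs.add_credits("10.0.0.1", 2)   # example addresses, not real farm-nodes
rs.add_credits("10.0.0.2", 1)
first = rs.next_destination()   # "10.0.0.1"
```

A node that stops sending credits simply disappears from the queue, which is how farm-nodes protect themselves from overload in this scheme.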
B. Readout Network
It is clear from the preceding discussion that while the farm-nodes can protect themselves from overload by sending or not sending credits to the RS, the network has no such protection. It has to absorb the full in-rush of packets from all the read-out boards.
Modern Ethernet switches buffer data at the output to avoid head-of-line blocking at the ingress. In the LHCb architecture, however, the device at the output is a computer, the farm-node, which has a single Gigabit connection. So 300 devices send at the same time to the same output destination. Many Ethernet switches react by simply dropping packets, which is allowed by the Ethernet standard. The LHCb event-building protocol does not foresee any re-transmission or traffic modulation, so such a packet loss would be disastrous.
We could identify only a single device, a very large core-router, capable of sustaining this violent traffic. It is so large in fact that a single unit is sufficient to absorb the traffic from all read-out boards.
Even with 1260 Gigabit Ethernet ports it is not possible to also connect all the required farm-nodes. Since the farm-nodes do not require a full Gigabit link, edge-switches can be put between the main router and the farm-nodes. Some key figures characterising the LHCb event-builder can be found in Table I.
III. IMPLEMENTATION
The read-out boards, the DAQ network and the event-filter farm are all installed in 3 storeys of electronics barracks in the experiment cavern of the LHCb experiment, 100 m underground. Each read-out board is equipped with 4 Gigabit Ethernet links. The number of links actually connected to the DAQ network depends on the expected amount of output data of the board. The destination address of the farm-node is constructed from a constant prefix and 12 bits broadcast by the TFC system. All links from the read-out boards are connected to the core-router directly so that it can absorb the packet bursts in its large egress buffer memory. The core-router is connected to each of the 50 edge-routers via 2 link-aggregation groups (LAG) of 6 links each. Cost prevented the use of a pair of 10-Gigabit connections. The edge-routers have 48 Gigabit Ethernet ports.
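The address construction can be sketched as follows. The assumptions are that the constant prefix occupies the upper 20 bits of the IPv4 address and the 12 broadcast bits the lower ones; the prefix value used in the example is invented and the function name is illustrative.

```python
import ipaddress


def farm_node_address(prefix: str, node_bits: int) -> ipaddress.IPv4Address:
    """Combine a constant /20 prefix with the 12 bits broadcast by the TFC."""
    network = ipaddress.IPv4Network(prefix)
    if network.prefixlen != 20:
        raise ValueError("12 node bits require a 20-bit prefix")
    if not 0 <= node_bits < 2 ** 12:
        raise ValueError("node_bits must fit into 12 bits")
    return network.network_address + node_bits


# Hypothetical prefix; 0x123 selects one of up to 4096 destinations.
print(farm_node_address("10.130.0.0/20", 0x123))   # 10.130.1.35
```

Twelve bits allow 4096 distinct destinations, which comfortably covers a farm of up to 2000 nodes.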
In the configuration of the routers it was very important to provide the maximum amount of buffer memory for the output. This can be achieved by minimising the number of priority queues, because typically each queue will absorb a fixed amount of memory.
In the farm-node there is a single process which receives the events from the network and puts them into a shared memory. This process is called MEPRx. The architecture of the tasks on the farm-node is described elsewhere. Since MEP does not use ports, MEPRx opens a raw IP socket and registers itself for a special IP protocol number (0xCB). Opening raw sockets normally requires root privileges, which is inconvenient. To avoid this, a small patch is deployed on the farm-nodes as a kernel-module, which allows giving user processes the right to open raw sockets. The traffic arriving at the farm-node is still very bursty. In order not to lose packets in the kernel, in particular during IP re-assembly, we needed to tune the receive memory, several time-out parameters and also the number of descriptors for the DMA rings in the NICs. Experience has shown that this tuning has to be re-visited every time a major new operating system version is installed. Typically we have configured 8 MB of receive space and up to 4000 pending packets. With these settings we have not observed losses in the kernel itself. It is important to tune this correctly, as unfortunately many of the drops in the kernel are silent, which is perfectly allowed by the IPv4 specifications, since IP is not a reliable protocol.
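How a receiver of this kind obtains its datagrams can be sketched with the standard socket API. This is not the MEPRx code: the function names are illustrative, the block assumes Linux behaviour (a raw IPv4 socket delivers each packet including its IP header), and opening the socket needs elevated privileges unless something like the kernel-module mentioned above is in place. The requested receive buffer is also capped by the system-wide limit, which is why kernel tuning is needed in addition.

```python
import socket

MEP_PROTOCOL = 0xCB                    # IP protocol number quoted in the text
RECV_BUFFER_BYTES = 8 * 1024 * 1024    # 8 MB of receive space, as in the text


def open_mep_socket():
    """Open a raw IPv4 socket bound to the MEP protocol number.

    Raises PermissionError when the process lacks the required privileges.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, MEP_PROTOCOL)
    # The kernel may grant less than requested, depending on system limits.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, RECV_BUFFER_BYTES)
    return sock


def strip_ip_header(packet: bytes) -> bytes:
    """Remove the IPv4 header; its length is the low nibble of byte 0, in 32-bit words."""
    header_len = (packet[0] & 0x0F) * 4
    return packet[header_len:]


def receive_one(sock):
    """Block for one datagram and return (source_ip, payload)."""
    packet, (source_ip, _port) = sock.recvfrom(65535)
    return source_ip, strip_ip_header(packet)
```

The source address returned by `recvfrom` is what identifies the sending read-out board, since the protocol itself carries no source field.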
The MEPRx process performs extensive verification and syntactic checks on the data. It sends a request for new MEPs to the readout-supervisor whenever it has successfully acquired space in the shared memory to receive a complete set of MEP datagrams. Back-pressure is implemented via the readout-supervisor. When the event-data on a farm-node cannot be processed quickly enough, or accepted events cannot be sent onwards, the buffers in the node fill up and the MEPRx process will at some point fail to acquire memory and consequently not send a MEP request. When the number of available destinations reaches a low watermark, the readout-supervisor, whose main function is the distribution of trigger decisions, starts to throttle the trigger and stops the data-flow into the read-out boards.
Event-building is completed once MEP datagrams from all read-out boards have been received. At the start of a run several events are requested at once so as to avoid idle-time waiting for data. Several events can be built in parallel. This allows coping with late-arriving MEP datagrams, even though this should normally not happen in a large enough farm, provided that the distribution among farm-nodes is flat. After a defined timeout has been reached the MEP is declared incomplete and discarded. Corrupted or truncated events are also discarded.
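The assembly logic described here (parallel building, completion when all sources have reported, discarding after a timeout) can be sketched as below. The class is a hypothetical illustration and not the MEPRx implementation; the caller supplies the current time so that the behaviour is deterministic.

```python
class EventAssembler:
    """Collect one datagram per source for each event-ID (illustrative only)."""

    def __init__(self, sources, timeout_s):
        self.sources = frozenset(sources)
        self.timeout_s = timeout_s
        self.pending = {}   # event_id -> (time of first datagram, {source: payload})

    def add(self, event_id, source, payload, now):
        """Store a datagram; return all parts once the event is complete, else None."""
        if source not in self.sources:
            raise ValueError("datagram from unknown source")
        _first_seen, parts = self.pending.setdefault(event_id, (now, {}))
        parts[source] = payload
        if len(parts) == len(self.sources):
            del self.pending[event_id]
            return parts
        return None

    def expire(self, now):
        """Discard events that stayed incomplete longer than the timeout."""
        stale = [eid for eid, (t0, _parts) in self.pending.items()
                 if now - t0 > self.timeout_s]
        for eid in stale:
            del self.pending[eid]
        return stale
```

Because `pending` is keyed by event-ID, several events are built in parallel and a late datagram for one event does not block the completion of another.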
IV. PERFORMANCE
The system is designed to handle 1 000 000 events of 35 kB every second. Typically 10 events are packed into one MEP, so that the total datagram rate from 300 read-out boards at nominal running is 30 × 10^6 per second, corresponding to 35 GB/s of data. We do not have real collision data from a 1 MHz run, because the LHC does not yet provide a sufficient number of collisions. Tests at high rates have been done using the data-generator mode of the read-out boards.
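The design figures follow from simple arithmetic, reproduced in the sketch below. It assumes 1 kB = 1000 bytes and exactly 300 sources; the derived mean payload values are an illustration and not measured numbers.

```python
EVENT_RATE_HZ = 1_000_000     # event injection rate
EVENT_SIZE_BYTES = 35_000     # 35 kB per event, assuming 1 kB = 1000 bytes
EVENTS_PER_MEP = 10           # typical packing factor
READOUT_BOARDS = 300          # approximate number of sources

# Each board sends one MEP per packed group of events.
datagram_rate = EVENT_RATE_HZ // EVENTS_PER_MEP * READOUT_BOARDS
throughput_bytes_per_s = EVENT_RATE_HZ * EVENT_SIZE_BYTES

# Derived averages, for orientation only.
mean_fragment_bytes = EVENT_SIZE_BYTES / READOUT_BOARDS
mean_mep_payload_bytes = throughput_bytes_per_s / datagram_rate

print(datagram_rate)            # 30000000
print(throughput_bytes_per_s)   # 35000000000
```

The mean fragment size that falls out of these numbers is a little over 100 bytes, consistent with the per-source event size quoted in the introduction.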
There are two main aspects to the performance of the system:
• The loss-less transport of data at any rate
• The resource-usage in the various system components
The current loss-rate is an improvement over previous numbers. The reasons for packet-loss identified so far, and their solutions, are discussed in the next section.
A. Buffer and Memory Usage
As is typical for a push-architecture, buffer-sizes increase in the direction of the data-flow. Consequently the output buffer on the read-out boards is only 128 kB. The read-out boards however have a fast asynchronous method of disabling the trigger and with it the input data flow, so this buffer is well protected. Next comes the buffer in the main router. The effective buffer-size available for data is architecture specific and the vendors are usually unwilling to disclose any details. When evaluating suitable devices we tried to measure the buffer-size with a simple-minded approach based on flow-control. Experience has later taught us that this method tends to overestimate the effective available buffer space. For our specific router we see 256 MB of shared buffer for a set of ports corresponding to 3.5% of the ports. In each port-set we connect both read-out boards and farm-nodes. Ports connected to read-out boards do not use any buffer because this router uses virtual output queueing. So the shared output buffer is only used by the ports connected to farm-nodes.
In the farm-nodes there are 4 buffers involved.
1) The packet buffers in the Network Interface Card (NIC)
2) The kernel buffers (sk_buff structures), which are allocated from the slab allocator and get the data via DMA. Each of them holds data belonging to one Ethernet frame.
3) The socket buffers associated with an application (in this case the MEPRx process). This buffer holds the complete re-assembled MEP datagrams.
4) The shared-memory buffer which MEPRx uses to distribute the event-data to the actual trigger processes
Item 1 is a hardware feature and cannot be influenced much; it is however important to make sure that a large number of buffers is configured to be used. We set it to the maximum of our drivers, which is around 1000. For item 2 there is no way to directly monitor the occupancy, as it comes from a shared kernel pool. Experimentally we see no packet loss with 4000 packets available per network device. At values of 2000 packets, losses were still observed. For item 3 again there is no real monitoring possible, but empirically a size of about 6 MB has proved to be sufficient. The buffer in item 4 has to absorb the fluctuation in processing time of the trigger processes and the time it takes to receive the MEP datagrams from all read-out boards. We have conservatively chosen it such that it has space for 3 worst-case events. A worst-case event is an event in which each read-out board sends the maximum packet possible, that is 64 kB. The buffer is therefore set to 21 MB. In practice the occupancy is low, unless there is back-pressure from processes lower in the stack. In total one instance of the MEPRx process requires 79 MB of RAM, out of which 68 MB is shared memory. These numbers are for a 64-bit implementation using gcc version 4.3.
B. CPU Load
The second resource of interest is CPU usage, where applicable. The read-out boards are driven by FPGAs, so there is no such thing as a CPU. The actual data-formatting for sending over the Ethernet links is very simple and requires few FPGA resources, in particular since the Media Access Controller (MAC) is off-chip. In the network devices the CPU load is independent of the traffic flowing through them; however, monitoring the traffic can cause heavy CPU load, which in turn can lead to performance problems. It is important to take this into account when investigating problems. Fine-grained, high-rate monitoring can have undesired side-effects.
V. PROBLEMS & SOLUTIONS
Apart from trivial programming errors in the code of the event-builder or the readout-supervisor and simple configuration issues in the network devices (such as maximum transmission unit not uniformly set to 9000), the only real problem was and is packet-loss.
Despite initial doubts, using an infrastructure based entirely on unshielded twisted pair (UTP) category 6 (TIA Cat6) cabling has not caused any problems. On more than 2500 links the only hardware problems observed were due to physical damage on the rear of the patch-panels. There are no checksum errors, jabbers or other indications of problems on the link layer. In the following we present the many sources of packet loss and how they can be overcome.
A. Network Issues
1) MTU issues: Choosing the largest possible Maximum Transmission Unit (MTU) is very important for good performance in many layers of the data-flow, in particular in the receiving farm-node. It also reduces protocol overheads. Even though an MTU above 1500 bytes is not in any of the Ethernet standards, all enterprise class devices support at least 9000 bytes. We had an unpleasant surprise when we changed our protocol from being based on Layer-2 addresses (i.e. pure switching) to static routing (i.e. packet forwarding based on IP addresses instead of MAC addresses). Packet drop under load was observed, but only for packets longer than 1500 bytes. In switching mode the same device had worked perfectly at any frame-size. No work-around could be found for this limitation, and this device, which had passed all tests up to then, had to be discarded for the final system.
2) Backplane Usage in Switches: Large switches are usually of the packet switching type. They are not fully connected cross-bars. Consequently a scheduling algorithm determines which groups of ports can exchange data at any given time. The default time-slice is optimised for random traffic, such as found in a full-mesh test scenario, i.e. it is relatively long. For event-building traffic however all groups of ports have data for one specific output port, behind which is the farm-node that is the target for this event. Since buffering is done at the output, and there is a lot of input traffic, it is important that each port-group gets a time-slice rather quickly to offload its packets. A higher scheduling rate causes suboptimal back-plane usage, but the total back-plane capacity is an order of magnitude above our needs.
3) Buffer distribution in switches: Cheaper switches, such as our edge-routers, implement almost all functionality in a single ASIC. This usually comes with some limitations in the way the buffer memory can be re-organized. For instance, any device destined for professional, data-center use will support several priorities for traffic. These are implemented in hardware by queues. In most edge-routers we have encountered it is not possible to attribute all memory to a single queue, which in the case of event-building traffic means that buffer-memory is actually lost. In our aggregation switch, the best we have tested, it was possible to attribute 90% of 512 kB to a single queue.
4) Link-aggregation: As described above, 50 edge-routers sit between the farm-nodes and the core-router. For the reasons mentioned, the edge-routers use 2 link aggregation groups (LAGs) to connect to the core-router. LAGs are defined by the IEEE 802.3ad standard. This standard does not define how the links in a group are used, i.e. what the criterion is for choosing a link for a given packet. It does however require that the temporal order of packets be preserved. Normally LAGs are used for performance improvement as well as for increased redundancy. Unfortunately in our setup this has the consequence that while a farm-node is connected to the edge-router with a single Gigabit link, the edge-router receives packets destined for this farm-node over 6 links in parallel. This is a 6-to-1 overcommitment, which from a certain event-size, i.e. a certain number of read-out boards, causes packet-drops in the edge-routers, whose output memory is limited to about 450 kB. In our case this could be overcome by using a special link-selection algorithm, in which the link is chosen based on an arbitrary field in the IP header. We have programmed the read-out boards such that the least significant half-word of the event-ID, which is a strictly monotonically increasing number, is used in the IP header of outgoing MEP datagrams. Since this number is always the same for MEP datagrams belonging to the same set, and hence destined for the same farm-node, only one link out of the LAG is used for this specific event-number. In this way the 6-to-1 overcommitment is reduced to 1-to-1 and packet-loss is avoided. It should be noted that this specific LAG algorithm is quite unique.
5) Layer-2 clock: To our surprise, even in the 1-to-1 scenario described above we still observed losses in the edge-routers, albeit at a lower rate. Tests with a traffic-generator confirmed that when a very long train of packets exactly back-to-back, i.e. with minimal inter-frame gap, is sent from one input port to only one output port, some of the aggregation routers (not all!) show packet drop. After long debugging we could trace this down to an interesting feature of the IEEE 802.3ab standard. In clause 40.6.1.2.6 the transmit clock frequency is defined to be 125 MHz ± 0.01%. The receiver, which recovers the clock from the transmitter, is required in 40.6.1.3.2 to accept 125 MHz ± 0.01%. We could show that the clock used for transmission by our main router runs at 125.007 MHz, which is within the tolerance of the Ethernet standard. The edge switch receives the packets without problem; however, when transmitting itself it uses a 125 MHz clock. During a long train of packets this can lead to the loss of a byte and consequently of a packet. We solved this by increasing the inter-frame gap used by the main router.
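The size of the clock mismatch in the Layer-2 clock item can be quantified with a short calculation. The sketch assumes that 1000BASE-T carries one byte per 125 MHz symbol period (8 bits per symbol across the four wire pairs), so that a difference in clock frequency translates directly into a difference in bytes per second during a back-to-back train. The variable names are illustrative.

```python
NOMINAL_HZ = 125_000_000       # nominal 1000BASE-T symbol clock
TOLERANCE_PPM = 100            # ±0.01 % expressed in parts per million
ROUTER_TX_HZ = 125_007_000     # measured transmit clock of the main router
EDGE_TX_HZ = 125_000_000       # transmit clock of the edge switch


def offset_ppm(frequency_hz: int, nominal_hz: int = NOMINAL_HZ) -> float:
    """Deviation from the nominal clock in parts per million."""
    return (frequency_hz - nominal_hz) * 1_000_000 / nominal_hz


# Assumption: one byte per symbol period, so Hz differences are bytes/s.
surplus_bytes_per_s = ROUTER_TX_HZ - EDGE_TX_HZ
bytes_received_per_surplus_byte = ROUTER_TX_HZ / surplus_bytes_per_s

print(offset_ppm(ROUTER_TX_HZ))   # 56.0
print(surplus_bytes_per_s)        # 7000
```

Under these assumptions the edge switch falls behind by one byte for roughly every 18 kB received back-to-back, which is why a slightly larger inter-frame gap on the sender is enough to restore the balance.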
B. Losses in the receiver farm-node: kernel and driver parameters
It has been mentioned that kernel parameters such as the IP fragment re-assembly time need to be set to large values to cope with bursty traffic. This tuning has to be re-visited for every major OS release, otherwise packet loss results. In the device driver the interrupt coalescing has to be set such that a maximum number of packets is transferred in one go: since latency is not important in our application, minimising the interrupt rate helps overall performance. This needs to be counter-balanced by generous buffer parameters. Most of the parameters have to be set at least one order of magnitude larger than the defaults. Even minor updates can have undesired side-effects. After an upgrade of the driver for some of our network cards, we observed frequent packet-loss. It turned out that this was due to a condition that is rare in principle but was hit frequently in our setup because of the long bursts of packets which arrive in a very short time. Currently we are using a RHEL 5.4 kernel (2.6.18 series) with Ethernet drivers from Broadcom (tg3) and Intel (e1000). Many important driver parameters are of course hardware specific. The tuning of most of these parameters relies on a good understanding of the hardware and software and on intuition, because very little detailed monitoring (for example of the number of currently used DMA descriptors) is available.
VI. CONCLUSION
The LHCb event-builder embodies an almost ideal push-architecture. It is implemented as a large Gigabit Ethernet switching network where the sources are custom-built read-out boards and the receivers are standard PC-servers. The network is a two-stage one, with one large core-router attached directly to the read-out boards and edge-routers connecting to the PCs. An existing synchronous system, the Timing and Fast Control system, is used to distribute destination addresses and implement a simple credit-based load-balancing. The protocol knows no re-transmission, and there are neither dedicated event-builder units nor an intermediate stage between the network and the read-out boards. The system is thus very lean, but very dependent on the devices downstream of the read-out boards being able to support the extremely bursty traffic pattern created by the synchronicity of the sending. The loss of any one of the 308 datagrams belonging to an event leads to the entire event being discarded.
Stress tests using traffic generators at event-rates of up to 500 kHz have shown that it is indeed feasible to build such a system with existing hardware. Numerous sources of packet-loss had to be overcome. The event-loss rate is now very low, < 10^-8. The main strengths of the system are its extremely simple protocol, which is easy to implement in hardware, and its economy in terms of components. Firstly, the number of different kinds of hardware is small: there are essentially only 5 different device-types. Secondly, the system is economical in the overall number of devices, since there is no protocol adaptation layer: the read-out boards already send in the final DAQ format and the final event-building is done on each receiver PC.
The main weakness is the dependence on the performance of the core-router, which has to absorb the full in-rush and a 300-to-1 overcommitment. Very few currently available devices, if not only a single one, can achieve this. They are very expensive and not as perfect as originally hoped for. Experience strongly suggests using two devices in parallel, even though in terms of available bandwidth alone a single router is by far sufficient. For the near future a second core-router will be a most reasonable consolidation.
The question of whether this system can meet the requirements of a planned 40 MHz DAQ for the upgraded LHCb experiment will be the subject of an exciting R&D programme.
